An RSE friend demoed this for me today: https://cluster-in-the-cloud.readthedocs.io/en/latest/index.html - it quickly and easily spins up a heterogeneous, elastically scaling Slurm cluster in the cloud, provisioned by Ansible and with slick monitoring by Grafana. It uses Terraform, so the same recipe should work for all the main cloud providers (and can be extended to support others).
We already support Slurm, so I imagine it should(?) be pretty straightforward to do this for complete end-to-end Cylc workflows. Elastic scaling is the key, because resource requirements typically vary wildly as our workflows progress. Slurm knows the full cluster resource available, but spins up (and takes down) nodes on demand for each individual job that it executes, so you pay only for the compute you use (not for the resource maximum times the execution time for the entire workflow).
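To put rough numbers on that cost argument, here is a back-of-envelope sketch. Everything in it is made up for illustration: the three-stage workflow profile, the flat per-node-hour price, and the function names are assumptions, not figures from Cluster in the Cloud or any provider. It also ignores node boot time and billing granularity, so treat it as an idealised comparison only.

```python
# Hypothetical workflow profile: (nodes in use, hours) for each stage.
# These numbers are invented purely to illustrate the argument.
STAGES = [(2, 1.0), (64, 0.5), (4, 2.0)]

# Assumed flat price per node-hour (arbitrary units).
PRICE_PER_NODE_HOUR = 1.0


def static_cost(stages, price):
    """Cost of a fixed-size cluster: peak node count held for the whole run."""
    peak_nodes = max(nodes for nodes, _ in stages)
    total_hours = sum(hours for _, hours in stages)
    return peak_nodes * total_hours * price


def elastic_cost(stages, price):
    """Idealised elastic cost: pay only for the node-hours each stage uses."""
    return sum(nodes * hours for nodes, hours in stages) * price


if __name__ == "__main__":
    print("static :", static_cost(STAGES, PRICE_PER_NODE_HOUR))
    print("elastic:", elastic_cost(STAGES, PRICE_PER_NODE_HOUR))
```

With this invented profile the fixed-size cluster costs 64 × 3.5 = 224 node-hours, against 2 + 32 + 8 = 42 node-hours for the elastic one. The gap grows with how "spiky" the workflow is, which is why I think it matters for us.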
Worth a look?
P.S. Presumably PBS has a similar “elastic scaling” capability too, but I haven’t checked yet…