I’m going to deploy Cylc 8.1.4 on a Linux server and I would like to ask a few basic questions before proceeding.
The first question is about the minimum and recommended hardware/software requirements (OS, CPU, RAM and so on). Could you please help me identify them?
The second question is about the global.cylc file.
I would like to set the workflow run directory location to a path different from the default $HOME/cylc-run.
I read that I should use the option mentioned below, but I need to do it without using a static or shared path.
In other words, since the same Linux server will host multiple users belonging to different groups, I would like to force the workflow run directory location to something like /work/groupX/userY.
Do you think that can be done? And if so, how can I reach the target configuration?
Firstly, the latest release is 8.2.1 - you should start with that.
OS - any modern Linux system will do.
CPU and RAM - it depends on the size and number of your workflows. If you have many users, you should consider using a small pool of Cylc VMs. Then the scheduler processes themselves will not impact the performance of the interactive nodes, and you can easily scale up (add more VMs) if you need to. Cylc can automatically manage placement of schedulers on such a pool; it is transparent to users. At my site, we typically have 70-80 workflows running on 3 small VMs.
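As a rough sketch of the scheduler pool idea, the site global.cylc can list the hosts that schedulers may be started on under [scheduler][[run hosts]]. The host names, the scratch directory, and the file name below are hypothetical placeholders; the snippet only writes an example fragment to a scratch file for review and does not touch any real configuration.

```shell
# Write an example fragment to a scratch file; nothing is installed.
pool_dir="$(mktemp -d)"
cat > "$pool_dir/run-hosts.cylc" <<'EOF'
[scheduler]
    [[run hosts]]
        # Hypothetical pool of small VMs that schedulers may be started on.
        available = cylc-vm1, cylc-vm2, cylc-vm3
EOF
# Show the fragment so it can be reviewed before merging it by hand.
cat "$pool_dir/run-hosts.cylc"
```

Once a fragment like this has been merged into the site global.cylc, running cylc config should show whether it parses.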
Yes, that is done with the symlink dirs config. The standard $HOME/cylc-run location makes all workflow output easily accessible from a known location, but it can be automatically symlinked to other disk areas.
Yes, you can do this.
Does userY only belong to groupX (for the purpose of workflow run directories, at least) or does each user have to run workflows in multiple groups?
One group per user is easy: symlinking config can be defined in the central (as opposed to user) global.cylc based on an environment variable ($GROUP say) that users export in their login scripts.
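A minimal sketch of the one-group-per-user case, assuming the /work/groupX/userY layout from the question. The scratch directory and file name are placeholders, and the snippet only writes an example fragment for review; it would still need to be merged into the site global.cylc by hand.

```shell
# Write an example fragment to a scratch file; nothing is installed.
symlink_dir="$(mktemp -d)"
# The quoted 'EOF' stops the shell from expanding $GROUP and $USER here,
# so they reach the config file literally and are expanded by Cylc instead.
cat > "$symlink_dir/symlink-dirs.cylc" <<'EOF'
[install]
    [[symlink dirs]]
        [[[localhost]]]
            # ~/cylc-run/<workflow>/<run> becomes a symlink to
            # /work/$GROUP/$USER/cylc-run/<workflow>/<run>
            run = /work/$GROUP/$USER
EOF
cat "$symlink_dir/symlink-dirs.cylc"
```

Users still find their workflows under $HOME/cylc-run; only the data behind the symlink lives on the group area.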
If it needs to be set dynamically according to which workflow the user is running, that’s trickier but still possible (we do it at my site). Can you let me know if that’s the case, before we go down that route?
Many thanks for such a detailed and clear reply.
I really appreciate it.
You can find below the answer to your question.
Yes, a generic user (let’s say “userY”) belongs to one primary group, “groupX” (the group identifies the research division the user belongs to). The user also belongs to other secondary groups, but I don’t think that is relevant in my case, since each user should run workflows within a single group scope.
(To see if you’ve got global config syntax right, just run the cylc config command, which by default parses the site and user global config files).
As a user I just need to set $GROUP in my environment.
$ cylc install demo
INSTALLED demo/run1 from /home/oliverh/cylc-src/demo
$ ls -o ~/cylc-run/demo
drwxr-xr-x 2 oliverh 4096 Aug 28 19:29 _cylc-install
lrwxrwxrwx 1 oliverh 43 Aug 28 19:29 run1 -> /tmp/work/groupX/oliverh/cylc-run/demo/run1
lrwxrwxrwx 1 oliverh 4 Aug 28 19:29 runN -> run1
With a single group per user, you could avoid requiring users to export their own GROUP value if it were set automatically in their default environment, or if the site global.cylc used Jinja2 code (and potentially a custom Python module) to determine the group programmatically from the value of $USER.
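For the "set automatically in their default environment" option, one possible sketch is a line in a site-wide login script that derives GROUP from the user's primary Unix group. This assumes the primary group is the one that should appear in the run directory path, as described earlier in the thread; where the line lives (for example a profile script sourced at login) is a site choice.

```shell
# Set GROUP to the name of the user's primary Unix group.
# "id -gn" prints the effective (primary) group name.
GROUP="$(id -gn)"
export GROUP
echo "GROUP=$GROUP"
```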
If you have job platforms that don’t see the same filesystem as localhost, define symlinking for the associated install targets too.
Thanks to your explanation I was able to solve all my issues.
Now, if I may, I would like to ask how I can restart an already executed suite/flow from the beginning.
I read that:
"If a workflow gets stopped or killed it can be restarted, but to avoid corrupting an existing run directory it cannot be started again from scratch (unless you delete certain files from the run directory). To start from scratch again, install a new copy of the workflow to a new run directory."
Do I really need to install a new copy every time I need to relaunch the suite/flow (both from the command line and from the GUI)?
If I’m not wrong, with the old version 7.8.4 I can restart the suite from the beginning without problems and without installing it again.
Thank you very much MetRonnie, I really appreciate your help.
If possible, I have one last question:
We created a dedicated virtual machine hosting Cylc.
From this VM, we are able to submit a suite/flow against a remote HPC cluster login node.
The login node and the VM don’t share any file system.
I think this is why the job is submitted and appears to run, but in the end fails with the following error (the path mentioned doesn’t exist on the login node):
/users_home/user06/.lsbatch/1693392122.166197.shell: line 56: /users_home/user06/cylc-run/test/run4/.service/etc/job.sh: No such file or directory
/users_home/user06/.lsbatch/1693392122.166197.shell: line 57: cylc__job__main: command not found
Is there a way to fix this issue without sharing the file system?
Please note that Cylc is installed only on the VM
Many thanks, you helped me solve my issue again.
Now I have one last problem to solve.
The workflow is correctly sent from the Cylc virtual machine to the HPC cluster login node.
The job is correctly submitted by the login node (and it is executed on the compute node).
The job terminates but with the following error:
CylcError: Cannot determine whether workflow is running on (IP address of CYLC VM).
Cylc scan shows that the flow is still running, but the job is actually ended.
Is there something we can do to fix this behaviour?
Last question, I promise: do we need to install the Cylc packages on the HPC cluster login node as well, or only on the Cylc virtual machine?
Which log file did you see that error message in? Can you post a few lines of context (before & after) from the log file?
P.S. for clarity can you make sure you are using the correct terminology (the terms “flow”, “workflow” and “job” mean different things) - see the glossary. I assume you mean “Cylc scan shows that the workflow is still running, but the workflow is actually ended”?
Could you please show me a “standard” global.cylc file with the most common settings used in a standard Cylc deployment? I know for sure that I will add the second login node to the above configuration. Is there anything else I should add to make Cylc work the way it does for other users?
I’m not sure there is a “typical” global config, it’s very site dependent and many of the settings are only needed in adverse circumstances (such as, network settings don’t allow TCP comms from job hosts so you have to tell Cylc to use polling to track job status).
The command cylc config --defaults will show all possible settings.
Note that many global settings, such as those for event handlers, can be configured by users rather than centrally. (A user global config adds to or overrides central settings, and affects all of the user’s workflows).
For central config, the most important thing is job platform definitions. If job platforms aren’t on the same filesystem you should probably configure automatic job log retrieval.
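To make that concrete, here is a hedged sketch of a job platform definition for a cluster that does not share a filesystem with the scheduler host. The platform name (hpc) and host names (login1, login2) are hypothetical, and LSF is assumed only because the earlier error message referred to .lsbatch. As above, the snippet just writes an example fragment to a scratch file for review.

```shell
# Write an example fragment to a scratch file; nothing is installed.
platform_dir="$(mktemp -d)"
cat > "$platform_dir/platforms.cylc" <<'EOF'
[platforms]
    # Hypothetical platform name and login node host names.
    [[hpc]]
        hosts = login1, login2
        job runner = lsf
        # Separate filesystem from localhost, so it gets its own install target.
        install target = hpc
        # Copy job logs back to the scheduler host when jobs finish.
        retrieve job logs = True
        # Only if TCP comms from the job hosts are blocked:
        # communication method = poll
EOF
cat "$platform_dir/platforms.cylc"
```

A workflow task would then select this platform by name in its [runtime] section, and cylc config can be used to check that the merged global config parses.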
(Feel free to continue to ask questions.) I guess you must have figured this out, but installation requirements are described here: