Cylc 7: Unexpected disagreement between global.rc settings and Cylc environment variables

Hi. A user of our Cylc 7-based software suite has had some tasks fail where files produced by earlier tasks weren’t being found, despite being manually verified as present. I did some debugging, and discovered the following going on:

  1. For reasons having to do with the architecture of their local platforms, the user “myuser” cannot use the default location for the Cylc run directory of /home/myuser/cylc-run/. Instead, in that user’s /home/myuser/.cylc/global.rc file, an alternate location is specified as

[hosts]
    [[localhost]]
        work directory = /foo/cylc-run
        run directory = /foo/cylc-run

This setting causes /foo/cylc-run to be created, with the running suite’s full run directory below that. For reasons I don’t understand, the directory /home/myuser/cylc-run/ is also created in the user’s home directory, with a directory for the running suite below it, containing only a part of what’s expected in a run directory.

  2. A task in the suite dumps several relevant Cylc environment variables; the task is written in Python, and it obtains these values via a call like os.environ['CYLC_SUITE_RUN_DIR']. Those variables have surprising (to me) values which don’t completely match the user’s setting in global.rc and are inconsistent with one another. I see:

CYLC_SUITE_RUN_DIR = /home/myuser/cylc-run/my_suite_run_name
CYLC_SUITE_SHARE_DIR = /foo/cylc-run/run_name/share
CYLC_SUITE_WORK_DIR = /foo/cylc-run/run_name/work
CYLC_TASK_LOG_DIR = /home/myuser/cylc-run/run_name/log/job/20230603T0000Z/some_task/01
CYLC_TASK_LOG_ROOT = /home/myuser/cylc-run/run_name/log/job/20230603T0000Z/some_task/01/job
CYLC_TASK_WORK_DIR = /foo/cylc-run/run_name/work/20230603T0000Z/some_task

Note that CYLC_SUITE_SHARE_DIR, CYLC_SUITE_WORK_DIR, and CYLC_TASK_WORK_DIR picked up on the change set in global.rc; but CYLC_SUITE_RUN_DIR, CYLC_TASK_LOG_DIR, and CYLC_TASK_LOG_ROOT did not, and are still using the default.
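
For anyone wanting to reproduce this check, here is a minimal sketch of such a dump task. It is not the actual task from our suite: the function names dump_cylc_vars and outside_run_dir are made up for illustration, and the only Cylc-specific things it assumes are the six variable names listed above.

```python
import os

# The six variables reported above; nothing else about Cylc is assumed.
CYLC_VARS = [
    "CYLC_SUITE_RUN_DIR",
    "CYLC_SUITE_SHARE_DIR",
    "CYLC_SUITE_WORK_DIR",
    "CYLC_TASK_LOG_DIR",
    "CYLC_TASK_LOG_ROOT",
    "CYLC_TASK_WORK_DIR",
]


def dump_cylc_vars(environ=None):
    """Return {name: value}, with None for any variable that is not set."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in CYLC_VARS}


def outside_run_dir(values):
    """List the variables whose path is not below CYLC_SUITE_RUN_DIR."""
    run_dir = values.get("CYLC_SUITE_RUN_DIR")
    if not run_dir:
        return []
    prefix = run_dir.rstrip("/") + "/"
    return [
        name
        for name, path in values.items()
        if name != "CYLC_SUITE_RUN_DIR"
        and path
        and not (path + "/").startswith(prefix)
    ]


if __name__ == "__main__":
    values = dump_cylc_vars()
    for name, path in values.items():
        print("{} = {}".format(name, path))
    for name in outside_run_dir(values):
        print("WARNING: {} is not below CYLC_SUITE_RUN_DIR".format(name))
```

Run inside a task, this prints each variable and flags any that do not sit under the reported run directory.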

This breaks my suite, because tasks in the suite expect to be able to use CYLC_SUITE_RUN_DIR as a top-of-tree to find files in the share/ and log/ subtrees; but the value of CYLC_SUITE_RUN_DIR does not point to the actual run directory of the suite.
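
One possible workaround on the task-script side, sketched here as an assumption rather than a confirmed fix, is to stop deriving share/ from CYLC_SUITE_RUN_DIR and to prefer CYLC_SUITE_SHARE_DIR, which did pick up the global.rc setting. The helper name suite_share_dir is hypothetical.

```python
import os


def suite_share_dir(environ=None):
    """Locate the suite share/ directory.

    Prefers CYLC_SUITE_SHARE_DIR, which followed the global.rc setting in
    the output above, and only falls back to <CYLC_SUITE_RUN_DIR>/share.
    """
    environ = os.environ if environ is None else environ
    share = environ.get("CYLC_SUITE_SHARE_DIR")
    if share:
        return share
    # Fallback: the old top-of-tree convention (assumed, may be wrong here).
    return os.path.join(environ["CYLC_SUITE_RUN_DIR"], "share")
```

This does nothing for the log/ subtree, which in the output above really does live under the /home/myuser location.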

I’m hoping this is something I’m doing wrong and not a Cylc 7 issue fixed in Cylc 8, because unfortunately we will not be permitted to migrate to Cylc 8 any time soon.

What might I be doing wrong here?

Thanks!

I’ve not used Cylc 7 much, but FWIW -

  • Is the workflow running locally? If global.rc:[suite servers]run hosts is set to other servers then you may find that setting the work/run directories for localhost won’t do anything. If those servers share a file system this might produce the effect that you are seeing.
  • Are the items in /home/myuser/cylc-run symlinks?
  • What value does the variable CYLC_SUITE_HOST have?

I’m not sure what the problem is (we never use these settings because they are not compatible with Rose). I wonder if they work correctly for localhost.

Can you explain why the default location can’t be used?
Could they just symlink /home/myuser/cylc-run to /foo/cylc-run?

Hi folks. Thanks for your replies!

Re: symlinking /home/myuser/cylc-run to /foo/cylc-run, that’s actually what I do normally for my own work: on most of the machines on which I work, the filesystem with user directories is not large enough (or does not have enough free space) to support the amount of stuff our suite runs produce, so I symlink it to a directory on a data-focused filesystem. But in the user’s case, the default location can’t be used because our suite is being run in a large HPC environment and user home directories are not available on the compute nodes to which PBS-queued tasks are submitted (most of our tasks are background rather than run via PBS; but the heavy number-crunching tasks get handled by PBS).

The workflow is running locally; global.rc:[suite servers] is not set. CYLC_SUITE_HOST is set to the local login node upon which the suite run was started and where the Cylc server is running.

Re: the contents of the two run directories, there is something interesting going on. The entries in the one I don’t expect, /home/myuser/cylc-run/, are not symlinks: they are actual directories for suite runs. Each of those run directories (unexpected, given the global.rc settings) contains lib/, log/, and .service/ subdirectories, the cylc-suite.db symlink to log/db, and the suite.rc.processed file. That’s it: no share/ or work/ subdirectories. The entries in the one I do expect, /foo/cylc-run/, are also actual directories for (the same) suite runs, but their contents are a little odd. There, share/ and work/ are real subdirectories, but lib/, log/, suite.rc.processed, and cylc-suite.db are all symlinks to the corresponding items in the other run directory under /home/myuser/cylc-run/, which seems really, really strange to me.
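
A listing like the one described above can be produced with a short script along these lines. This is only a sketch using the standard library; describe_entries is a made-up name, and the directory to inspect is whatever run directory is of interest.

```python
import os


def describe_entries(directory):
    """Map each entry name in `directory` to a short description.

    Symlinks are reported with their target; they are checked first so a
    symlink pointing at a directory is not reported as a plain directory.
    """
    report = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            report[name] = "symlink -> " + os.readlink(path)
        elif os.path.isdir(path):
            report[name] = "directory"
        else:
            report[name] = "file"
    return report


if __name__ == "__main__":
    import sys

    for name, kind in describe_entries(sys.argv[1]).items():
        print("{}: {}".format(name, kind))
```

Running it against the suite's directory under both /home/myuser/cylc-run/ and /foo/cylc-run/ shows which side holds the real lib/ and log/ and which side only links to them.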

The user runs other Cylc 7-based suites developed by organizations other than ours, and always does it in this fashion, with the work and run directories reassigned in global.rc. With those other suites, the problem we’re seeing (where task scripts can’t find files because CYLC_SUITE_RUN_DIR points to the wrong location) has never arisen. That encourages me to think this problem may just be something I’m doing wrong and is easily solvable; but it may just be that the other suites don’t use CYLC_SUITE_RUN_DIR in their task scripts like we do, and they’d have the same problem if they did. I dunno.

Thanks!

You can handle this kind of setup through a combination of symlinking the cylc-run directory and redefining a couple of environment variables using global-init-script. See “Running cylc tasks on compute nodes that cannot see /home”, post #14 by dpmatthews.