Setting up a Platform Config for machines that share a home directory

It might be possible to support this in the future but it would be a non-trivial enhancement. It would also lead to some duplication of files and additional work (remote installation and log retrieval) so would have some downsides.

That’s right.

Cylc won’t know anything about those directories but it doesn’t need to. The env-script becomes part of every job - you just need to define the appropriate logic (e.g. creation of directories, changing directory, setting environment variables).

Okay, thanks @dpmatthews - I’ll look at creating a workaround in the workflow implementation using these user-defined scripts.

As you say, I think I can easily come up with a way to clean the temporary scratch directories in the err-script or exit-script, but we may want to keep the temporary directories around until a few cycle points ahead, so I’ll probably add some scripting to call cylc clean intelligently at the desired points. From looking at cylc clean, it doesn’t look like there is a way to inject special handling into it?

If not, I think we can add something to the scripting that calls cylc clean to delete the scratch directories. Obviously this would be a workaround for now, but I’m thinking it should get us off to the races to test more aspects of our model pipeline.

So I’ve set up the following script:

#!/bin/bash
# setup-scratch-space
# Create a per-task scratch directory, symlink to it from the work
# directory, and move into it.
export JOB_SCRATCH_SPACE=${WORK_SPACE}/tmp_${USER}_cylc_${CYLC_TASK_NAME}_${CYLC_TASK_CYCLE_POINT}_$(date +"%Y%m%d%H%M")
mkdir -p "$JOB_SCRATCH_SPACE"
ln -s "$JOB_SCRATCH_SPACE"
cd "$JOB_SCRATCH_SPACE"

and then added this to my flow.cylc:

[runtime]
   [[root]]
      env-script = setup-scratch-space
   
   [[find_diamonds]]
      platform = ppp3
      script = find_diamonds.sh

where find_diamonds.sh contains

sleep 15
echo "I'm on $(hostname) at $(pwd)" >> out.txt

and the ppp3 platform is defined as

[platforms]
   [[ppp3]]
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        job runner = jobctl
        install target = localhost
        global init-script = """
            export WORK_SPACE=/space/hall3/sitestore/eccc/crd/ccrn_tmp
        """

Then, after getting the flow to run, when I look at the resulting work directory for find_diamonds, I see that the JOB_SCRATCH_SPACE directory is created, along with the symlink to it, but the out.txt file is still created in runN/work/${CYLC_TASK_CYCLE_POINT}/find_diamonds, and the message in it says the find_diamonds.sh script was called in that same directory. So it seems that the cd command in the env-script isn’t applying to where the task script runs?

Looking at the documentation it looks like it should be respected? Is there some other consideration I need here?

Also, what is the difference in use case for the init-script vs env-script? It looks like the init-script kind of runs outside of the normal task environment? Beyond that I’m not super sure on the differences.

Update: I’ve added a simple echo $(pwd) statement to check the directory after each user script is executed, and the cd command is definitely not respected. Looking further at job.sh and the generated job file, the individual user scripts are run by being found on PATH, which makes a lot of sense. It also tracks with my experience: a cd in an executable script doesn’t change the directory of the calling shell. That would be different if a source command were used, but I’m not sure that’s desirable, as sourcing a bunch of scripts in succession can lead to environment/namespace problems. It’d likely be better for me to just add the cd $JOB_SCRATCH_SPACE to the main script.
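A minimal standalone illustration of that subshell behaviour (the helper name and paths here are hypothetical, not part of the workflow above):

```shell
#!/bin/bash
# A `cd` inside an executable helper runs in a child process and cannot
# move the calling shell; `source` runs it in the current shell, so the
# `cd` sticks.
workdir=$(mktemp -d)
printf 'cd /tmp\n' > "$workdir/go-to-tmp"
chmod +x "$workdir/go-to-tmp"

cd "$workdir"
"$workdir/go-to-tmp"            # child process: our cwd is unchanged
echo "after executing: $PWD"

source "$workdir/go-to-tmp"     # current shell: cwd changes
echo "after sourcing: $PWD"
```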

Here’s a small self-contained local-job example that shows the directory change is respected. So we just need to figure out what’s going on in your example.

[scheduling]
   [[graph]]
       R1 = foo
[runtime]
   [[root]]
       env-script = """
          cylc__job__wait_cylc_message_started
          WORKSPACE=/tmp/$USER/$CYLC_WORKFLOW_ID/$CYLC_TASK_JOB
          echo env-script 1...$PWD
          mkdir -p $WORKSPACE
          cd $WORKSPACE
          echo env-script 2...$PWD
       """

       [[[environment]]]
   [[foo]]
       init-script = "echo init-script: $PWD"
       script = """
          echo "script.........$PWD"
          echo "the quick brown fox" > out.txt
          cat ${WORKSPACE}/out.txt
       """

Run it…

cylc-src/clint$ cylc install && cylc play -n clint

… and check the result:

cylc-src/clint$ cylc cat-log clint//1/foo

init-script: /home/oliverh/cylc-src/clint
Workflow : clint/run3
Task Job : 1/foo/01 (try 1)
User@Host: oliverh@niwa-1007823l

2022-03-09T12:31:36+13:00 INFO - started

env-script 1.../home/oliverh/cylc-run/clint/run3/work/1/foo
env-script 2.../tmp/oliverh/clint/run3/1/foo/01
script........./tmp/oliverh/clint/run3/1/foo/01
the quick brown fox

2022-03-09T12:31:38+13:00 INFO - succeeded
As the diagram in the documentation shows, an init-script can (e.g.) set variables that affect the Cylc-defined environment; and an env-script can affect the user-defined environment, for the job. Most users don’t use either, but they’re available if needed.
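For concreteness, a rough sketch of where each hook lands in the generated job script (the ordering follows the documentation’s diagram; the values shown are illustrative, not from the workflow above):

```
# Approximate job-script ordering:
#   init-script -> Cylc-defined environment -> env-script
#   -> user [environment] -> pre-script -> script -> post-script
[runtime]
    [[root]]
        # runs before any CYLC_* variables are defined
        init-script = "echo init-script, before the Cylc environment"
        # runs after the Cylc environment, before [environment] below
        env-script = "export WORK_SPACE=/illustrative/scratch/$USER"
        [[[environment]]]
            # can refer to variables exported by env-script
            RUN_DIR = $WORK_SPACE/run
```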

Hey @hilary.j.oliver I think I nailed down why the cd wasn’t being respected - I was trying to just run an executable script as part of env-script, i.e. with

env-script = setup-scratch-space

This of course won’t work to set any environment variables or change directory, because the script is run as an executable in a child process. But when I change it to env-script = source setup-scratch-space, it works as intended :smile: !

Thanks for all the help - I think @swartn and I have a good path forward now

Haha, I was just about to say the same! As an executable, it runs in a subshell and does not affect the parent shell.

Hopefully you’re on the right track now :tada:

For your use case I’m afraid it won’t be possible for cylc clean to remove your scratch dirs. cylc clean does not follow user-created symlinks, so it will remove the symlink itself but not touch the target (this is in contrast to the standard Cylc symlink dirs, which it does follow).

Thanks @MetRonnie - our plan for now is to just add a temporary function, run before we call cylc clean, to get rid of the temporary directories. It should give us a workaround for all the issues noted above until a more robust solution can be implemented.
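As a rough illustration only, a sketch of such a clean-up step, assuming the tmp_${USER}_cylc_* naming from the setup-scratch-space script above (the mkdir line just simulates a leftover directory so the snippet is self-contained, and the cylc clean call is left as a comment since it should only run once the workflow is stopped):

```shell
#!/bin/bash
# Delete the scratch dirs (symlink targets that `cylc clean` won't
# follow), then clean the run directory.
USER="${USER:-demo}"
WORK_SPACE="${WORK_SPACE:-$(mktemp -d)}"   # demo fallback for a dry run

# simulate a leftover scratch dir so the snippet runs standalone
mkdir -p "${WORK_SPACE}/tmp_${USER}_cylc_find_diamonds_1_202203091200"

# remove everything created by setup-scratch-space
rm -rf "${WORK_SPACE}/tmp_${USER}_cylc_"*

if [ -z "$(ls -A "$WORK_SPACE")" ]; then
    echo "scratch dirs removed"
fi
# then, with the workflow stopped:
#   cylc clean <workflow-id>
```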

@Clint_Seinen and @swartn - regarding a more robust solution in the longer term:

It is tempting to suggest you have a strangely-configured platform (no offense :grin:) but I can see why some might view shared-home-only as a good idea, and in any case I imagine you have no control over that.

$HOME/cylc-run with underlying symlinks to data areas is very nice for several reasons: everyone (Cylc, users, and 3rd party applications) knows exactly where to find workflow files; and when interacting with remote platforms ssh automatically lands you in the home directory.

But that aside, we think it would be feasible to build in support for a configurable top-level cylc-run location per install target (e.g. /scratch/$USER/cylc-run instead of $HOME/cylc-run).

It wouldn’t be a trivial development, but this is an Open Source project so you’re welcome to propose a solution (via a new Cylc repository Issue) and, post discussion and agreement, work on implementation.

Regards,
Hilary

If you really want to support this I’d suggest sticking with just being able to change the cylc-run bit (i.e. assume $HOME). Otherwise you hit the issue of not knowing $USER on the remote platform (or other environment variables defined on that platform).

Thanks @hilary.j.oliver and @dpmatthews for all your help! @swartn and I are currently in a sprint to implement our entire model pipeline in cylc and making relatively fast progress now that we’ve figured out a way to handle the temporary directories. Once we get things in a good state we are happy to revisit things and we’ll see what we can do :smile:

That said, now that we’re moving on this, we actually like having everything in one cylc-run directory, even if some of the underlying links are only valid on certain machines - makes it really easy to see what is going on.

Thanks to everyone for their help in this thread. We have made good progress and have a complex model/diagnostic pipeline running across multiple machines sharing their home directories, and we are ambitious about porting to other platforms and seeing broader uptake in our organization.

One small issue we have been experiencing is with tui/gui updates across hosts. This question might belong in another thread, but what we see is: we have a host named daley, and its login nodes have names like xc4elogin1. If we cylc play on daley and then try to monitor results from a different machine, we get an error like socket.gaierror: [Errno -2] Name or service not known: 'xc4elogin1'.

Indeed: socket.gethostbyname_ex('xc4elogin1') returns an error. However, socket.gethostbyname_ex('daley') functions as expected. Is there any configuration guidance that might help with this? I note if we play the workflow originally from the other machine, everything seems to work as expected.
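One way to see the mismatch is a quick check on each machine of whether the name this host advertises (what hostname -f returns, which Cylc records for later contact) actually resolves locally; a hedged diagnostic sketch:

```shell
#!/bin/bash
# Print the name this host will advertise and check whether it
# resolves here. Run on each machine involved and compare.
name="$(hostname -f)"
echo "advertised name: $name"
if getent hosts "$name" > /dev/null 2>&1; then
    echo "resolvable here: yes"
else
    echo "resolvable here: no"
fi
```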

That implies that hostname -f is returning xc4elogin1 but that name isn’t valid on the other hosts.
(See Https messaging fails, communication error between HPC and cylc hosts - #3 by oliver.sanders)
You can hardwire the name in the global config but that then limits you to always running on that host. You can also try using IP addresses instead of host names.
See Global Configuration — Cylc 8.0rc1 documentation

Alternatively you can configure which hosts you want the scheduler to run on and only choose ones which don’t suffer this problem: Global Configuration — Cylc 8.0rc1 documentation
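For reference, a sketch of the relevant global.cylc settings covering both suggestions (daley is the host name from the example above; treat the values as illustrative):

```
[scheduler]
    [[run hosts]]
        # only start schedulers on hosts whose names resolve
        # from the machines you monitor from
        available = daley
    [[host self-identification]]
        # advertise an IP address instead of `hostname -f`
        method = address
```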

Perfect - thank you @dpmatthews!