Setting up a Platform Config for machines that share a home directory

Okay, I posted about this error here and have been attempting to figure it out, but no dice so far.

Summary of the problem

I’m trying to set up a platform configuration to run codes on internal machines with the following details:

  • there are 4 internal machines that all share a home directory
  • two of them have access to raw PBS commands (the “back end” machines), while the other two only have access to a wrapper around them (the “front end” machines). All four machines have access to this wrapper (called jobctl), which was implemented to allow for remote submission.

As of now, I am trying to get this simple flow to run:

[scheduler]
    UTC mode = True

[scheduling]
    # Stop the workflow 6 hours after the initial cycle point.
    initial cycle point = 2000-01-01T00
    final cycle point = +PT6H
    [[graph]]
        PT1H = """
            find_diamonds => sell_diamonds
            sell_diamonds[-PT1H] => find_diamonds
        """

[runtime]
    [[find_diamonds]]
        platform = ppp3
        script = find_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development

    [[sell_diamonds]]
        platform = ppp3
        script = sell_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development

where the ppp3 platform is defined as

[platforms]
    [[ppp3]]
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        install target = ppp3
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl

and jobctl.py is:

from cylc.flow.job_runner_handlers.pbs import PBSHandler

class JOBCTLHandler(PBSHandler):
    """
        Given that jobctl is just a wrapper around PBS, we _should_
        only need to alter the commands used, as the PBS directives
        should be supported.
    """
    POLL_CMD        = "jobst"
    KILL_CMD_TMPL   = "jobdel '%(job_id)s'"
    SUBMIT_CMD_TMPL = "jobsub '%(job)s'"

JOB_RUNNER_HANDLER = JOBCTLHandler()

When I try to install/play the flow, I get the following error:

    STDOUT:
        REMOTE INIT FAILED
        Unexpected key directory exists: /home/rcs001/cylc-run/batch-test/run12/.service/client_public_keys Check global.cylc install target is configured correctly for this platform.

After wrestling with this for a while, thinking it was a problem with the way the ssh command was working, I realized that this might be caused by the fact that these machines (including the workflow host) all share a $HOME directory, and thus the cylc-run directory already exists - which would explain why it’s failing due to the client_public_keys directory already existing?

I tried looking at the source for cylc remote-init and it looks to me like it’s trying to effectively install files on the remote host? Looking through the platform config documentation, it looks like you can define a shared filesystem, but the example seems to assume that the shared filesystem is outside the $HOME directory, and thus the cylc-run directory still needs to be created?

Is there a way I can set up the configuration so I can use remote machines/submission, but make it know the $HOME directories are shared?

Hi @Clint_Seinen

I think we might need a new User Guide section on this, to make it easier for new users (if so, sorry about that) … looks like it is only in the Reference section of the docs. (This relates to new features in Cylc 8, so the docs are very new).

Cylc installs workflow files onto platforms, which typically do have shared filesystems of course. We don’t want to install onto one host then re-install over the top of the same files via another host.

You just need to define install targets for your platforms in the global config. It’s documented in the Reference section here: Platform Configuration — Cylc 8.0rc1 documentation
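
For machines that share the scheduler host’s filesystem, that boils down to something like this (a minimal sketch based on the ppp3 definition above; install target = localhost is the key bit, as it tells Cylc the files are already reachable from those hosts, so no remote file installation is attempted):

[platforms]
    [[ppp3]]
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl
        # shares $HOME (and hence ~/cylc-run) with the scheduler host
        install target = localhost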

Let us know if it’s still not clear.

thanks @hilary.j.oliver

In classic fashion, shortly after posting about this, I thought to try setting install target = localhost for these platforms, and it seems I’m now progressing to new errors! Baby steps, but it’s progress.

Re: the platform configuration - yeah, I think a detailed user guide section on this would be great. I did find the platform configuration section in the reference section, but was still confused by it. That said, given (by definition) that the platform configuration needs to be unique for each platform that someone might be running on, I can understand that it’s hard to make it comprehensive!

It wasn’t until I looked at the source code that I truly understood what install target meant, but after that, things became clear.

Hey @hilary.j.oliver - so is setting install target = localhost the right way to handle this? This allowed us to progress, but we now want to get larger jobs to run, which means we don’t want jobs to run in the home directories on these machines.

As such, looking here it seems like I should be able to set this in global.cylc:

...
[install]
   [[symlink dirs]]
      [[[ppp3]]]
           work = /path/to/large/scratch/space

but then I realize this might only have an effect if cylc install is run on the ppp3 platform at runtime, no? Which won’t happen if I set install target = localhost?

Note that the scratch space is different for all our machines, so we’d like to be able to set a different location for each one.

Basically, what we want to achieve is a shared location for the main “run directory”, but then use platform-specific locations for the given tasks, depending on which platform is used.

The cylc install command installs workflow files to a new run directory, on the scheduler platform.

Workflow files get installed onto a job platform the first time a job needs to be submitted there.

You should set install target = localhost for any job platforms that share a filesystem with the scheduler host. Then Cylc will know it does not need to install files (again) to the job platform.

If a job platform does not share a filesystem with the scheduler host, then you should set a different install target for that platform (or, let it default to the platform name if that works).

The run directory (and/or run sub-directory) symlinks allow us to store workflow files (and data generated by workflow tasks) elsewhere, whilst still being easily accessible via the standard ~/cylc-run locations. You can indeed symlink to different locations on different platforms. Again, this gets set up on job platforms the first time a job needs to run there.

Here’s a small example, to see how it works. My workflow has a local task and a remote task on a different filesystem. Both tasks run a script in the workflow source bin directory, so that has to be installed on both platforms.

The workflow:

$ cat flow.cylc
[scheduling]
   [[graph]]
      R1 = "cat => dog"
[runtime]
   [[root]]
      script = hello.sh
   [[cat]]
      platform = local
   [[dog]]
      platform = remote

The source directory:

$ tree ~/cylc-src/example
/home/oliverh/cylc-src/example
├── bin
│   └── hello.sh
└── flow.cylc

The global config:

$ cylc config
[install]
    [[symlink dirs]]
        [[[localhost]]]  # install target
            run = /tmp/${USER}/CAT
        [[[hpc1]]]  # install target
            run = /tmp/${USER}/DOG
[platforms]
    [[local]]
        hosts = localhost
        install target = localhost  # (default)
    [[remote]]
        hosts = hpc1, hpc2
        install target = hpc1

If you adapt this to your system and run it, you should see the local run directory files and symlinks appear immediately on cylc install, and the remote ones at runtime, just before the second (dog) task runs.

Thanks for the response @hilary.j.oliver - I’m wading through this with @Clint_Seinen. I think part of the struggle here is that our systems do not fall cleanly into the “share a filesystem” or “don’t share a filesystem” categories. They do share $HOME, but they do not share $SCRATCH.

To specify different run locations, we need to specify different install targets. But if we do that, job submission fails with a key error, due to the common $HOME (i.e. $HOME/cylc-run). If we specify a common install target then submission works, but there is no clear way to assign different run/work directories to the different platforms.

Perhaps I’m missing something, but as far as I understand the example above, there are two install targets (localhost / hpc1), and therefore two different assumed homes. What we seem to need are two install targets sharing a common $HOME, or alternatively, a way to specify unique work directories for different platforms sharing a common install target.

Admittedly our hpc system is quirky in that the $HOME filesystems are shared while other filesystems are not.

Our global config might look like:

[platforms]
    [[daley]]
        hosts = xc4elogin1
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        job runner = pbs
        install target = daley
    [[ppp4]]
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        hosts = eccc1-ppp4, eccc2-ppp4, eccc3-ppp4
        job runner = jobctl
        install target = ppp4
    [[hpcr-vis]]
        hosts = hpcr3-vis1
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        install target = localhost 

[install]
    [[symlink dirs]]
        [[[daley]]]
            run = "/space/hall4/work/eccc/crd/ccrn/users/ncs001/"
        [[[ppp4]]]
            run = "/space/hall4/sitestore/eccc/crd/ccrn/users/ncs001/"
        [[[localhost]]]
            run = "/home/ords/crd/ccrn/ncs001"

However, as mentioned, this would fail with an error like:

Unexpected key directory exists: /home/ncs001/cylc-run/pets/run1/.service/client_public_keys

Just a note @swartn and @hilary.j.oliver - I think that under @swartn’s scenario the Unexpected key directory error is coming from STDOUT; from STDERR @swartn should be seeing something like

Unable to create symlink to /space/hall4/work/eccc/crd/ccrn/users/ncs001/cylc-run/pets/run1. The path /home/ncs001/cylc-run/pets/run1

OK, interesting. So you have platforms that share /home only, but no room for a lot of data there?

Can you get by for a short while without symlinking, for testing purposes?

Seems to me we should be able to support that, and perhaps provide a temporary workaround. Unfortunately I’m out for the rest of the day though … you might have to wait until the UK team comes online (it’s currently midnight there).

Yes exactly @hilary.j.oliver. We’re still in the early stages of testing here, but we’ve gotten the model running using a single platform. I think I can mess with setting the global init-script (as noted here) and try to set a unique CYLC_RUN_DIR for each platform, which might give us a way to handle this. I will report back if it works.

Okay - no dice, but I think I’m misunderstanding some of these constructs.

In summary, I tried this:

# ~/.cylc/flow/global.cylc
[platforms]
   [[ppp3]]
      cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
      job runner = jobctl # note that I've written a job runner handler for the system's unique queuing commands
      install target = ppp3
      global init-script = """
            export CYLC_RUN_DIR=/home/${USER}/cylc-dirs/ppp3/cylc-run
        """
   [[banting]]
      cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
      job runner = pbs  # this platform uses raw pbs
      install target = banting
      global init-script = """
            export CYLC_RUN_DIR=/home/${USER}/cylc-dirs/banting/cylc-run
        """

[install]
   [[symlink dirs]]
      [[[ppp3]]]
         run = /home/${USER}/cylc-dirs/ppp3
      [[[banting]]]
         run = /home/${USER}/cylc-dirs/banting

with the hope that this would result in unique cylc-run directories under the cylc-dirs, i.e.

|- ~/cylc-dirs/
      |-ppp3/
         |- cylc-run/
      |-banting/
         |- cylc-run/

and stop the clashing of shared files in the cylc-run directory, but I still get the same errors noted above.

As noted, I was trying to mimic something like this but after looking at the source, it seems like CYLC_RUN_DIR is only used in the job.sh file? And not in the remote init script? I note that the linked example looks to have been built for clusters where the compute node doesn’t have a $HOME directory, so this was mostly just a shot in the dark :slight_smile:

Note that if the source has to be changed to allow a robust solution, we’re happy to help in any way (i.e. make changes and submit a PR)!

Having a quick think about this in some down-time …

I’m a bit out of my depth here, but presumably a symlink is a filesystem object/property, not an OS one.

If so, that would suggest it’s not possible for the same location under /home to be symlinked to different locations depending on which host you’re logged in to.

And if so, does that mean all symlinks pointing out of /home on your system are broken unless you happen to be logged into the right host??

UPDATE: as a sort of bass-ackwards test of this, it certainly is the case that on a “normal” shared FS cluster, symlinks pointing to node-local storage (e.g. somewhere under /tmp) have the same value when accessed from other nodes, and are thus broken there.

Thanks, that’s great. If my previous comment makes any sense we might have to put some more thought into how to support a system like this, first.

@hilary.j.oliver, yes correct. If I create a symlink in $HOME to what is $SCRATCH on a given machine, that link is only valid when logged into that machine. The real situation is actually more complex across our broader group of machines, but the statement above is true in some cases and the reason we can’t just list all as the same install target.

The shared home across different machines is certainly non-standard. It is quite strange indeed when you consider that the machines even run different OSs and have quite different properties, yet users only have a single home and a single ~/.profile. As a result the admins have had to engineer a bespoke mechanism for defining separate environments for each machine - one that automagically gets sourced under ~/.profile - it is basically a fancy “if machine X, then …” switch.

With Cylc 8 there is a built-in assumption that all the workflow files are installed in $HOME/cylc-run (with symlinks to other filesystems as necessary). Any symlinks you configure need to work on all the platforms that share that $HOME. So, for the moment at least, I don’t think you can use the built-in Cylc symlink support with your setup.

To avoid putting all the files in $HOME I think you’re going to need to tackle this at workflow runtime. For example, you could define a standard env-script (in the root family) which creates any necessary directories based on the host name (possibly in combination with err-script and exit-script to do any required clean-up).
https://cylc.github.io/cylc-doc/latest/html/user-guide/task-implementation/job-scripts.html#task-job-scripts
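
For illustration only, a rough sketch of that kind of env-script (the scratch paths, hostname patterns, and the TASK_SCRATCH variable are all hypothetical placeholders, and it assumes the standard CYLC_* task variables are already defined by the time env-script runs):

[runtime]
    [[root]]
        env-script = """
            # Pick a scratch root according to which machine the job landed on
            # (hypothetical paths - substitute your real per-machine scratch space)
            case "$(hostname)" in
                *ppp*)     SCRATCH_ROOT=/space/hall4/sitestore/${USER} ;;
                *banting*) SCRATCH_ROOT=/space/hall4/work/${USER} ;;
                *)         SCRATCH_ROOT=${HOME} ;;
            esac
            export TASK_SCRATCH=${SCRATCH_ROOT}/${CYLC_WORKFLOW_ID}/${CYLC_TASK_ID}
            mkdir -p "${TASK_SCRATCH}"
        """
        exit-script = """
            # Optional clean-up of the scratch area once the task succeeds
            rm -rf "${TASK_SCRATCH}"
        """

Task scripts could then cd into (or write to) $TASK_SCRATCH rather than the default work directory under $HOME.
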
The main disadvantages of this are:

  1. This has to be defined as part of each workflow definition - it’s not something that can be configured globally.
  2. Cylc doesn’t know about these locations so they won’t get removed by cylc clean.
  3. The log files are still going to have to go into $HOME but hopefully it is big enough to cope with that?

Does that make sense?
Perhaps we can support this kind of setup better in the future but, at the moment, it’s not obvious to me how best to do this.

If the ppp3 and banting install targets share the same /home/, this cannot work. Your expected file tree under ~/cylc-dirs would be correct for the symlink targets; however, the symlinks themselves are both trying to be ~/cylc-run/your-workflow, assuming you’re using both platforms in your workflow.

I.e. Cylc will be trying to do

~/cylc-run/your-workflow -> ~/cylc-dirs/ppp3/cylc-run/your-workflow
~/cylc-run/your-workflow -> ~/cylc-dirs/banting/cylc-run/your-workflow

which is impossible.

This is what Hilary was talking about when he said

it’s not possible for the same location under /home to be symlinked to different locations depending on which host you’re logged in to

Thanks @dpmatthews. Yep, that makes sense; the log files could likely live in $HOME, and I think the downsides you note seem manageable. It is not immediately clear to me, though, what we would need to do to stop Cylc’s default behaviour of attempting to create the links at install time - any tips appreciated.

On an alternative tack, we note, as you say, that the use of $HOME/cylc-run is hard-coded. Excuse my complete ignorance about Cylc internals at this point, but could we attempt to allow _CYLC_RUN_DIR to optionally be set in the global config, on a per-platform basis? It sounds from other threads on ARCHER2 like this was something that was somewhat possible in Cylc 7, but the reasons it was an issue and got hard-coded in Cylc 8 are not clear. Thanks.

could we attempt to allow _CYLC_RUN_DIR to optionally be set in the global config, on a per-platform basis

This was partially supported at Cylc 7; however, it was difficult to support and made aspects of Cylc’s internal logic complex.

Cylc has to interface with remote platforms, transfer files between them, etc. It makes things a lot easier for Cylc if the filepaths are relative to $HOME as this makes them portable.

The ARCHER2-like problems can be solved using the “symlink dirs” feature (the bulk of the issue with supporting ARCHER2 was the dependence on $HOME in the Cylc job script, which we have now resolved).

@swartn Cylc is only trying to create symlinks because that is what you’ve configured in global.cylc - remove the [[symlink dirs]] section and it will revert to just creating normal directories.

I think we can live with “relative to $HOME” - and keep that simplicity. If we could just (optionally) add a platform name between $HOME and cylc-run, it naively seems like that might solve this issue, as there would be no conflicting links.

@dpmatthews - right, of course; we’re very green, thanks. So I guess the idea is we give all platforms the same install target, since they share home, then manually configure the work directories as you suggest. We will have to figure out how to get Cylc to use those directories, though, or how to make sure the relevant paths get to the scheduler.
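
Putting the two suggestions together, I think our global config would then reduce to something like this (a sketch using the same hostnames and cylc paths as our example above; there is no [[symlink dirs]] section, so run directories and logs stay under the shared $HOME, and the per-machine scratch space would be handled inside the workflow, e.g. via the root env-script approach @dpmatthews suggested):

[platforms]
    [[daley]]
        hosts = xc4elogin1
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        job runner = pbs
        install target = localhost
    [[ppp4]]
        hosts = eccc1-ppp4, eccc2-ppp4, eccc3-ppp4
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        job runner = jobctl
        install target = localhost
    [[hpcr-vis]]
        hosts = hpcr3-vis1
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        install target = localhost
# no [install][[symlink dirs]] section: everything lands in the shared ~/cylc-run,
# and large task data is redirected to per-machine scratch space at runtime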