Symlink dirs not working for Slurm jobs requested on a node other than localhost

Hi there everyone.

We are new to Cylc and are trying to work out how to submit tasks as Slurm jobs. We are running into an issue when using the --nodelist Slurm directive: things work for Slurm jobs that happen to run on the same host where we run cylc install, but not when we request a different node.

Our platform config:

[platforms]
    [[kapua]]
        hosts = localhost
        job runner = slurm
        install target = localhost

[install]
    [[symlink dirs]]
        [[[localhost]]]
            run = /hub/scratch

Rationale and context: This is an experimental cluster with kap-xeon02 configured as the Slurm controller and kap-xeon02 and kap-epyctest as Slurm nodes. Cylc + Cylchub is installed on kap-xeon02, and we are installing Cylc workflows directly there (we are using kap-xeon02 as a “login node”). The reason for using symlink dirs is that these are not headless nodes, so home directories are not on a shared filesystem, but /hub/scratch is.

The workflow file is:

[scheduler]
    UTC mode = True

[scheduling]
    initial cycle point = now
  
    [[xtriggers]]
        delay3h = wall_clock(offset=PT3H)
        delay8h = wall_clock(offset=PT8H)

    [[graph]]
        # Repeat every 12 h starting at the initial cycle point.
        PT12H = """
            @delay8h => roms_saus2d_pre => roms_saus2d_run => roms_saus2d_post 
        """

        # Repeat every 12 h starting 12 h after the initial cycle point.
        +PT12H/PT12H = """
            @delay8h => roms_saus2d_post[-PT12H] => roms_saus2d_pre => roms_saus2d_run => roms_saus2d_post
        """

[runtime]
    [[roms_saus2d_pre]]
        script = roms_preproc.sh
        [[[environment]]]
            MODEL_CONFIG = model.roms_gfs_saus2d.yml
    
    [[roms_saus2d_run]]
        script = roms_run.sh
        platform = kapua
        [[[environment]]]
            ROMSBIN = ROMS2D_NO-TIDE_NO-OBC
            SCRATCH_DIR = /hub/scratch/forecast/roms/gfs_saus2d
            SLURM_TASKS = 12
        [[[directives]]]
            --mem = 8G
            --ntasks = 12
            --cpus-per-task = 1
            --propagate = STACK
            --partition = mainQ
            --output = /hub/scratch/forecast/roms/gfs_saus2d/roms.out
            --error = /hub/scratch/forecast/roms/gfs_saus2d/roms.err
            --nodelist = kap-epyctest

    [[roms_saus2d_post]]
        script = roms_postproc.sh
        [[[environment]]]
            MODEL_CONFIG = model.roms_gfs_saus2d.yml

Everything runs fine without the --nodelist = kap-epyctest directive: the Slurm job runs on kap-xeon02, as it has enough resources, and the workflow succeeds. But when we add --nodelist = kap-epyctest, we get this error on job submission:

cat /hub/scratch/forecast/roms/gfs_saus2d/roms.err
/var/lib/slurm-llnl/slurmd/job00311/slurm_script: line 65: /home/metocean/cylc-run/ops-tiny/run1/.service/etc/job.sh: No such file or directory
/var/lib/slurm-llnl/slurmd/job00311/slurm_script: line 66: cylc__job__main: command not found

I expected the symlink dirs to fix that, but it didn’t work. We have this global.cylc file at /etc/cylc/flow/global.cylc on both servers.

Is this the correct way of configuring the platform? Maybe cylc expects home directories to be in a shared filesystem?

Thanks in advance! :slight_smile:

Yes, Cylc expects all the nodes in a platform to share home directories. The problem is that, although Cylc supports setting up symlinks to refer to different filesystems, the locations are still referenced via $HOME. However, we have seen this kind of problem before, so you should be able to work around it by adding the following to your platform config:

[platforms]
    [[kapua]]
        global init-script = "export HOME=/hub/scratch"

I’m rather surprised your job ran at all; I would have expected Slurm to complain about the location of the stdout / stderr files. If you encounter this problem, you can fix it using the Slurm directive --chdir=/hub/scratch.
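
As a rough sketch, that directive would sit alongside the other Slurm directives of the task in the workflow file. The task name and path are taken from the question above; whether /hub/scratch is the right working directory for your jobs is an assumption to check on your cluster.

```
[runtime]
    [[roms_saus2d_run]]
        [[[directives]]]
            # Start the job in a directory that exists on every node
            # (assumption: /hub/scratch is on the shared filesystem)
            --chdir = /hub/scratch
```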

Note that, if you need to support multiple users, you’re going to need to use a user-specific directory, e.g. /hub/scratch/${USER}
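
For the multi-user case, a minimal sketch of the global config might look like the following. It assumes that /hub/scratch/${USER} already exists on the shared filesystem for each user, that USER is set in the job environment, and that the symlink dirs target is changed to match the HOME override so both hosts resolve the same run directory. It has not been tested on your cluster.

```
[platforms]
    [[kapua]]
        hosts = localhost
        job runner = slurm
        install target = localhost
        # Assumption: per-user directory on the shared filesystem
        global init-script = "export HOME=/hub/scratch/${USER}"

[install]
    [[symlink dirs]]
        [[[localhost]]]
            # Assumption: kept in line with the HOME override above
            run = /hub/scratch/${USER}
```

With this layout, the run directory would be expected under /hub/scratch/${USER}/cylc-run/ on every node, which is the same location the job script looks for via $HOME.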


Documentation link: Platform Configuration (Cylc 8.4.2 documentation)


Thanks for your quick reply @dpmatthews. Yes, I believe we had it working on localhost because we did indeed have --chdir set.

Sounds like global init-script = "export HOME=/hub/scratch" is exactly what we need. We’ll try that. Many thanks!
