Wrong rsync command

Dear cylc community,

Sorry in advance for the long post. As a new user, I wasn’t sure how much detail to include.

I’m trying to set up cylc-flow for groups in my institution. The documentation has been sufficient so far, but I’m now stuck on an rsync error I found in the log. Here is my setup so far:

  • I’m running cylc version 8.0.1
  • I define the workflow on my laptop
  • I’d like to run the tasks on an HPC machine with host name daint, equipped with the slurm scheduler.
  • cylc is installed on daint
  • My global.cylc file looks like this
[platforms]
    [[daint]]
        hosts = daint
        job runner = slurm
        install target = daint
        retrieve job logs = True
        communication method = ssh
        cylc path = /users/leclairm/.local/bin
    [[daint_bg]]
        hosts = daint
        job runner = background
        install target = daint_bg
        retrieve job logs = True
        communication method = ssh
        cylc path = /users/leclairm/.local/bin

[install]
    source dirs = ~/cylc-src, /project/pr133/leclairm/cylc-src, /project/g110/cylc-src
    [[symlink dirs]]
        [[[daint]]]
            run = /scratch/snx3000/leclairm/
        [[[daint_bg]]]
            run = /scratch/snx3000/leclairm/     
  • My flow.cylc, named ROMSOC, looks like this:
[scheduler]
    UTC mode = True
    allow implicit tasks = True
[scheduling]
    initial cycle point = 2000-01-01T00Z
    final cycle point = 2000-05-01T00Z
    [[graph]]
        R1 = """
           Prepare_input_data_for_ROMS => ROMSOC
           Extpar => INT2LM
        """
        P1M = """
            download_data_from_DKRZ => INT2LM
            clean_COSMO_input_data[-P2M] => download_data_from_DKRZ
            INT2LM => ROMSOC => Post_Process_COSMO & Post_Process_ROMS & clean_COSMO_input_data
            ROMSOC[-P1M] => ROMSOC
        """
[runtime]
    [[root]]
        platform = daint
        [[[directives]]]
            --constraint = gpu
            --account = g110
    [[Prepare_input_data_for_ROMS]]
        script = """
            echo "running Prepare_input_data_for_ROMS"
            sleep 1
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:01
    [[ROMSOC]]
        script = """
            echo "running ROMSOC"
            sleep 10
            """
        [[[directives]]]
            --nodes = 2
            --time = 00:00:05
    [[Extpar]]
        script = """
            echo "running Extpar"
            sleep 2
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:01
    [[INT2LM]]
        script = """
            echo "running INT2LM"
            sleep 2
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:02
    [[download_data_from_DKRZ]]
        script = """
            echo "running download_data_from_DKRZ"
            sleep 2
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:02
    [[clean_COSMO_input_data]]
        script = """
            echo "running clean_COSMO_input_data"
            sleep 1
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:01
    [[Post_Process_COSMO]]
        script = """
            echo "running Post_Process_COSMO"
            sleep 2
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:02
    [[Post_Process_ROMS]]
        script = """
            echo "running Post_Process_ROMS"
            sleep 2
            """
        [[[directives]]]
            --nodes = 1
            --time = 00:00:02

cylc install ROMSOC is able to install the workflow on my machine. Then cylc run ROMSOC starts and creates the folder hierarchy and symlinks on the daint platform. Then it fails with the following error I found on the daint platform in ~/cylc-run/ROMSOC/run1/log/scheduler:

2022-08-25T14:50:22Z ERROR - platform: daint - initialisation did not complete
    COMMAND:
        rsync --delete \
            --rsh=ssh -oBatchMode=yes -oConnectTimeout=10 \
            --include=/.service/ --include=/.service/server.key -a \
            --checksum --out-format=%o %n%L --no-t --exclude=log \
            --exclude=share --exclude=work --include=/app/*** \
            --include=/bin/*** --include=/etc/*** --include=/lib/*** \
            --exclude=* /home/matth/cylc-run/ROMSOC/run1/ \
            daint:$HOME/cylc-run/ROMSOC/run1/
    RETURN CODE:
        11
    STDERR:
        rsync: mkdir "/users/leclairm/$HOME/cylc-run/ROMSOC/run1" failed: No such file or directory (2)
        rsync error: error in file IO (code 11) at main.c(664) [Receiver=3.1.3]

Obviously, the rsync command should not include $HOME. Am I missing a configuration step or is that a bug?

I think I found how the '$HOME' string made its way into the rsync command. The first two lines of cylc.flow.pathutil read like this:

# monkeypatching:
_CYLC_RUN_DIR = os.path.join('$HOME', 'cylc-run')

The _CYLC_RUN_DIR variable is then used to construct the string returned by get_remote_workflow_run_dir which is used by cylc.flow.task_remote_mgr.TaskRemoteMgr.file_install to construct the dst_path variable for the rsync command.

Somehow my rsync version doesn’t play well with this way of specifying the destination.

Hello @Matthieu

Welcome and thanks very much for your post. We are currently looking into this.

As a side note, there is a small error in your global.cylc which will need correcting to avoid problems later down the line. The install target for daint and daint_bg should be the same, so the required change is:

[platforms]
    [[daint]]
        hosts = daint
        job runner = slurm
        install target = daint
        retrieve job logs = True
        communication method = ssh
        cylc path = /users/leclairm/.local/bin
    [[daint_bg]]
        hosts = daint
        job runner = background
        install target = daint         # this tells cylc that daint_bg shares a file system with daint
        retrieve job logs = True
        communication method = ssh
        cylc path = /users/leclairm/.local/bin

We will update the documentation to make this clearer.

I will keep you posted when a fix becomes available for you.

Best wishes

Mel

Actually, I don’t really get why this whole machinery to get remote run dirs is necessary, because the file hierarchy under ~/cylc-run is replicated anyhow from the origin to the host under the same root ~/cylc-run. So the resulting rsync command should just be

rsync [...] /absolute/user/home/cylc-run/workflow_name/run1 host:cylc-run/workflow_name/run1

Maybe I’m wrong but I also don’t see how the original rsync command could work on any system as the $HOME expansion can only happen on the origin side.
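
As a rough sketch of what I mean, the destination could be built without any variable at all. The helper name relative_rsync_destination is hypothetical, and the sketch assumes that rsync over ssh resolves a relative remote path against the login directory on the destination host.

```python
import posixpath


def relative_rsync_destination(host: str, workflow_id: str) -> str:
    """Hypothetical helper: destination path relative to the remote home."""
    # No leading slash and no $HOME, so nothing needs expanding on either
    # side. Assumption: the remote path is resolved from the login directory.
    return f"{host}:{posixpath.join('cylc-run', workflow_id)}/"


print(relative_rsync_destination("daint", "ROMSOC/run1"))
```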

Thanks a lot for this Mel!
This makes a lot of sense now. I was a bit puzzled by this install target setting, mostly because of the term “file system” in the docs.

Yes, we are a little confused as to how it is working for us and hasn’t been spotted before!

Note that sections can be repeated in the global config which is useful for simplifying the platforms config. I think the following should work:

[platforms]
    [[daint, daint_bg]]
        hosts = daint
        install target = daint
        retrieve job logs = True
        communication method = ssh
        cylc path = /users/leclairm/.local/bin
    [[daint]]
        job runner = slurm

Also, assuming the cylc command is in your $PATH when you log in to daint, you shouldn’t need to configure cylc path (all remote commands use a login shell by default).

Thanks a lot for the config tips.

And yes, I had figured that out for the cylc path. I was just trying a few things out before narrowing down the issue.

By the way, can I also merge the [install][symlink dirs][daint] and [install][symlink dirs][daint_bg]? So that the whole thing would finally be:

[platforms]
    [[daint, daint_bg]]
        hosts = daint
        install target = daint
        retrieve job logs = True
        communication method = ssh
    [[daint]]
        job runner = slurm

[install]
    source dirs = ~/cylc-src, /project/pr133/leclairm/cylc-src, /project/g110/cylc-src
    [[symlink dirs]]
        [[[daint, daint_bg]]]
            run = /scratch/snx3000/leclairm/

The [symlink dirs] section refers to install targets and both platforms are now sharing a common install target so all you need is:

    [[symlink dirs]]
        [[[daint]]]
            run = /scratch/snx3000/leclairm/
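
To illustrate the lookup, here is a small sketch of the idea, not how Cylc implements it. The dictionaries and the symlink_run_dir function are hypothetical and simply mirror the config above: a platform maps to its install target, and the symlink dirs entry is found under that target.

```python
# Hypothetical in-memory mirror of the global.cylc settings in this thread.
PLATFORMS = {
    "daint": {"install target": "daint", "job runner": "slurm"},
    "daint_bg": {"install target": "daint", "job runner": "background"},
}
SYMLINK_DIRS = {
    "daint": {"run": "/scratch/snx3000/leclairm/"},
}


def symlink_run_dir(platform_name: str) -> str:
    """Find the symlinked run location via the platform's install target."""
    install_target = PLATFORMS[platform_name]["install target"]
    return SYMLINK_DIRS[install_target]["run"]


# Both platforms resolve to the same entry because they share one target.
print(symlink_run_dir("daint"))
print(symlink_run_dir("daint_bg"))
```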

Ah true, thank you.
Obviously, still need to get used to the install target concept …

Looking at your rsync issue, it turns out we’ve been using this for a long time without anyone reporting a problem. I wonder what’s different about your environment?
Can you confirm that it really is the $HOME causing the problem?
For example, try something like:

$ touch test_file
$ rsync test_file 'daint:$HOME'

Does that fail in the same way?
I’ve tested this on a variety of remote systems without any problem.

Yes it’s failing. If I don’t use quotes $HOME is getting expanded on the origin side:

❯ rsync test_file daint:$HOME                                                
rsync: mkstemp "/home/.matth.8OAzy0" failed: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1327) [sender=v3.2.5pre2-3-gcff8f044]

On the origin system, my home is /home/matth while on destination system, it’s /users/leclairm, hence the permission error for /home/....

If I use the quotes with rsync test_file 'daint:$HOME', the transfer completes, but not as intended: the destination file is literally named ‘$HOME’:

leclairm@daint102:~
> ls -lrt
...
-rw-r--r--  1 leclairm s83      0 Aug 26 19:23 '$HOME'
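
To spell out the three possible outcomes, here is a small sketch that only simulates the expansion. shell_expand is a hypothetical helper built on string.Template, a rough stand-in for what a shell does to a single argument, and the home directories are the ones from this thread.

```python
from string import Template

# Home directories as reported in this thread.
LOCAL_ENV = {"HOME": "/home/matth"}        # origin (laptop)
REMOTE_ENV = {"HOME": "/users/leclairm"}   # destination (daint)


def shell_expand(argument: str, env: dict) -> str:
    """Rough stand-in for a shell expanding $VARS in one argument."""
    return Template(argument).safe_substitute(env)


argument = "daint:$HOME"

# Unquoted: the local shell expands $HOME before rsync even starts.
unquoted = shell_expand(argument, LOCAL_ENV)

# Quoted, with expansion on the remote side: what the Cylc command relies on.
quoted_remote_expansion = shell_expand(argument, REMOTE_ENV)

# Quoted, with no expansion anywhere: what I observe, a file named '$HOME'.
quoted_no_expansion = argument

print(unquoted)
print(quoted_remote_expansion)
print(quoted_no_expansion)
```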

Maybe you could test the other way round and check whether omitting $HOME works on your systems. I’m fairly confident a relative destination path with rsync is relative to the destination home directory (and of course an absolute path is just kept untouched). At least for me

rsync test_file daint:

transfers test_file to my destination home directory in the same way as

rsync test_file daint:/users/leclairm

Agreed. I’m just curious why your system is behaving differently to all the other systems we’ve encountered.

I found out that running the test from another HPC system to daint works. Both systems share this version of rsync:

leclairm@daint105:~
> rsync --version
rsync  version 3.1.3  protocol version 31
Copyright (C) 1996-2018 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, prealloc, SLP

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.

while my laptop runs this version:

rsync  version v3.2.5pre2-3-gcff8f044  protocol version 31
Copyright (C) 1996-2022 by Andrew Tridgell, Wayne Davison, and others.
Web site: https://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, symlinks, symtimes, hardlinks, hardlink-specials,
    hardlink-symlinks, IPv6, atimes, batchfiles, inplace, append, ACLs,
    xattrs, optional protect-args, iconv, prealloc, stop-at, no crtimes
Optimizations:
    SIMD-roll, no asm-roll, openssl-crypto, no asm-MD5
Checksum list:
    xxh128 xxh3 xxh64 (xxhash) md5 md4 none
Compress list:
    zstd lz4 zlibx zlib none

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details

My OS is Manjaro, a rolling release distribution, which often ships newer software versions than, e.g., HPC systems. So I guess the issue will pop up on other systems as well sooner or later. Or there’s just a bug in my bleeding-edge software which will be fixed later; that’s sometimes the downside of rolling releases …

As stated in https://rsync.samba.org/, version 3.2.5 was released Aug 14th. The pre-release I’m using is from Aug 9th so I guess 3.2.5 will be installed soon and I’ll be able to check if the new behavior persists. Anyhow, it might still make sense to consider dropping the usage of $HOME in the rsync command.

Perhaps we need to document that better.

If you define a job platform with multiple hosts and a job runner that Cylc can use to interact with jobs, the install target is the one host that Cylc should target to install files and symlink run directories. (One only, because platform hosts see the same shared filesystem).

[And note, multiple platforms can be defined for the same shared filesystem.]

This same issue has just popped up for us at NCI. I wonder if a system package has changed or something; it was working fine before.

Adding to the platform definition

rsync command = rsync --old-args

appears to fix the issue

Thanks for the quick temporary fix!
Works for me as well.

@ScottWales - thanks for finding a workaround.
This appears to be related to a behaviour change in rsync 3.2.4:
https://download.samba.org/pub/rsync/NEWS#3.2.4
We’ll look at removing the use of $HOME/ in the rsync command.
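
In the meantime, here is a rough sketch for checking whether a given machine is likely to need the --old-args workaround. The function names are hypothetical, and it assumes that the rsync version on the machine that launches the transfer is what matters, which matches the reports in this thread (3.1.3 works, v3.2.5pre2 fails).

```python
import re
from typing import Optional, Tuple


def parse_rsync_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from the output of `rsync --version`."""
    # Handles both "version 3.1.3" and "version v3.2.5pre2-3-gcff8f044".
    match = re.search(r"version\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def probably_needs_old_args(version_output: str) -> bool:
    """Heuristic: the behaviour change is reported for rsync 3.2.4 onwards."""
    version = parse_rsync_version(version_output)
    return version is not None and version >= (3, 2, 4)


daint = "rsync  version 3.1.3  protocol version 31"
laptop = "rsync  version v3.2.5pre2-3-gcff8f044  protocol version 31"
print(probably_needs_old_args(daint))
print(probably_needs_old_args(laptop))
```

The strings passed in are the first lines of the rsync --version outputs posted above; on a real system they could be captured with subprocess.run(["rsync", "--version"], capture_output=True, text=True).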