Cylc8 migration issues

Here’s a simple example that runs one local background task, followed by one remote background task on a platform that does not see the same filesystem:

# flow.cylc
[scheduling]
    [[graph]]
        R1 = "foo => bar"
[runtime]
    [[root]]
        script = "echo Hello World"
    [[foo]]  # local task
    [[bar]]  # remote task
        platform = my_remote

The remote job platform definition:

# global.cylc
[platforms]
    [[my_remote]]
        hosts = head1
        install target = head1  # (if omitted, this defaults to the platform name)

On running this:

$ cylc vip --no-detach --no-timestamp 

-> $ cylc validate /home/oliverh/cylc-src/gug
Valid for cylc-8.3.0.dev
-> $ cylc install /home/oliverh/cylc-src/gug
INSTALLED gug/run4 from /home/oliverh/cylc-src/gug
-> $ cylc play --no-detach --no-timestamp gug/run4

 ▪ ■  Cylc Workflow Engine 8.3.0.dev
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/oliverh/cylc-run/gug/run4/.service/etc/job.sh
INFO - Workflow: gug/run4
INFO - Scheduler: url=tcp://NIWA-1022450.niwa.local:43026 pid=6938
INFO - Workflow publisher: url=tcp://NIWA-1022450.niwa.local:43044
INFO - Run: (re)start number=1, log rollover=1
INFO - Cylc version: 8.3.0.dev
INFO - Run mode: live
INFO - Initial point: 1
INFO - Final point: 1
INFO - Cold start from 1
INFO - New flow: 1 (original flow from 1) 2023-08-29 15:17:07
INFO - [1/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [1/foo waiting job:00 flows:1] => waiting(queued)
INFO - [1/foo waiting(queued) job:00 flows:1] => waiting
INFO - [1/foo waiting job:01 flows:1] => preparing
INFO - [1/foo preparing job:01 flows:1] submitted to localhost:background[6958]
INFO - [1/foo preparing job:01 flows:1] => submitted
INFO - [1/foo submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/foo submitted job:01 flows:1] => running
INFO - [1/foo running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/foo running job:01 flows:1] => succeeded
INFO - [1/bar waiting(runahead) job:00 flows:1] => waiting
INFO - [1/bar waiting job:00 flows:1] => waiting(queued)
INFO - [1/bar waiting(queued) job:00 flows:1] => waiting
INFO - [1/bar waiting job:01 flows:1] => preparing
>>> INFO - platform: my_remote - remote init (on head1)
>>> INFO - platform: my_remote - remote file install (on head1)
>>> INFO - platform: my_remote - remote file install complete
INFO - [1/bar preparing job:01 flows:1] submitted to my_remote:background[386]
INFO - [1/bar preparing job:01 flows:1] => submitted
INFO - [1/bar submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/bar submitted job:01 flows:1] => running
INFO - [1/bar running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/bar running job:01 flows:1] => succeeded
INFO - Workflow shutting down - AUTOMATIC
INFO - platform: my_remote - remote tidy (on head1)
INFO - DONE

In the transcript above I’ve highlighted the remote platform initialization lines with >>>. Initialization should happen the first time the scheduler runs a task on the remote platform, provided that platform has a different install target from the scheduler (i.e., it does not see the same filesystem as the scheduler).
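For contrast, here is a minimal sketch of a platform that does share the scheduler's filesystem. The platform name and host are hypothetical and only there for illustration. In that case install target = localhost is the right setting, and no remote init is expected:

```cylc
# global.cylc (hypothetical platform name and host, for illustration only)
[platforms]
    [[my_shared_fs_cluster]]
        hosts = login1
        # Same filesystem as the scheduler, so no remote file install is needed
        install target = localhost
```

The install target says which filesystem a platform sees, so it only makes sense to point it at localhost when the job hosts really can read the scheduler's cylc-run directory.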

If I screw up the my_remote platform definition by setting install target = localhost, then:

  • I will not see the remote initialization lines in the scheduler log
  • the remote task will appear to hang in the scheduler - the job can’t communicate its status back because the remote platform wasn’t initialized for the workflow
  • the remote job error log shows:
$ cylc log -f e gug//1/bar
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 45: /root/cylc-run/gug/run2/.service/etc/job.sh: No such file or directory
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 46: cylc__job__main: command not found

Which looks suspiciously like your problem!
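For reference, the misconfigured definition I used to reproduce this is sketched below. It is the same my_remote platform as above with only the install target changed, so check your own global.cylc for something similar:

```cylc
# global.cylc (deliberately broken, to reproduce the error above)
[platforms]
    [[my_remote]]
        hosts = head1
        # Wrong: head1 does not share the scheduler's filesystem,
        # so the workflow files never get installed there
        install target = localhost
```

Changing the install target back to head1, or omitting it so that it defaults to the platform name, should bring back the remote init lines in the scheduler log.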