Here’s a simple example that runs one local background task, followed by one remote background task on a platform that does not see the same filesystem:
# flow.cylc
[scheduling]
    [[graph]]
        R1 = "foo => bar"
[runtime]
    [[root]]
        script = "echo Hello World"
    [[foo]]  # local task
    [[bar]]  # remote task
        platform = my_remote
The remote job platform definition:
# global.cylc
[platforms]
    [[my_remote]]
        hosts = head1
        install target = head1  # (if omitted, this defaults to the platform name)
On running this:
$ cylc vip --no-detach --no-timestamp
-> $ cylc validate /home/oliverh/cylc-src/gug
Valid for cylc-8.3.0.dev
-> $ cylc install /home/oliverh/cylc-src/gug
INSTALLED gug/run4 from /home/oliverh/cylc-src/gug
-> $ cylc play --no-detach --no-timestamp gug/run4
▪ ■ Cylc Workflow Engine 8.3.0.dev
██ Copyright (C) 2008-2023 NIWA
▝▘ & British Crown (Met Office) & Contributors
INFO - Extracting job.sh to /home/oliverh/cylc-run/gug/run4/.service/etc/job.sh
INFO - Workflow: gug/run4
INFO - Scheduler: url=tcp://NIWA-1022450.niwa.local:43026 pid=6938
INFO - Workflow publisher: url=tcp://NIWA-1022450.niwa.local:43044
INFO - Run: (re)start number=1, log rollover=1
INFO - Cylc version: 8.3.0.dev
INFO - Run mode: live
INFO - Initial point: 1
INFO - Final point: 1
INFO - Cold start from 1
INFO - New flow: 1 (original flow from 1) 2023-08-29 15:17:07
INFO - [1/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [1/foo waiting job:00 flows:1] => waiting(queued)
INFO - [1/foo waiting(queued) job:00 flows:1] => waiting
INFO - [1/foo waiting job:01 flows:1] => preparing
INFO - [1/foo preparing job:01 flows:1] submitted to localhost:background[6958]
INFO - [1/foo preparing job:01 flows:1] => submitted
INFO - [1/foo submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/foo submitted job:01 flows:1] => running
INFO - [1/foo running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/foo running job:01 flows:1] => succeeded
INFO - [1/bar waiting(runahead) job:00 flows:1] => waiting
INFO - [1/bar waiting job:00 flows:1] => waiting(queued)
INFO - [1/bar waiting(queued) job:00 flows:1] => waiting
INFO - [1/bar waiting job:01 flows:1] => preparing
>>> INFO - platform: my_remote - remote init (on head1)
>>> INFO - platform: my_remote - remote file install (on head1)
>>> INFO - platform: my_remote - remote file install complete
INFO - [1/bar preparing job:01 flows:1] submitted to my_remote:background[386]
INFO - [1/bar preparing job:01 flows:1] => submitted
INFO - [1/bar submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
INFO - [1/bar submitted job:01 flows:1] => running
INFO - [1/bar running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
INFO - [1/bar running job:01 flows:1] => succeeded
INFO - Workflow shutting down - AUTOMATIC
INFO - platform: my_remote - remote tidy (on head1)
INFO - DONE
In the transcript above I’ve highlighted remote platform initialization with >>>. This should happen the first time the scheduler runs a task on the remote, if it has a different install target (i.e., if it does not see the same filesystem as the scheduler).
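If you'd rather not eyeball the log, here is a minimal sketch of the same check in Python. It assumes the scheduler log contains lines shaped like the "remote init" line in the transcript above; the function names (remote_init_events, was_initialised) are my own, not part of Cylc.

```python
import re

# Matches scheduler log lines such as:
#   INFO - platform: my_remote - remote init (on head1)
# search() is used (not match()) so that timestamped log lines also work.
REMOTE_INIT = re.compile(
    r"platform: (?P<platform>\S+) - remote init \(on (?P<host>[^)]+)\)"
)


def remote_init_events(log_text):
    """Return a (platform, host) pair for each remote init line found."""
    events = []
    for line in log_text.splitlines():
        found = REMOTE_INIT.search(line)
        if found:
            events.append((found.group("platform"), found.group("host")))
    return events


def was_initialised(log_text, platform):
    """True if the log records a remote init for the named platform."""
    return any(name == platform for name, _ in remote_init_events(log_text))
```

Feed it the text of the scheduler log (in Cylc 8 I'd expect that under log/scheduler/ in the run directory, but check your own installation). If was_initialised(text, "my_remote") is False after the remote task has been submitted, the platform was probably treated as sharing the scheduler's install target.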
If I screw up the platform definition by adding install target = localhost, then:
- I will not see the remote initialization lines in the scheduler log
- the remote task will appear to hang in the scheduler: the job can’t communicate its status back because the remote platform wasn’t initialized for the workflow
- the remote job error log shows:
$ cylc log -f e gug//1/bar
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 45: /root/cylc-run/gug/run2/.service/etc/job.sh: No such file or directory
/other/cylc-run/gug/run2/log/job/1/bar/01/job: line 46: cylc__job__main: command not found
Which looks suspiciously like your problem!
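To check your own job error log for the same symptoms, something like the following rough sketch should do. It only looks for the two messages shown above; the names SIGNATURES and uninitialised_platform_symptoms are hypothetical, and other failures could in principle produce similar lines.

```python
import re

# The two messages seen in job.err when the remote platform was never
# initialized for the workflow (taken from the error log above).
SIGNATURES = {
    "missing job.sh": re.compile(
        r"\.service/etc/job\.sh: No such file or directory"
    ),
    "cylc__job__main not found": re.compile(
        r"cylc__job__main: command not found"
    ),
}


def uninitialised_platform_symptoms(job_err_text):
    """Return the names of the known symptoms present in the job error log."""
    return [
        name
        for name, pattern in SIGNATURES.items()
        if pattern.search(job_err_text)
    ]
```

If both symptoms come back, the install target for that platform is the first thing I'd look at.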