I’m currently developing a new sequencing system for the Canadian Earth System Model and am experimenting with cylc (specifically 8.0rc1), but I’m running into issues using the batch system on our internal systems.
Background
After going through all the tutorials, I’m now trying to run some test jobs on our internal systems, which consist of four separate compute clusters that all share a home directory. As an added complication, on two of the machines we don’t have access to raw PBS commands, even though PBS is the underlying queueing system. Instead, they use a wrapper around it (to allow for remote submission across machines) called jobctl.
As such, for these machines I’ve created [platforms] entries like:
[platforms]
    [[ppp3]]
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl
    [[ppp4]]
        hosts = eccc1-ppp4, eccc2-ppp4, eccc3-ppp4
        job runner = jobctl
(storing this in ~/.cylc/flow/global.cylc).
For my test flow, I’ve created this file:
from cylc.flow.job_runner_handlers.pbs import PBSHandler


class JOBCTLHandler(PBSHandler):
    """
    Given that jobctl is just a wrapper around PBS, we _should_
    only need to alter the commands used, as the PBS directives
    should be supported.
    """
    POLL_CMD = "jobst"
    KILL_CMD_TMPL = "jobdel '%(job_id)s'"
    # Key is 'job' (not 'jobs'), matching the stock PBS handler's "qsub '%(job)s'"
    SUBMIT_CMD_TMPL = "jobsub '%(job)s'"


JOB_RUNNER_HANDLER = JOBCTLHandler()
and stored it at ~/cylc-src/batch-test/lib/python/jobctl.py (I still haven’t quite figured out where to store this so that I don’t need a copy in every flow).
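As a sanity check on the command templates, here is a minimal sketch of how I understand they get expanded. It assumes Cylc fills them with old-style `%` formatting using the keys `job` and `job_id` (as the stock PBS handler's `qsub '%(job)s'` suggests) and then splits the result into an argument list. The `expand` helper and the example path and job ID are hypothetical, and nothing here imports Cylc.

```python
import shlex

# Stand-ins for the templates on the handler class; no Cylc import needed.
KILL_CMD_TMPL = "jobdel '%(job_id)s'"
SUBMIT_CMD_TMPL = "jobsub '%(job)s'"


def expand(template, **values):
    """Fill a %-style template and split it into an argv list.

    Assumption: this mirrors what Cylc does with the handler templates.
    A template key that is not supplied raises KeyError.
    """
    return shlex.split(template % values)


# Hypothetical job script path and job ID, purely for illustration.
submit_argv = expand(SUBMIT_CMD_TMPL, job="/path/to/job")
kill_argv = expand(KILL_CMD_TMPL, job_id="12345")

print(submit_argv)  # ['jobsub', '/path/to/job']
print(kill_argv)    # ['jobdel', '12345']
```

If that assumption holds, a misspelled key in a template would fail with a KeyError at submission time rather than when the flow is validated.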
Then I defined ~/cylc-src/batch-test/flow.cylc as:
[scheduler]
    UTC mode = True
[scheduling]
    initial cycle point = 2000-01-01T00
    final cycle point = +PT6H
    [[graph]]
        PT1H = """
            find_diamonds => sell_diamonds
            sell_diamonds[-PT1H] => find_diamonds
        """
[runtime]
    [[find_diamonds]]
        platform = ppp3
        script = find_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
    [[sell_diamonds]]
        platform = ppp3
        script = sell_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
The Error
After validating, installing, and playing this flow, I’m now running into a REMOTE INIT FAILED error (seen in ~/cylc-run/batch-test/run4/log/workflow/log). Specifically, I see:
COMMAND:
ssh -oBatchMode=yes -oConnectTimeout=10 eccc1-ppp3 env \
CYLC_VERSION=8.0rc1 bash --login -c 'exec "$0" "$@"' cylc \
remote-init -v -v ppp3 $HOME/cylc-run/batch-test/run4
RETURN CODE:
0
STDOUT:
cleaning up temporary directories
STDERR:
cylc: line 0: exec: cylc: not found
which confuses me. Given that I’m running cylc from within a conda environment, I suppose it isn’t surprising that the remote machine doesn’t see the cylc command. Do I need to make the ssh command activate the environment?
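To check this independently of Cylc, a small script could ask the remote login shell whether it can find cylc at all. This is only a sketch: it reuses the ssh options and the `bash --login -c` form from the logged command above, the function names are hypothetical, and it assumes passwordless ssh to the host already works.

```python
import shlex
import subprocess


def build_check_cmd(host):
    """Return an ssh argv that asks the remote login shell where cylc is."""
    # 'command -v' prints the resolved path and exits non-zero if not found.
    remote = "bash --login -c " + shlex.quote("command -v cylc")
    return ["ssh", "-oBatchMode=yes", "-oConnectTimeout=10", host, remote]


def remote_has_cylc(host):
    """Run the check; return (found, detail) where detail is stdout or stderr."""
    result = subprocess.run(
        build_check_cmd(host), capture_output=True, text=True
    )
    detail = result.stdout.strip() or result.stderr.strip()
    return result.returncode == 0, detail
```

Calling `remote_has_cylc("eccc1-ppp3")` from the scheduler host should return `False` as the first element if the login shell on that host does not have cylc on its PATH, which would match the "exec: cylc: not found" message in the log.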
My apologies if I’m missing something obvious!