I’m currently developing a new sequencing system for the Canadian Earth System Model and am experimenting with
cylc (specifically 8.0rc1), but am running into issues using the batch system on our internal systems.
After going through all the tutorials, I’m now trying to run some test jobs on our internal systems, which consist of four separate compute clusters that all share a home directory. As an added complication, on two of the machines we don’t have access to raw PBS commands even though PBS is the underlying queueing system - they use a wrapper around it (to allow for remote submission across machines), which is called
As such, for these machines I’ve created
[platform] entries like:
[platforms]
    [[ppp3]]
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl
    [[ppp4]]
        hosts = eccc1-ppp4, eccc2-ppp4, eccc3-ppp4
        job runner = jobctl
(storing this in
For my test flow, I’ve created this file:
from cylc.flow.job_runner_handlers.pbs import PBSHandler


class JOBCTLHandler(PBSHandler):
    """
    Given that jobctl just is a wrapper around PBS, we _should_ only need to
    alter the commands used as the PBS directives should be supported
    """
    POLL_CMD = "jobst"
    KILL_CMD_TMPL = "jobdel '%(job_id)s'"
    SUBMIT_CMD_TMPL = "jobsub '%(jobs)s'"


JOB_RUNNER_HANDLER = JOBCTLHandler()
and stored it at
~/cylc-src/batch-test/lib/python/jobctl.py (I still haven’t quite figured out where to store this so I don’t need to have it for every flow)
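One thing that may be worth checking in the handler above is the placeholder name in the submit template. A minimal sketch, assuming cylc fills these templates with ordinary `%`-style formatting and that the stock PBS handler's submit template is `qsub '%(job)s'` (so the key supplied would be `job`, not `jobs`). The `render` helper is hypothetical and only stands in for whatever cylc does internally:

```python
# Stand-in templates mirroring the handler's attributes. The "job" key is an
# assumption taken from the stock PBS handler's "qsub '%(job)s'" template.
SUBMIT_CMD_TMPL = "jobsub '%(job)s'"
KILL_CMD_TMPL = "jobdel '%(job_id)s'"


def render(template, **values):
    """Fill a %-style template from keyword arguments (hypothetical helper)."""
    return template % values


def missing_key(template, **values):
    """Return the name of the first placeholder that has no value, or None."""
    try:
        template % values
    except KeyError as exc:
        return exc.args[0]
    return None
```

With these definitions, `render(SUBMIT_CMD_TMPL, job="/path/to/job")` gives a usable command line, while `missing_key("jobsub '%(jobs)s'", job="/path/to/job")` reports `jobs` as an unfilled placeholder. If the assumption about the key name holds, the same mismatch would surface at submission time.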
[scheduler]
    UTC mode = True
[scheduling]
    initial cycle point = 2000-01-01T00
    final cycle point = +PT6H
    [[graph]]
        PT1H = """
            find_diamonds => sell_diamonds
            sell_diamonds[-PT1H] => find_diamonds
        """
[runtime]
    [[find_diamonds]]
        platform = ppp3
        script = find_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
    [[sell_diamonds]]
        platform = ppp3
        script = sell_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
After validating, installing, and playing this flow, I’m now running into a
REMOTE INIT FAILED error (seen in
~/cylc-run/batch-test/run4/log/workflow/log) - specifically, I see
COMMAND:
    ssh -oBatchMode=yes -oConnectTimeout=10 eccc1-ppp3 env \
        CYLC_VERSION=8.0rc1 bash --login -c 'exec "$0" "$@"' cylc \
        remote-init -v -v ppp3 $HOME/cylc-run/batch-test/run4
RETURN CODE: 0
STDOUT:
    cleaning up temporary directories
STDERR:
    cylc: line 0: exec: cylc: not found
which confuses me a little. I’m running cylc from within a conda environment, so I suppose it isn’t surprising that the remote machine doesn’t see the cylc command. Do I need to make the ssh command activate the environment?
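To confirm that the problem is the login shell's PATH on the remote host rather than anything specific to the workflow, the lookup can be reproduced outside of cylc. A minimal sketch, assuming passwordless ssh to the host; the helper names are hypothetical, and the ssh options simply mirror those in the logged command:

```python
import shlex
import subprocess


def build_cylc_lookup_cmd(host):
    """Build an ssh command that asks a remote login shell where cylc is."""
    # ssh joins its arguments and hands them to the remote shell for
    # re-parsing, so the remote command is quoted into a single string.
    remote = shlex.join(["bash", "--login", "-c", "command -v cylc"])
    return ["ssh", "-oBatchMode=yes", "-oConnectTimeout=10", host, remote]


def cylc_visible_on(host):
    """Return True if a login shell on `host` can find a cylc executable."""
    result = subprocess.run(
        build_cylc_lookup_cmd(host),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

Calling `cylc_visible_on("eccc1-ppp3")` from the scheduler host should return `False` while the conda environment is only activated interactively, and `True` once something on the remote login PATH resolves to `cylc`.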
My apologies if I’m missing something obvious!