Remote submission - cylc: line 0: exec: cylc: not found

I’m currently developing a new sequencing system for the Canadian Earth System Model and am experimenting with cylc (specifically 8.0rc1), but am running into issues using the batch system on our internal systems.

Background

After going through all the tutorials, I’m now trying to run some test jobs on our internal clusters, which consist of four separate compute clusters that all share a home directory. As an added complication, on two of the machines we don’t have access to raw PBS commands, even though PBS is the underlying queueing system. Instead, they use a wrapper around it called jobctl (to allow for remote submission across machines).

As such, for these machines I’ve created [platforms] entries like:

[platforms]
    [[ppp3]]
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl
    [[ppp4]]
        hosts = eccc1-ppp4, eccc2-ppp4, eccc3-ppp4
        job runner = jobctl

(storing this in ~/.cylc/flow/global.cylc).

For my test flow, I’ve created this file:

from cylc.flow.job_runner_handlers.pbs import PBSHandler

class JOBCTLHandler(PBSHandler):
    """
        Given that jobctl just is a wrapper around PBS, we _should_
        only need to alter the commands used as the PBS directives
        should be supported
    """
    POLL_CMD        = "jobst"
    KILL_CMD_TMPL   = "jobdel '%(job_id)s'"
    SUBMIT_CMD_TMPL = "jobsub '%(job)s'"

JOB_RUNNER_HANDLER = JOBCTLHandler()

and stored it at ~/cylc-src/batch-test/lib/python/jobctl.py (I still haven’t quite figured out where to store this so I don’t need to have it for every flow)
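
To make the role of these command templates concrete, here is a minimal standalone sketch of how %-style templates like the ones above expand into an argument list. It does not import Cylc: the `expand` helper, the sample job file path, and the sample job ID are all hypothetical, and it assumes the caller fills the placeholders by the names `job` and `job_id`, which is my reading of the stock PBS handler's templates.

```python
import shlex

# Templates in the style of a job runner handler, using the jobctl
# commands from the post. The placeholder names are an assumption.
KILL_CMD_TMPL = "jobdel '%(job_id)s'"
SUBMIT_CMD_TMPL = "jobsub '%(job)s'"


def expand(template, **values):
    """Fill a %-style template by name and split it into an argv list."""
    return shlex.split(template % values)


# Hypothetical job file path and job ID, for illustration only.
submit_argv = expand(SUBMIT_CMD_TMPL, job="/tmp/demo/job")
kill_argv = expand(KILL_CMD_TMPL, job_id="12345.host")

# A placeholder name that the caller does not supply fails with KeyError.
try:
    expand("jobsub '%(jobs)s'", job="/tmp/demo/job")
    mismatch_error = None
except KeyError as exc:
    mismatch_error = exc.args[0]
```

The last part shows why the placeholder names matter: a template that asks for a key the caller never provides raises `KeyError` before any command runs.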

Then I defined ~/cylc-src/batch-test/flow.cylc:

[scheduler]
    UTC mode = True

[scheduling]

    initial cycle point = 2000-01-01T00
    final cycle point = +PT6H
    [[graph]]
        PT1H = """
            find_diamonds => sell_diamonds
            sell_diamonds[-PT1H] => find_diamonds
        """

[runtime]

    [[find_diamonds]]
        platform = ppp3
        script = find_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development

    [[sell_diamonds]]
        platform = ppp3
        script = sell_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development

The Error

After validating, installing, and playing this flow, I’m now running into a REMOTE INIT FAILED error (seen in ~/cylc-run/batch-test/run4/log/workflow/log). Specifically, I see:

    COMMAND:
        ssh -oBatchMode=yes -oConnectTimeout=10 eccc1-ppp3 env \
            CYLC_VERSION=8.0rc1 bash --login -c 'exec "$0" "$@"' cylc \
            remote-init -v -v ppp3 $HOME/cylc-run/batch-test/run4
    RETURN CODE:
        0
    STDOUT:
        cleaning up temporary directories
    STDERR:
        cylc: line 0: exec: cylc: not found        

which confuses me at first glance. That said, given that I’m controlling cylc from within a conda environment, it doesn’t surprise me that the remote machine doesn’t see the cylc command. Do I need to make the ssh command activate the environment?
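
One way to confirm this diagnosis is to ask the remote login shell directly whether it can find cylc. The sketch below only builds the ssh command, reusing the ssh options from the log above. The `build_probe_command` helper is hypothetical and not part of Cylc.

```python
import shlex


def build_probe_command(host, executable="cylc"):
    """Return an ssh argv that asks a remote login shell where `executable` is."""
    # `command -v` prints the resolved path, or exits non-zero if not found.
    remote = "bash --login -c " + shlex.quote("command -v " + shlex.quote(executable))
    return ["ssh", "-oBatchMode=yes", "-oConnectTimeout=10", host, remote]


probe = build_probe_command("eccc1-ppp3")
print(" ".join(shlex.quote(part) for part in probe))
```

Passing `probe` to `subprocess.run(probe, capture_output=True, text=True)` runs the check. Empty stdout with a non-zero return code would mean the login shell's PATH on that host does not include cylc.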

My apologies if I’m missing something obvious!

Ahhh, I think I was missing something obvious: I need to define cylc_path for the platforms!

For anyone else who comes across this error, I was able to get around it by setting cylc_path in my platform configuration, i.e.

cylc_path = /path/to/conda/envs/cylc-8.0rc1/bin

I’m now running into this error:

Unexpected key directory exists: /home/rcs001/cylc-run/batch-test/run9/.service/client_public_keys 
Check global.cylc install target is configured correctly for this platform.

I’ll open a new topic for this if I can’t figure it out (I’ve figured out the solution shortly after posting for my previous topics… so I’ll give it some time :) )

This question has been addressed here: Setting up a Platform Config for machines that share a home directory - #2 by hilary.j.oliver

Given that I’m controlling cylc from within a conda environment, it doesn’t surprise me that the remote machine doesn’t see the cylc command. Do I need to make the ssh command activate the environment?

Ahhh, I think I was missing something obvious: I need to define cylc_path [to the Cylc conda environment] for the platforms!

That is one valid solution to the problem, although it only works if you have a single version of Cylc deployed, which I’m sure is fine for you at the moment.

Alternatively, you can put the path to Cylc in login scripts so that it is automatically available on job hosts when the scheduler initiates processes there via ssh.

More generally, Cylc is designed to work even if you have multiple versions of Cylc being used at once, even under the same user account (imagine you have some long-running workflows that you don’t want to upgrade yet). For this, you can use a wrapper script (called cylc) that invokes the right Cylc version for jobs (e.g. the right conda environment) according to Cylc variables set by the scheduler. Then, only that wrapper needs to be in the default path (on the scheduler and job hosts).
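
As a rough illustration of the wrapper idea, here is a minimal Python sketch of the version-selection step only. It is not the wrapper packaged with Cylc. The `resolve_cylc` function, the `envs_root` argument, and the `cylc-<version>` directory naming are assumptions modelled on the conda path quoted earlier in this thread, and `CYLC_VERSION` is the variable visible in the ssh command from the log.

```python
from pathlib import Path


def resolve_cylc(environ, envs_root, default_version):
    """Pick the cylc executable for the version named in CYLC_VERSION.

    Falls back to default_version when the variable is unset or empty.
    The "cylc-<version>" environment naming is an assumption.
    """
    version = environ.get("CYLC_VERSION") or default_version
    return Path(envs_root) / f"cylc-{version}" / "bin" / "cylc"


# With the variable set, as in the remote-init command from the log.
chosen = resolve_cylc({"CYLC_VERSION": "8.0rc1"}, "/path/to/conda/envs", "8.0rc1")

# Without it (e.g. a user typing `cylc` by hand), use the default.
fallback = resolve_cylc({}, "/path/to/conda/envs", "8.0rc1")
```

A real wrapper would then replace itself with the chosen executable, for example via `os.execv(str(chosen), [str(chosen), *sys.argv[1:]])`, passing `os.environ` as the first argument to `resolve_cylc`.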

If you want to adapt the wrapper packaged with Cylc, which at this stage is mainly internally documented, this shows how to extract it: Installation — Cylc 8.2.2 documentation

Thanks a lot for the quick feedback @hilary.j.oliver - definitely appreciated. Yeah, for now this works, but once I get the cylc sequencing system working for us, I’ll need to think about how to deploy it across users, and I’ll probably need something like the solution you laid out, so thanks!