Running on remote

I have a system that primarily uses PBS, but has a subset of nodes that use Slurm. These nodes are not binary compatible with the Cylc host nodes. The job submits and runs correctly, but Cylc tries to run on the compute node and crashes. Is there a way to set up polling for just that one job? Using [[hosts]] in the global config file doesn’t seem to do what I want, since that seems to want to ssh to the remote to submit the job.

[meta]
    title = "The cylc Hello World! suite"
[scheduling]
    [[dependencies]]
        graph = "hello => hello_casper"

[runtime]
    [[hello]]
        script = "sleep 10; echo Hello World!"
        [[[job]]]
            batch system = pbs
            batch submit command template = qsub -q regular -l walltime=01:00:00 -A NCGD0042 '%(job)s'
        [[[directives]]]
            -r = n
            -j = oe
            -V =
            -S = /bin/bash
            -l = select=1:ncpus=36:ompthreads=36

    [[hello_casper]]
        script = "sleep 10; echo Hello World from casper!"
        [[[job]]]
            batch system = slurm
        [[[directives]]]
            --ntasks = 1
            --cpus-per-task = 8
            --partition = dav

Hi Jim,

The job submits and runs correctly but cylc tries to run on the compute node and crashes.

I presume you mean the cylc message command crashes on the job host, when it tries to send job status messages back?

Cylc is written in Python so it should not matter that your Slurm host is not binary compatible with the Cylc host.

So: is cylc message failing on the job host just because it can’t send messages back (in which case, check your network configuration), OR is Cylc generally non-functional there because some Python library is missing, or something like that? Your job.err file should reveal which.
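
If it helps, you can view a task’s job.err from the suite host with cylc cat-log. A sketch, assuming Cylc 7’s -f file-selection option; "my_suite" is a placeholder for your registered suite name:

# Print job.err for task hello_casper at cycle point 1
cylc cat-log -f e my_suite hello_casper.1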

Using [[hosts]] in the global config file doesn’t seem to do what I want, since that seems to want to ssh to the remote to submit the job.

What exactly are you setting under the [[hosts]] heading? If you configure a task with a remote host and batch system Slurm, Cylc will ssh to that host and submit the job to Slurm there - but that is done via the task (or family) [[[remote]]] section in the suite configuration, not the [hosts] section of the global config.
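
For reference, per-task remote submission looks like this (a sketch in Cylc 7 suite.rc syntax; the host name is a placeholder):

[runtime]
    [[hello_casper]]
        [[[remote]]]
            host = casper-login1   # placeholder: a host with a Slurm client
        [[[job]]]
            batch system = slurm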

If you have a local Slurm client (i.e. on the Cylc host) then don’t specify a remote in the suite configuration - then it is a local job as far as Cylc is concerned, regardless of where Slurm executes it. (Note, however, that in this case you must have a shared filesystem between the Cylc host and the job hosts.)
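
So the local-submission variant is simply (again a sketch - note there is no [[[remote]]] section at all):

[runtime]
    [[hello_casper]]
        # Cylc runs sbatch locally; Slurm decides where the job executes.
        # This requires a shared filesystem between the Cylc and job hosts.
        [[[job]]]
            batch system = slurm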

Is there a way to set up polling for just that one job?

If you can’t get network comms back from a job host, you can configure job polling (per host) as the “task communication method” - see https://cylc.github.io/doc/built-sphinx-single/index.html#job-polling
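
In the Cylc 7 global config (global.rc) that looks roughly like this (a sketch; the host name and polling intervals are placeholders to adjust):

[hosts]
    [[casper-login1]]   # placeholder: the problem job host
        task communication method = poll
        # how often to poll while the job is submitted / executing
        submission polling intervals = PT1M
        execution polling intervals = PT1M, PT5M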

Hilary

ERROR:root:code for hash md5 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type md5

also this:

Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/bin/cylc-message", line 140, in <module>
    main()
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/task_message.py", line 82, in record_messages
    'messages': messages})
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 275, in put_messages
    results = self._call_server(func_name, payload=payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 338, in _call_server
    return self.call_server_impl(url, method, payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 369, in call_server_impl
    return impl(url, method, payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 477, in _call_server_impl_urllib2
    import ssl
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/ssl.py", line 98, in <module>
    import _ssl             # if we can't import it, let the error propagate
ImportError: libssl.so.1.0.0: cannot open shared object file: No such file or directory

Looks like you don’t have SSL installed - Python’s ssl module can’t find the libssl shared library.

See https://cylc.github.io/doc/built-sphinx/installation.html#third-party-software-packages

(And hashlib is related, see e.g. https://stackoverflow.com/questions/20399331/error-importing-hashlib-with-python-2-7-but-not-with-2-6)
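
A quick way to check whether the Python on the job host has working ssl and hashlib support (run this with whichever python Cylc picks up there):

python -c "import ssl, hashlib; print(ssl.OPENSSL_VERSION); print(hashlib.md5(b'x').hexdigest())"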

Okay, I now have a proper Cylc build on those nodes. How do I let Cylc know the correct path to that build? I am submitting from localhost, so using some [[remote]] option doesn’t seem to be the correct solution.

You need to install the cylc wrapper script in $HOME/bin, see
https://cylc.github.io/doc/built-sphinx/installation.html#local-user-installation
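
The wrapper is just a small script, early on your default PATH, that hands off to the real Cylc installation. A minimal sketch, using the 7.8.3 install path from your traceback (the documented wrapper is more elaborate - it can select among multiple installed versions):

#!/bin/bash
# ~/bin/cylc - minimal wrapper sketch; adjust CYLC_HOME to your build
CYLC_HOME=${CYLC_HOME:-/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3}
exec "$CYLC_HOME/bin/cylc" "$@"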

Thanks, that did it. But it doesn’t seem that Cylc is sourcing my .bashrc - I had to source it explicitly in my job-init-env.sh. The documentation your link points to seems to suggest that should happen automatically. Any idea why it doesn’t?

duh - never mind, I moved .bashrc to .bash_profile and now it’s working as expected.

Hi Jim,

That’s right - task job scripts are invoked as bash login shells, so you need to use .bash_profile. There are several other ways to configure the environment for Cylc on job hosts too, but bash login scripts are the best way these days.
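
If you’d rather keep your settings in .bashrc, the usual pattern (standard bash practice, nothing Cylc-specific) is to source it from .bash_profile:

# ~/.bash_profile - read by login shells, which skip ~/.bashrc
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi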

Hilary