Running on multiple PBS platforms

We are running our suite on a machine, cheyenne, which uses PBS, but we need to run some of the steps on another machine, casper, which until today used Slurm. This was all working fine, but then our system admins switched casper from Slurm to PBS and installed a different (newer) PBS version on casper than the one on cheyenne. As a result, they changed the qsub command for casper to qsubcasper. I have job submission working using this setting:
batch submit command template = qsubcasper -q casper -l walltime=01:00:00 -A P93300606 '%(job)s'

However, the suite fails upon completion of this job: the status still shows the job as submitted, so I think the problem is that I also need to override the qstat command in a similar way.

If Cylc is using qstat, then the override I need is “qstat @casper”.

I found pbs_multi_cluster and was hopeful that it would solve the issue, but so far no luck.
The job step completes according to 01/job.status:
CYLC_BATCH_SYS_NAME=pbs_multi_cluster
CYLC_BATCH_SYS_JOB_ID=41093.casper-pbs@casper-pbs
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2021-04-07T17:51:00-06:00
CYLC_JOB_PID=294384
CYLC_JOB_INIT_TIME=2021-04-07T23:51:02Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2021-04-07T23:51:02Z

However, I get a bunch of errors in job.out that look like they’re coming from somewhere in Cylc:

ERROR:root:code for hash md5 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type md5
ERROR:root:code for hash sha1 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type sha1
ERROR:root:code for hash sha224 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type sha224
ERROR:root:code for hash sha256 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type sha256
ERROR:root:code for hash sha384 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type sha384
ERROR:root:code for hash sha512 was not found.
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 147, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/hashlib.py", line 97, in __get_builtin_constructor
    raise ValueError('unsupported hash type ' + name)
ValueError: unsupported hash type sha512
2021-04-07T23:51:02Z INFO - started
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/bin/cylc-message", line 140, in <module>
    main()
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/bin/cylc-message", line 136, in main
    return record_messages(suite, task_job, messages)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/task_message.py", line 82, in record_messages
    'messages': messages})
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 275, in put_messages
    results = self._call_server(func_name, payload=payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 338, in _call_server
    return self.call_server_impl(url, method, payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 369, in call_server_impl
    return impl(url, method, payload)
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 477, in _call_server_impl_urllib2
    import ssl
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/ssl.py", line 98, in <module>
    import _ssl  # if we can't import it, let the error propagate
ImportError: libssl.so.1.0.0: cannot open shared object file: No such file or directory

Hi @Jim,

The form of the PBS job ID recorded in your job.status file seems to gel with the comments in the Cylc pbs_multi_cluster handler, so I think you’re on the right track.

However, qstat is not used routinely to track task job status unless you’re using the “polling” task communication method. With the default comms method, the job wrapper runs cylc message to send a message back over the network to the Cylc scheduler.

In the error traceback, I’m not sure where the initial hashlib problem is coming from, but halfway down it shows that your job is using cylc message, and that is what fails. This explains why the job status doesn’t get updated after “submitted”.

2021-04-07T23:51:02Z INFO - started
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/bin/cylc-message", line 140, in <module>
    main()
...
  File "/glade/u/apps/ch/opt/cylc/7.8.3/gnu/8.3.0/cylc-7.8.3/lib/cylc/network/httpclient.py", line 477, in _call_server_impl_urllib2
    import ssl
  File "/glade/u/apps/ch/opt/python/2.7.16/gnu/8.3.0/lib/python2.7/ssl.py", line 98, in <module>
    import _ssl  # if we can't import it, let the error propagate
ImportError: libssl.so.1.0.0: cannot open shared object file: No such file or directory

So it seems the SSL library is not installed or not accessible on the system?

https://cylc.github.io/cylc-doc/stable/html/installation.html
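
If it helps to confirm this independently of Cylc, a short Python check along the following lines could be run in the same environment the task job gets on that host. This is only a sketch: the function name check_openssl_modules is made up for illustration, and it assumes you invoke it with the same Python interpreter that Cylc uses there. It tries the ssl import from the traceback and the hashlib constructors that were reported missing.

```python
import importlib


def check_openssl_modules():
    """Return {check name: None if OK, else the error message}.

    Hypothetical helper for illustration; not part of Cylc.
    """
    results = {}

    # ssl needs the _ssl extension module, which links against libssl.
    try:
        importlib.import_module("ssl")
        results["ssl"] = None
    except ImportError as exc:
        results["ssl"] = str(exc)

    # hashlib itself still imports when its backend is missing, so
    # exercise some of the constructors named in the errors instead.
    import hashlib
    for algo in ("md5", "sha1", "sha256"):
        try:
            hashlib.new(algo)
            results["hashlib." + algo] = None
        except ValueError as exc:
            results["hashlib." + algo] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in sorted(check_openssl_modules().items()):
        status = "OK" if error is None else "FAILED: " + error
        print("%-16s %s" % (name, status))
```

If the ssl line reports the same libssl.so.1.0.0 message when run on the job host but not on the suite host, that would point to the job environment on that host rather than to Cylc itself.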

It may require admin help to get this fixed. In the meantime, you could try using pbs_multi_cluster and configuring Cylc to use job polling to track job status on that host.

Hilary

Thanks, Hilary. With the assistance of support staff, we were able to modify etc/job-init-env.sh to work correctly.
