Cylc 8.0b1 task communication fails on PBS

Hello,

I’ve noticed that task/job states do not seem to update in my runs on Cheyenne. Whether jobs fail or succeed, they keep showing up in cylc scan until I end them manually. I also found that if I submit a suite with a simple dependency like “foo => bar”, bar will not fire even though foo ended successfully. If I manually stop the suite and then ‘cylc play’ it again, it then recognizes that foo completed and runs bar.

As per usual, I’m not sure if this is a bug or an error I’ve made in configuring the suite. :slight_smile: Any help is appreciated!

If task communications are failing, the job.err file might offer some insights if you’re able to get hold of it. When the workflow restarts, Cylc looks at the jobs’ job.status files to see if anything changed whilst the scheduler wasn’t running, which is why the task statuses update on restart.
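
For illustration, a job.status file is a simple key=value record written as the job runs; it looks roughly like this (the key names here are from memory of the Cylc 8 job scripts, so treat them as an assumption; the PBS job ID is the one that appears later in this thread):

CYLC_JOB_RUNNER_NAME=pbs
CYLC_JOB_ID=8632379.chadmin1.ib0.cheyenne.ucar.edu
CYLC_JOB_INIT_TIME=2021-06-08T09:04:41-06:00
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2021-06-08T09:19:42-06:00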

The default task communication method (zmq) requires the ability to open a TCP connection from the job node to the scheduler host without going through any interactive prompts. I don’t know if this is an option on Cheyenne with 2FA?
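
A quick sanity check, assuming you can get an interactive shell on a job node: try opening a raw TCP connection back to the scheduler. The host and port below are illustrative (they match the style of the contact file shown later in this thread); substitute the values from your own workflow’s contact file:

# Run on a job/compute node; succeeds silently if the route is open.
python -c "import socket; socket.create_connection(('cheyenne5.cheyenne.ucar.edu', 43042), timeout=5); print('TCP OK')"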

If TCP comms aren’t an option you can fall back to polling, where Cylc looks at the job.status file at configured intervals. Example global configuration:

[platforms]
    [[cheyenne]]
        hosts = localhost  # ?
        communication method = poll
        submission polling intervals = 5*PT1M, 10*PT5M  # edit to something suitable
        execution polling intervals = 5*PT1M, 10*PT5M  # edit to something suitable

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[platforms][<platform%20name>]communication%20method

Thanks! I’ll put in a hello => world version of this script and see what job.err shows. Do you know if the communication mechanics changed between 7 and 8? According to @Jim, Cylc 7 communication works on Cheyenne without much issue, so this is either an 8 problem or a Ben problem. :wink:

We did change the communication method: we swapped HTTPS for ZMQ. Both are really just protocols for speaking TCP (and do so on the same port range), so technically speaking there is very little change from Cylc 7 to 8. However, maybe there’s something different about how they are handled on Cheyenne?

Some updates:

The suite runs as expected when not submitted to the batch scheduler - both jobs run and exit without error.

When submitting to the batch scheduler, the first job (hello) runs and exits without error. The expected output is present in job.out and job.err is empty.

Looking at the suite log, it looks like the default polling interval is 15 minutes, and I am not seeing any errors so far, so maybe I was just impatient the first time I submitted this. I will let a few 15-minute intervals play out and see if the “world” job fires.

Edit: That seems to have been the issue for the dependent job - I had to wait 15 minutes for the polling interval to complete. Is that expected? My impression was that the default behavior was to update immediately, but I could have that wrong.

I’m still not sure that the suite status is updating properly when a job fails but I will explore that separately.

To be pedantic (for clarity, I hope!):

  • in Cylc 8 we’re changing terminology from “suite” to the more widely understood “workflow”
  • we configure a workflow to submit (some or all of its) jobs to PBS; we’re not submitting the workflow itself (or, the Cylc scheduler that runs it) to PBS

Once a task job is executing, it doesn’t matter (e.g. for Cylc job status updates) whether it was launched by PBS or not. Presumably the relevant difference here is that PBS runs the job on another host rather than on localhost (with respect to the Cylc scheduler).

Is that with communication method = poll as suggested above by @oliver.sanders, or not?

To update job status, the default (and preferred) communication method is direct messaging (over TCP) from the job back to the Cylc scheduler. If the right network channels aren’t open for that you should see message send failures in the task’s job.err file.

The alternative to task messaging is job polling, where the Cylc scheduler queries the job host at intervals to determine job status. As you’ve discovered, there is a hardwired default job poll interval of 15 minutes, but that’s really intended for last-resort checking in case of temporary TCP comms failures.

You can also configure the polling interval to be whatever you like (see @oliver.sanders’ example above) so you don’t have to wait for 15 minutes. If you actually set communication method = poll for a platform, Cylc won’t even bother trying the direct TCP method, just polling.

Both TCP messaging and job polling should correctly update for both job success and failure (and custom task message outputs as well).
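
For illustration, custom messages go through the same cylc message interface that reports started/succeeded. The workflow and job IDs below follow the form seen in the job.xtrace excerpt later in this thread; the message text itself is made up:

# Sent from inside a job script; reports a custom output to the scheduler.
cylc message -- debug_comms/run7 1/hello/01 "file ready"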

Clarity is good and appreciated. :slight_smile: One thing I have learned in working on workflows is that practically everyone uses similar terminology to mean (sometimes not-so-slightly) different things. I’ll try to be careful in my usage and follow these definitions.

It is using the default settings from installing with conda - I have not added the “communication method = poll” line to my global.cylc.

I am not seeing any error messages in the job.err file.

So this appears to be what is going on in my case. The lack of error messages in job.err makes me wonder whether the direct TCP method has been turned off for some reason. Is there some setting I can check to try to verify this?

Ain’t that the truth.

Try printing the global config for your platform as configured in your global.cylc file:

$ cylc config --sparse -i "[platforms][<your-platform-name>]"

The --sparse option prints only explicitly set options, which could come from a central global config or your user one, but not the built-in defaults for all settings.

If you see communication method = poll then the default (zmq) has been turned off. However, I’m guessing there’s no Cylc 8 central config on your system yet, and you’d know if you’d set this yourself. So this is a bit mysterious.

Can you try running with cylc play --debug and see if anything ends up in the job.err file, or if job.xtrace shows cylc message being called by the job?

cylc config shows ‘communication method = zmq’, so that isn’t where the problem is coming from. HOWEVER, running with --debug does turn up errors:

Sending DEBUG MODE xtrace to job.xtrace
 Traceback (most recent call last):
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/task_message.py", line 108, in send_messages
     pclient = get_client(suite)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client_factory.py", line 52, in get_client
     return get_runtime_client(get_comms_method(), workflow, timeout=timeout)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client_factory.py", line 47, in get_runtime_client
     return SuiteRuntimeClient(workflow, timeout=timeout)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client.py", line 126, in __init__
     host, port, _ = get_location(suite)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/__init__.py", line 81, in get_location
     host = get_fqdn_by_host(host)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 248, in get_fqdn_by_host
     return HostUtil.get_inst().get_fqdn_by_host(target)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 154, in get_fqdn_by_host
     return self._get_host_info(target)[0]
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 118, in _get_host_info
     self._host_exs[target] = socket.gethostbyname_ex(target)
 socket.gaierror: [Errno -2] Name or service not known: 'cheyenne5.cheyenne.ucar.edu'
 /bin/sh: BASH_XTRACEFD: 19: invalid value for trace file descriptor
 Traceback (most recent call last):
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/task_message.py", line 108, in send_messages
     pclient = get_client(suite)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client_factory.py", line 52, in get_client
     return get_runtime_client(get_comms_method(), workflow, timeout=timeout)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client_factory.py", line 47, in get_runtime_client
     return SuiteRuntimeClient(workflow, timeout=timeout)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/client.py", line 126, in __init__
     host, port, _ = get_location(suite)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/network/__init__.py", line 81, in get_location
     host = get_fqdn_by_host(host)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 248, in get_fqdn_by_host
     return HostUtil.get_inst().get_fqdn_by_host(target)
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 154, in get_fqdn_by_host
     return self._get_host_info(target)[0]
   File "/glade/u/home/bcash/.conda/envs/cylc8b1_v2/lib/python3.7/site-packages/cylc/flow/hostuserutil.py", line 118, in _get_host_info
     self._host_exs[target] = socket.gethostbyname_ex(target)
 socket.gaierror: [Errno -2] Name or service not known: 'cheyenne5.cheyenne.ucar.edu'

In addition to this error, when the ‘hello’ and ‘world’ tasks start up the following error is displayed in my terminal:

(cylc8b1_v2) bcash@cheyenne5:/glade/work/bcash/cylc/debug_comms> X11 connection rejected because of wrong authentication.

If I ping cheyenne5 from a login node, ‘cheyenne5.cheyenne.ucar.edu’ is the name it returns, so the name isn’t malformed or something. Hopefully something in this is helpful!

EDIT: job.xtrace shows both:

 +[20210608T064901-0600]bcash@r2i6n23 cylc message -- debug_comms/run7 1/hello/01 started

and later:

 +[20210608T064941-0600]bcash@r2i6n23 cylc message -- debug_comms/run7 1/hello/01 succeeded

socket.gaierror: [Errno -2] Name or service not known: ‘cheyenne5.cheyenne.ucar.edu’

If I ping cheyenne5 from a login node ‘cheyenne5.cheyenne.ucar.edu’ is the name it returns

I think that means that the fully-qualified domain name (FQDN) of the login node (cheyenne5.cheyenne.ucar.edu) does not resolve on the compute node where the job is run.

I would expect this to cause the same issue for Cylc 8 as for Cylc 7, so it might be worth digging through the Cylc 7 global config to see if anything was done to handle this - namely the host self-identification setting, which may be set to “address” to bypass the issue?
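
One way to confirm is to reproduce, from a compute node, exactly the socket.gethostbyname_ex call that failed in the traceback:

# Run on a compute node; if DNS is the problem this raises socket.gaierror.
python -c "import socket; print(socket.gethostbyname_ex('cheyenne5.cheyenne.ucar.edu'))"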

X11 connection rejected because of wrong authentication.

That suggests the job is trying to open a graphical (X11) connection. Cylc shouldn’t attempt to do this; I sometimes see this from things like Python’s matplotlib.

In the Cylc 7 config I found:

[suite host self-identification]
    host =
    method = address
    target = google.com

And in Cylc 8:

    [[host self-identification]]
        method = name
        target = google.com
        host =

Seems like that could be the culprit?

Sounds like a red herring then - I’ll look at the job again and see if it is trying to plot something.

Looks like that’s the ticket!

Add [scheduler][host self-identification]method = address to your Cylc 8 global config to match the Cylc 7 one.

Explanation: the DNS configuration on Cheyenne is problematic for TCP comms (the login nodes’ FQDNs don’t resolve elsewhere on the network, e.g. on the compute nodes), so Cylc 7 was configured to fall back to IP addresses (determined via a UDP socket) rather than hostnames.
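
For reference, a sketch of the corresponding global.cylc entry (section and setting names as above; the target is carried over from the Cylc 7 config):

# global.cylc (Cylc 8)
[scheduler]
    [[host self-identification]]
        # identify the scheduler host by IP address rather than by name,
        # since its FQDN doesn't resolve on the compute nodes
        method = address
        target = google.com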

It still seems like there is a long wait for the second job to fire, but I am still poking around to see what is going on. In the meantime, is this error message worth noting? It appears in job.err:

/bin/sh: BASH_XTRACEFD: 19: invalid value for trace file descriptor

For some reason my second job failed to run with:

2021-06-08T09:19:43-06:00 INFO - [world.1] status=submitted: (polled)submission failed at  2021-06-08T09:04:41-06:00  for job(01) flow(r)
2021-06-08T09:19:43-06:00 ERROR - [world.1] -submission failed
2021-06-08T09:19:43-06:00 DEBUG - [world.1] -submitted => submit-failed
2021-06-08T09:19:43-06:00 DEBUG - BEGIN TASK PROCESSING
2021-06-08T09:19:43-06:00 DEBUG - END TASK PROCESSING (took 0.00020265579223632812 seconds)
2021-06-08T09:19:44-06:00 WARNING - Suite stalled with unhandled failed tasks:
     * world.1 (submit-failed)

This incidentally provides a demo of a failed job not triggering the workflow to end: if I run cylc scan the workflow still appears, and cylc get-suite-contact shows the following:

(cylc8b1_v2) bcash@cheyenne5:~/cylc-run/debug_comms/run9/log/suite> cylc get-suite-contact debug_comms/run9
CYLC_API=5
CYLC_SUITE_HOST=128.117.211.248
CYLC_SUITE_NAME=debug_comms/run9
CYLC_SUITE_OWNER=bcash
CYLC_SUITE_PORT=43042
CYLC_SUITE_PROCESS=5242 /glade/u/home/bcash/.conda/envs/cylc8b1_v2/bin/python /glade/u/home/bcash/.conda/envs/cylc8b1_v2/bin/cylc play debug_comms/run9 --debug
CYLC_SUITE_PUBLISH_PORT=43087
CYLC_SUITE_RUN_DIR_ON_SUITE_HOST=/glade/u/home/bcash/cylc-run/debug_comms/run9
CYLC_SUITE_UUID=6ed0bcc7-19b4-4f5b-b1b6-679471318290
CYLC_VERSION=8.0b1
SCHEDULER_CYLC_PATH=None
SCHEDULER_SSH_COMMAND=ssh -oBatchMode=yes -oConnectTimeout=10
SCHEDULER_USE_LOGIN_SHELL=True

It still seems like there is a long wait for the second job to fire

If the job is in the “submitted” state then that means it is in the PBS queue.

If the job is in the “preparing” state then it means Cylc is preparing the job submission. This should be fairly quick (especially if another task has already run on the same platform). It could hang (and time out) due to SSH issues, especially if the platform’s “hosts” or “install target” settings are misconfigured.
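
For context, a sketch of the platform settings being referred to; the values here are illustrative, not taken from this system:

# global.cylc - illustrative platform definition
[platforms]
    [[cheyenne]]
        hosts = cheyenne5.cheyenne.ucar.edu  # host(s) the scheduler contacts via SSH
        install target = cheyenne  # filesystem grouping for remote file installation
        job runner = pbs  # hand task jobs to PBS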

This job failed to submit and entered the submit-failed state.

If the first job submitted successfully, this is likely to do with the task configuration in the flow.cylc file. The job-activity.log file (on the login node) will contain some information which might help debug the issue.

This is allowing for a demo of the failed job not triggering the workflow to end

The workflow did not shut itself down because of the submit-failed job; this is correct. It stays up to allow automatic or manual retries to remedy the issue.
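
If you want automatic retries, something like this in the task’s runtime section should do it (a sketch; the delay list is arbitrary):

# flow.cylc - retry a failed submission up to 3 times, 1 minute apart
[runtime]
    [[world]]
        submission retry delays = 3*PT1M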

/bin/sh: BASH_XTRACEFD: 19: invalid value for trace file descriptor

A strange side-effect of debug mode on certain platforms. You can safely ignore this.

The job-activity.log in ~/cylc-run/debug_comms/run9/log/job/1/world/01, where world is my failed task, shows:

[jobs-submit ret_code] 0
[jobs-submit out] 2021-06-08T09:04:40-06:00|1/world/01|0|8632379.chadmin1.ib0.cheyenne.ucar.edu
2021-06-08T09:04:40-06:00 [STDOUT] 8632379.chadmin1.ib0.cheyenne.ucar.edu
[jobs-poll ret_code] 0
[jobs-poll out] 2021-06-08T09:19:42-06:00|1/world/01|{"job_runner_name": "pbs", "job_id": "8632379.chadmin1.ib0.cheyenne.ucar.edu", "job_runner_exit_polled": 1, "time_submit_exit": "2021-06-08T09:04:41-06:00"}

I’m afraid that doesn’t mean much to me. :grinning_face_with_smiling_eyes: I’m going to look at my flow.cylc again and then resubmit to see if this was a one-off problem or if I can find an issue.

EDIT: Ran to completion on the second attempt and did not experience the 15-minute delay. It looks like the change to ‘method = address’ did fix the problem, and then there was a random submission failure to spice things up… @oliver.sanders thanks for all the help!
