Cylc communication issue on specific HPC?

dtyndall · December 15, 2020, 2:31am

Hi everyone,

I’m hoping you all can help me again. I’ve been having issues getting my Cylc suite running on one of our HPCs. I was initially using Cylc 7, but after downtime, I’ve started having issues with it. Cylc 7.9.1 no longer builds on the system. I’ve downgraded to Cylc 7.8.4, which builds, but cylc monitor can never connect to a running suite (it looks like the suite isn’t running at all).

I’ve temporarily swapped over to the Cylc 8 preview and have had more luck with that. Initially, cylc monitor would work, but the tasks would not update as they errored out. I’ve reconfigured my global.rc file to use polling, and set the execution and submission polling intervals to PT1M. That has allowed cylc monitor to see that my tasks are failing, instead of showing them in a permanent submitted state.

I’m now trying to diagnose why these tasks are failing. The suite works fine on another HPC, and the task is actually simple: it just creates some scratch directories to prep for other tasks. This is all I’m seeing in my job.err file:

ClientTimeout: Timeout waiting for server response.
2020-12-15T02:13:03Z CRITICAL - failed/EXIT
ClientTimeout: Timeout waiting for server response.

The job.out file is also not very helpful:

Suite : ionoda_sample_project
Task Job : 20140101T0000Z/create_work_folders/01 (try 1)
User@Host: tyndall@batch3

2020-12-15T02:12:57Z INFO - started

Is the reported timeout related to the node not being able to communicate back to the daemon running on the login node? That’s all I can think of causing the problems. I’m not quite sure what I’m looking at here, so if anyone has any ideas, I’m open to it.

Also willing to go back to Cylc 7, if anyone has any idea how to get that working. Thanks,

-Dan

oliver.sanders · December 15, 2020, 11:03am

Hello,

I’ve temporarily swapped over to the Cylc 8 preview

We wouldn’t recommend using the Cylc 8 alpha releases for anything other than evaluation purposes. We will have a beta release out before too long that should be more suitable for test usage.

Is the reported timeout related to the node not being able to communicate back to the daemon running on the login node?

Yes, it sounds like you have communication issues:

ClientTimeout: Timeout waiting for server response.

This means that the job was unable to communicate its status back to the workflow daemon (Cylc scheduler).

but the tasks would not update as they errored out

This is because they cannot communicate back to report their status.

I’m now trying to diagnose why these tasks are failing

This communication error will not cause the task itself to fail. The reason these tasks are failing is most likely to do with the script itself and not Cylc.
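
When reading job.err, it can help to set the communication noise aside so that whatever the script itself printed stands out. The following is only a sketch: `split_job_err` is a hypothetical helper, not part of Cylc, and it assumes communication failures show up as lines starting with `ClientTimeout`, as in the job.err quoted above.

```python
from typing import List, Tuple


def split_job_err(text: str) -> Tuple[List[str], List[str]]:
    """Split job.err text into (comms_lines, other_lines).

    Assumption: communication failures are the lines that start with
    "ClientTimeout"; everything else is treated as output worth reading.
    """
    comms: List[str] = []
    other: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("ClientTimeout"):
            comms.append(line)
        else:
            other.append(line)
    return comms, other


# Example usage (the path is a placeholder for the real job.err location):
# with open("job.err") as handle:
#     comms, other = split_job_err(handle.read())
# print(len(comms), "comms errors;", "remaining lines:", other)
```

If the second list contains nothing beyond the failed/EXIT line, the job.xtrace file from a debug run is the next place to look.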

Things you can try to get more information:

Run the suite in debug mode, then check the job.xtrace file. This will help you find the line of the job script where execution errored:
- cylc run --debug (with Cylc)
- rose suite-run -- --debug (with Rose)
Try running dummy tasks (e.g. echo 'hello world') and confirm you can get those passing (and sort out comms) before moving on to more involved tasks.
Try running the job script manually. This cuts Cylc out of the equation (task communication aside).

Cylc communication issue on specific HPC

Unfortunately it is rather difficult to debug these issues without poking at the system in question. Some thoughts on things to check:

Are the required dependencies installed at the correct versions?
- Try running the cylc check-software command.
Are jobs running using the same version of Cylc as the Cylc Scheduler?
- Multiple installed versions can clash (note that Cylc 7 uses HTTPS and Cylc 8 uses TCP).
- We advise using a wrapper script to handle this.
- To confirm the correct pairing you can add cylc version to the task script.
Are you able to make HTTP(S) connections between where the jobs are running and where the Scheduler is? You may need to configure the ports the Cylc Scheduler runs on.
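
One rough way to test that last point is to run a small script on a compute node that tries to open a plain TCP connection back to the scheduler host. This is a minimal sketch under stated assumptions: `can_connect` is a hypothetical helper, not a Cylc API, and the host name and port in the usage comment are placeholders. A successful connection only shows the port is reachable; it does not prove that Cylc's own protocol will work over it.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage from a job script or interactive session on a compute node.
# Both values are placeholders: substitute the scheduler's host and port.
# print(can_connect("login-node.example", 43001))
```

If this returns False from the compute nodes but True from the login node, that points at a firewall or routing restriction rather than at Cylc itself.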

In the extreme case you can always temporarily configure polling task communications as you did for Cylc 8 (it works the same in Cylc 7).

Others may have more suggestions and debugging tools, let us know what you find.

dtyndall · December 16, 2020, 5:25am

Oliver,

Thanks a lot for your help. I ended up working with @Tim_Whitcomb on diagnosing issues with this HPC. I did have a problem with my Cylc script, but it looks like there were other machine-dependent problems as well.

Cylc 7 still does not run on the machine, but Cylc 8 does. Tim suspects that it’s because of the communication changes in Cylc between these versions. With the Cylc 8 installation on this machine, we also needed to configure the machine to use polling for the task communication. This was the only way to get the suite to hear back from the individual tasks.

Totally understand that Cylc 8 is only in preview right now. That is fine for us, since this is only supporting a research effort. I’m using Cylc 7 everywhere else that is dependent on operations.

-Dan
