Hi everyone,
I’m hoping you all can help me again. I’ve been having issues getting my Cylc suite running on one of our HPCs. I was initially using Cylc 7, but after a downtime on the system I started having issues with it. Cylc 7.9.1 no longer builds on the system. I’ve downgraded to Cylc 7.8.4, which builds; however, cylc monitor can never connect to a running suite (it looks as if the suite isn’t running at all).
I’ve temporarily swapped over to the Cylc 8 preview and have had more luck with that. Initially, cylc monitor would work, but the tasks would not update when they errored out. I’ve reconfigured my global.rc file to use polling, and set the execution and submission polling intervals to PT1M. That has allowed cylc monitor to detect that my tasks are failing, instead of showing them in a permanent submitted state.
I’m now trying to diagnose why these tasks are failing. The suite works fine on another HPC, and the task is actually simple: it just creates some scratch directories to prep for other tasks. This is all I’m seeing in my job.err file:
ClientTimeout: Timeout waiting for server response.
2020-12-15T02:13:03Z CRITICAL - failed/EXIT
ClientTimeout: Timeout waiting for server response.
The job.out file is also not very helpful:
Suite : ionoda_sample_project
Task Job : 20140101T0000Z/create_work_folders/01 (try 1)
User@Host: tyndall@batch3
2020-12-15T02:12:57Z INFO - started
Is the reported timeout caused by the compute node not being able to communicate back to the daemon running on the login node? That’s the only cause I can think of. I’m not quite sure what I’m looking at here, so if anyone has any ideas, I’m open to them.
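To test that theory, something like the following could be run from inside a batch job on the compute node. This is only a minimal sketch: the host name and port are placeholders I would have to supply myself (the login node running the suite and whatever port the suite server is listening on), and `can_connect` is a made-up helper, not part of Cylc. It only checks that a plain TCP connection can be opened; it says nothing about whether Cylc's own client would then succeed.

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Usage (both arguments are placeholders): python check_port.py HOST PORT
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state} from {socket.gethostname()}")
    else:
        print("usage: python check_port.py HOST PORT")
```

If this reports the port as not reachable from the compute node but reachable from the login node itself, that would point to a network or firewall restriction between the two rather than a problem in the suite.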
I’m also willing to go back to Cylc 7 if anyone has any idea how to get that working. Thanks,
-Dan