Cylc job hanging and failing to submit

I'm currently trying to run a workflow on Cylc 8.2.4.

When it gets to a certain job, it hangs on the job without submitting. I get the following messages in the job-activity.log file:

[jobs-poll ret_code] 0
[jobs-poll out] 2024-05-20T10:48:35+12:00|20140101T0000Z/atmos_main/01|{"job_runner_name": "slurm", "job_id": "3987323", "job_runner_exit_polled": 0, "time_submit_exit": "2024-05-20T10:33:32+12:00"}
[jobs-poll ret_code] 0
[jobs-poll out] 2024-05-20T11:03:35+12:00|20140101T0000Z/atmos_main/01|{"job_runner_name": "slurm", "job_id": "3987323", "job_runner_exit_polled": 0, "time_submit_exit": "2024-05-20T10:33:32+12:00"}
[jobs-poll ret_code] 0

This message repeats every 15 minutes with a new timestamp; I’m assuming it tries to execute the job but fails? Any ideas what could be causing this issue?

I am also getting an error at the start of the job-activity.log file reading:

2024-05-19T22:33:32Z [STDERR] sbatch: error: plugin_load_from_file: dlopen(/opt/cray/pe/atp/libAtpSLaunch.so): /opt/cray/pe/atp/libAtpSLaunch.so: cannot open shared object file: No such file or directory
2024-05-19T22:33:32Z [STDERR] sbatch: error: spank: /opt/cray/pe/atp/libAtpSLaunch.so: Dlopen of plugin file failed

This error doesn’t happen every time I try to submit and run the workflow.

No. Your job has been submitted successfully and Cylc is now waiting for the job to run - the UI should show the task in the submitted state. Cylc polls the job (by default every 15 minutes) to check whether it is still in the queue. You can see from the poll output that your job was submitted at 2024-05-20T10:33:32+12:00 and that the job ID is 3987323. So, you need to query Slurm (using squeue) to see why your job isn’t running.
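
If it helps, here is a rough Python sketch of that step. It is not part of Cylc; it just reads one of the [jobs-poll out] lines quoted above, pulls out the job ID and submit time, and prints a squeue command you could run on the cluster. It assumes the line has the "timestamp|task job|JSON" layout shown in your log and that the quotes in the real file are plain ASCII double quotes. The helper name parse_poll_line is made up for this example, and %i, %T and %r are squeue format fields for job ID, state and pending reason.

```python
import json
import shlex


def parse_poll_line(line):
    """Split a '[jobs-poll out]' line into (poll time, task job, info dict).

    Hypothetical helper: assumes the 'timestamp|task job|JSON' layout
    seen in the job-activity.log excerpt above.
    """
    prefix = "[jobs-poll out] "
    if line.startswith(prefix):
        line = line[len(prefix):]
    # Split on the first two '|' only, so the JSON payload stays intact.
    poll_time, task_job, payload = line.split("|", 2)
    return poll_time, task_job, json.loads(payload)


line = (
    "[jobs-poll out] 2024-05-20T10:48:35+12:00|20140101T0000Z/atmos_main/01|"
    '{"job_runner_name": "slurm", "job_id": "3987323", '
    '"job_runner_exit_polled": 0, "time_submit_exit": "2024-05-20T10:33:32+12:00"}'
)

poll_time, task_job, info = parse_poll_line(line)

# Build (but do not run) a squeue query for this job:
# job ID, job state, and the reason it is still pending.
squeue_cmd = ["squeue", "-j", info["job_id"], "-o", "%i %T %r"]

print(task_job, "was submitted at", info["time_submit_exit"])
print("Command to run on the cluster:", shlex.join(squeue_cmd))
```

The reason column from squeue is usually the quickest hint as to why a job is still queued, for example waiting on resources or priority.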

The sbatch errors need reporting to your Slurm administrators.

Thanks for the help, I’ll contact my local support
