Job submitted via Cylc fails, but runs fine with direct sbatch

prajeeshag · August 28, 2025, 12:07pm

I’m running into an issue where a multinode job submitted through Cylc fails with the error:
“cxil_map: write error”

However, if I take the exact job script generated by Cylc and submit it directly with sbatch, the job runs without any problem.

I am using Cylc version 8.5.1.

I’d appreciate any help in debugging this issue.

Thanks in advance.

oliver.sanders · August 28, 2025, 12:42pm

Hi,

We haven’t had a report of this yet.

Could you provide the full error from the scheduler log and job-activity.log files, in case they contain anything helpful?

An AI assistant gave me this response for the error message:

The “cxil_map: write error” typically occurs during inter-node GPU-aware MPI communication, often due to system misconfigurations or insufficient disk space. Ensure your system is updated and check your disk space to resolve this issue.

prajeeshag · August 30, 2025, 7:31am

Hi,

Thanks for the response. I was able to resolve the issue by adding

export FI_CXI_RX_MATCH_MODE=hybrid

to the job script. The problem wasn’t directly related to Cylc, but for some reason only the jobs submitted via Cylc were failing, while the same script submitted manually with sbatch worked fine. My guess is that this is due to a difference in the environment when jobs are launched through Cylc versus submitted manually with sbatch from the shell.
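One way to test the environment guess is to dump the environment from inside the job in both cases and compare the two dumps. The sketch below is only an illustration: the function name env_diff and the file names env.cylc.txt and env.sbatch.txt are hypothetical, and it assumes each dump was written with `env | sort` near the top of the job script. Values that span multiple lines will show up as several separate lines.

```shell
# Hypothetical helper: print the lines that differ between two
# environment dumps (each produced by `env | sort > FILE`).
# "<" marks lines found only in the first file, ">" only in the second.
env_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # Re-sort both dumps so the comparison does not depend on input order.
    sort "$1" > "$a"
    sort "$2" > "$b"
    # Keep only the changed lines; `|| true` keeps the exit status
    # clean when the two environments are identical.
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}

# Example usage (file names are assumptions):
#   env_diff env.cylc.txt env.sbatch.txt
#   env_diff env.cylc.txt env.sbatch.txt | grep -E 'FI_|MPICH_|SLURM_'
```

Filtering the output for FI_, MPICH_, and SLURM_ prefixes is a reasonable first pass, since the fix here was a libfabric CXI setting, but the variable that actually differs may be something else.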

Thanks again.
