Job submitted via Cylc fails, but runs fine with direct sbatch

I’m running into an issue where a multinode job submitted through Cylc fails with the error:
“cxil_map: write error”

However, if I take the exact job script generated by Cylc and submit it directly with sbatch, the job runs without any problem.

I am using Cylc version 8.5.1.

I’d appreciate any help in debugging this issue.

Thanks in advance.

Hi,

We haven’t had a report of this yet.

Could you provide the full error from the scheduler log and job-activity.log files, in case they contain anything helpful?

An AI assistant gave me this response for the error message:

The “cxil_map: write error” typically occurs during inter-node GPU-aware MPI communication, often due to system misconfigurations or insufficient disk space. Ensure your system is updated and check your disk space to resolve this issue.

Hi,

Thanks for the response. I was able to resolve the issue by adding

export FI_CXI_RX_MATCH_MODE=hybrid

to the job script. The problem wasn’t directly related to Cylc, but for some reason only the jobs submitted via Cylc were failing, while the same script submitted manually with sbatch worked fine. My guess is that the environment differs when jobs are launched through Cylc versus submitted manually with sbatch from the shell.
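
For anyone who wants to check that guess, one option is to dump the environment from inside the job in both cases and compare the two dumps. By default sbatch passes the submitting shell's environment on to the job, so variables set in an interactive shell could reach a manual submission but not one made by Cylc. The sketch below is untested on my system and makes some assumptions: the function name env_diff and the dump file names are hypothetical, and each dump is assumed to come from a line such as env | sort > "$HOME/env.$SLURM_JOB_ID.txt" added to the job script.

```shell
# Sketch: list variables that differ between two environment dumps.
# Each dump is assumed to come from `env | sort > FILE` run inside the job.
# The function name and the file names are hypothetical.
env_diff() {
    # Sort again so the comparison does not depend on how the dumps were made.
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    # comm -3 drops lines common to both files. What remains is unique to
    # the first file (no indent) or to the second file (tab-indented).
    comm -3 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}

# Example (hypothetical file names):
#   env_diff env.manual.txt env.cylc.txt
```

Values that span several lines will show up as separate lines in the output, so treat the result as a starting point rather than an exact diff.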

Thanks again.
