I’ve been using cylc without issues on a remote SLURM cluster at the start of the year.
Due to other priorities, I’ve put that aside for a few months, and am now coming back to cylc.
However, I am now totally unable to load any cylc suite on the remote server, as every run is met with submit-failed.
Typical job-activity.log:
[jobs-submit cmd] cylc jobs-submit --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/temp_cylc/run2/log/job' 1/a/01
[jobs-submit ret_code] 1
[jobs-submit out] 2024-10-05T18:05:45+02:00|1/a/01|1|None
2024-10-05T18:05:45+02:00 [STDERR] [Errno 13] Permission denied: 'nohup'
I have tested this both with the “pre-installed” cylc on the cluster (v8.2.2) and with a new install in a virtualenv (v8.3.4), even with the most basic flow possible.
I suppose this is due to a change of config on the SLURM cluster itself. However, the error message doesn’t help me diagnose what’s going wrong precisely; as a user, I can use nohup just fine.
Could you help me understand more precisely what’s going wrong there?
Many thanks.
Hi @abarral
Cylc uses nohup to detach “background” jobs.
If you are using the Slurm job runner you shouldn’t see that.
If you are intentionally using the background job runner, then the error message seems pretty unambiguous.
as a user, I can use nohup just fine.
Have you checked on the job platform host(s), where the command is executed by Cylc? (If that is different from your scheduler host, where you run the Cylc scheduler).
Just to confirm, on my system, this is exactly the error I get in the scheduler log if I try to run a background job when nohup is present in my executable search path but I do not have permission to execute it.
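If it helps, here is a rough way to check this on the job platform host. It is only a sketch: the directory list is copied from the `--path` options in the job-activity.log above, and `find_candidates` is a hypothetical helper name, not part of Cylc. It looks for every file called nohup in those directories and reports whether the current user is allowed to execute it.

```python
import os

# Directories copied from the --path options in the job-activity.log above.
SEARCH_DIRS = [
    "/bin", "/usr/bin", "/usr/local/bin",
    "/sbin", "/usr/sbin", "/usr/local/sbin",
]


def find_candidates(name, dirs):
    """Return (path, is_executable) for each regular file `name` found in `dirs`."""
    found = []
    for directory in dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            # os.access checks permissions for the user running this script.
            found.append((candidate, os.access(candidate, os.X_OK)))
    return found


if __name__ == "__main__":
    results = find_candidates("nohup", SEARCH_DIRS)
    if not results:
        print("nohup not found in any of the listed directories")
    for path, executable in results:
        status = "executable" if executable else "NOT executable by this user"
        print(f"{path}: {status}")
```

Run it on the host where Cylc submits the job, as the same user. Any entry reported as not executable would be consistent with the Permission denied error, though it may not be the only possible cause.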
Wow I feel a bit dumb now ^^
Indeed, I had forgotten to switch the environment variable that tells Cylc to execute via Slurm instead of via nohup.
I’m still not sure why it can’t use nohup, since when I login on the compute node I can use it, but it doesn’t matter much since the slurm scheduler works fine.
Thanks !
Wow I feel a bit dumb now ^^
This stuff’s complicated!
I’m still not sure why it can’t use nohup, since when I login on the compute node I can use it, but it doesn’t matter much since the slurm scheduler works fine.
Not sure.
Background jobs get run on the login node, not the compute node, so they might have a different environment?