Hi,
Could somebody please point me to where in the code (7.8.6) I can catch the output from the polling command, and where Cylc uses that output to determine whether the job is still running? I’m going round in circles currently; there are quite a few polling routines. I believe the poll command (Slurm) is squeue -h -j <jobids>.
The problem we have is that Cylc incorrectly determines that a heterogeneous job has failed when it (all constituent parts) is actually still running.
Thanks.
Regards,
Ros.
Hi @Rosalyn_Hatcher,
I’ve just confirmed the problem. I’m guessing it’s that Cylc is expecting a single job to be returned by the job ID query, but you get multiple jobs in this case. Cylc is not expecting the job ID extensions for heterogeneous jobs (ID+0, ID+1 etc.) in job poll (query) output. Looking into it…
Hilary
Hi again Ros,
“Could somebody please point me to where in the code (7.8.6) I can catch the output from the polling command and where cylc uses the output to determine if the job is still running or not?”
That may have been hard to follow (it was hard enough for me!) because we’ve abstracted out as much of the “batch system handler” code as possible, and each particular batch system only implements the bits of the generic interface that it needs.
In this case, I had to implement a poll output filtering method that was not previously used in the Slurm handler.
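To illustrate the kind of filtering involved, here is a minimal sketch. It is not the code from the pull request: the function name `running_job_ids` and the sample `squeue` lines are hypothetical, and it assumes the job ID is the first whitespace-separated field on each output line, with heterogeneous components shown as ID+0, ID+1 and so on.

```python
def running_job_ids(poll_output):
    """Return the set of base job IDs found in squeue-style poll output.

    Hypothetical helper for illustration only. Heterogeneous job
    components ("1234+0", "1234+1", ...) are collapsed to their base
    ID ("1234"), so a job counts as present if any component is listed.
    """
    job_ids = set()
    for line in poll_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Strip any "+N" heterogeneous component suffix from the first field.
        base_id = fields[0].split("+", 1)[0]
        if base_id.isdigit():
            job_ids.add(base_id)
    return job_ids


# Made-up sample output in the assumed layout (job ID in the first column).
sample = """\
1234+0 debug hetjob ros R 0:42 1 node001
1234+1 debug hetjob ros R 0:42 2 node[002-003]
5678 debug plain ros PD 0:00 1 (Priority)
"""

print(sorted(running_job_ids(sample)))
```

With the component suffixes stripped, a simple membership test on the submitted job ID no longer misses a heterogeneous job that is still listed by the poll command.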
I’ve posted a new pull request to the Cylc 7.8.x branch on the repository for heterogeneous job support: Slurm heterogeneous job support 7.8.x by hjoliver · Pull Request #3970 · cylc/cylc-flow · GitHub
Hopefully this will be merged for a small Cylc 7 update release soon. You already have the first part of it, of course. The final commit on the branch has the poll filtering fix: Slurm heterogeneous job support 7.8.x by hjoliver · Pull Request #3970 · cylc/cylc-flow · GitHub
Try that for size and let us know on the GitHub PR page if you have any questions or run into problems.
Hilary