Complex scenario but the short of it is we have some custom PBS hooks written in ksh (with Bash-incompatible usage like typesets) for pre and post handling of jobs which relies on a custom ksh script being imported into jobs at runtime. Cylc jobs are Bash so this breaks and as a result the PBS exit handler reports the job as failed (due to the typeset errors reported by Bash) even though the job script itself succeeds. All our PBS reporting tools reflect the non-0 exit code. However, the Cylc task reports (correctly) a successful run.
My attempt to describe this in stages:
- Cylc job script submitted
- PBS pre-handler works fine; prints header info which will appear in job.out
- Cylc job script completes - exit 0 ā Cylc sees this and is what gets logged by Cylc
- PBS post-handler fails - exit 1 ā PBS sees this and this is what gets logged in PBS and is printed in the job.out
How does Cylc get the job exit code from stage 2 if PBS thinks the job failed in stage 3?
Hi @russbnavy
Cylc generates self-contained job scripts that consist of your user-defined job scripting wrapped in error detection and messaging code to report job status back to the Cylc server. So we donāt rely on (e.g.) PBS to report job success or failure.
I donāt have access to PBS at the moment, so maybe others can help out, but I would have thought that PBS hooks are separate executables that should not interact with the Cylc job script in any way - unless they are used to modify the job script before running it?? (in which case your PBS hook should presumably not be assuming that the job script is written in ksh). But from your description above the Cylc job runs fine and it is only the PBS post hook that fails, so Iām struggling to see if/where there is a problem with Cylc here?
Sorry, I should clarify: thereās no issue with how Cylc is handling the job. Cylc is reacting correctly and thatās what made me curious how itās determining the true job status and ignoring what PBS is putting out. The PBS event scripts are not injected into the job script itself but they do append the job output. You answered my question though, in that the Cylc messaging code is used instead of the PBS-reported exit code.
1 Like
Ok, I have a follow-up. On a seperate system, we periodically see a discrepancy in Cylc behavior vs whatās in the logs. Cylc will report a task as āsubmit-failedā but the job output says it completed successfully with an exit 0.
job-activity.log
shows:
[jobs-submit ret_code] 0
[jobs-submit out] 2021-09-16T15:35:50Z|20210916T0600Z/task1/01|0|12345.pbsserver
2021-09-16T15:35:50Z [STDOUT] 12345.pbsserver
[jobs-poll ret_code] 0
[jobs-poll out] 2021-09-16T15:37:31Z|20210916T0600Z/task1/01|{"batch_sys_name": "pbs", "batch_sys_job_id": "12345.pbsserver", "batch_sys_exit_polled": 1, "time_submit_exit": "2021-09-16T15:35:53Z"}
[(('event-handler-00', 'submission failed'), 1) cmd] cylc email-task 'submission failed' 'suite1' 'task1.20210916T0600Z' 'job submission failed'
[(('event-handler-00', 'submission failed'), 1) ret_code] 0
[(('event-handler-00', 'submission failed'), 1) out] cylc hook email-task: Sending email to admin@abc
But then job.status
says:
CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=12345.pbsserver
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2021-09-16T15:35:53Z
CYLC_JOB_PID=54321
CYLC_JOB_INIT_TIME=2021-09-16T15:37:21Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2021-09-16T15:37:22Z
CYLC_BATCH_SYS_EXIT_POLLED=2021-09-16T15:37:31Z
As far as I know, the PBS system weāre using for this job is NOT doing any post-job handlingā¦
Hi @russbnavy
From the job activity log, it appears that your job submits correctly but then gets polled very soon after by the Cylc server, and that poll returns the wrong result.
On this platform are you configuring Cylc to track jobs by polling, or polling in addition to task messaging?
In the past weāve seen job polling get the wrong result where the batch system returns job ID in some unexpected format that isnāt parsed correctly by Cylc. But I think thatās been fixed for all formats that we know of, and if that was the problem your polling should fail to get the correct job status every time, not just periodically as you say.
I wonder if there is a race condition here, between the job finishing and being polled, given that polling is a process that can take a non-trivial amount of time, and your job poll output appears pretty soon after job submission. Does the suite log give any more information, such as when the poll was initiated? And can you reproduce this easily with a test suite?
Hilary
1 Like
Hi @russbnavy. I think I have managed to reproduce how this could happen - see https://github.com/cylc/cylc-flow/issues/4431
1 Like
Excellent, @dpmatthews! Yes your bug report sounds very much like what was happening. Itās not normal functionality but we are on a shared system and so the occasional developer forgets an exit condition in their qsub loop and then the whole system starts crawling. Itās not bad enough to bother the devs but it most definitely interrupts operations which are very sensitive to such system slow-downs.
But fix merged and released in Cylc-7.9.5 and 7.8.10 (just announced).
Awesome! I had our admin install it. Thanks everyone!