Errors in global init-script causing tasks to disappear in Cylc 8

I’m having an interesting issue: if our global init-script (as defined in global.cylc) contains errors, tasks disappear from the graph without ever being submitted. Any downstream dependencies in the same cycle point also disappear, and the workflow moves on to the next cycle point.

I didn’t see anything obvious in the configuration reference that would speak to this, since the task itself isn’t actually running. It’s more like a prep-phase issue, but Cylc’s default behavior seems to be to ignore it and move on.

Is there a way to tell it not to do that so if there’s a breaking environment change we’ll be able to catch it?

Thanks!

That certainly shouldn’t happen. The global init-script only affects a job when it runs. So, what I would expect to happen is that the task gets submitted but then fails to do anything when it runs. Cylc should continue to show the task as submitted until it polls the task, at which point it should show it as submit-failed. Unless you handle submit-failures in your graph, the workflow should stall.

We’ll need to see the relevant bits of the scheduler log to work out what’s happening.

Also, make sure you are using 8.2.2 or later. Before this there was a bug where submit-failed tasks were treated as completed.
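For anyone wanting to script that check, here is a minimal sketch. The function name version_at_least is made up for illustration, and it assumes a sort that supports -V (version sort, as in GNU coreutils).

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumption: "sort -V" (version sort) is available, as in GNU coreutils.
version_at_least() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

It could be used as version_at_least "$(cylc version)" 8.2.2 || echo "please upgrade", since cylc version prints just the version number.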

Thanks, and sorry for the late reply. I was trying to get a bare-bones suite going to replicate the issue, to make sure it isn’t something about the larger suite. I’m having some “filter hook” error that I’ve never seen before, so that’s a bit delayed. I’ll try replicating elsewhere today.

Confirmed I am using 8.2.2, thanks!

To add since I didn’t say so in my original post: once I found the command typo in the init-script and fixed it, the suite ran as expected.

@russbnavy - I’ll be interested to see if you can provide a reproducible example of this. I can’t replicate the problem at 8.2.2.

Global init scripting (defined per platform in global.cylc) runs first thing in every job script, but it is somewhat insulated from the job abort-on-error system - otherwise a typo in a site global config could make every job fail for all users.

# global.cylc
[platforms]
    [[localhost]]
        global init-script = "echo GLOBAL INIT; typo"

Test workflow:

[scheduling]
    [[graph]]
        R1 = "foo"
[runtime]
    [[foo]]
        script = "echo HELLO"

I get this in job.err:

$ cylc log -f e bug//1/foo
.../log/job/1/foo/01/job: line 45: typo: command not found

But the job runs just fine despite that:

$ cylc log -f o bug//1/foo
GLOBAL INIT
Workflow : bug/run29
Job : 1/foo/01 (try 1)
User@Host: oliverh@NIWA-1022450

HELLO
2023-11-09T10:56:39+13:00 INFO - started
2023-11-09T10:56:39+13:00 INFO - succeeded

If I put set -e (abort the shell on error) in the global init script, that will result in the job appearing to be stuck as “submitted” (because the init script then aborts before the job started message can be sent).

Then, if I run cylc poll bug//1/foo (or wait for Cylc to poll the job), the job status will update to submit-failed.
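The difference that set -e makes can be seen in plain shell, outside Cylc. This is only an illustration of shell behavior, not Cylc's actual job script; the function names and typo_no_such_command are hypothetical stand-ins.

```shell
# Plain-shell illustration only; this is not how Cylc builds its job script.
# "typo_no_such_command" is a hypothetical stand-in for the mistyped command.

# Without errexit: the failing command is reported and execution continues.
without_errexit() {
    bash -c 'echo GLOBAL INIT; typo_no_such_command; echo HELLO'
}

# With "set -e": the shell exits at the first failing command,
# so nothing after the typo runs.
with_errexit() {
    bash -c 'set -e; echo GLOBAL INIT; typo_no_such_command; echo HELLO'
}
```

Calling without_errexit prints GLOBAL INIT and HELLO on stdout (plus a "command not found" message on stderr), while with_errexit prints only GLOBAL INIT and returns a non-zero status.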

Actually the bug was that submitted was not a required output. So a submit-failed task could be removed as complete only if the succeeded output was optional. Otherwise, that would cause a stall even before 8.2.2.

So I can only replicate the problem if:

  • the global init script has a typo AND uses set -e to abort on any error
  • jobs get polled to detect the submit-failed state (which will happen automatically, after a while)
  • job success is marked as optional in the graph, i.e., foo? or foo:succeed?
  • and I’m running cylc-8.2.1 or earlier

Then, submit-failed tasks will be removed as completed (and tasks downstream of them therefore won’t run).

But at 8.2.2, even that will not cause your problem because submitted is now correctly treated as a required output:

INFO - New flow: 1 (original flow from 1) 2023-11-09 11:26:05
INFO - [1/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [1/foo waiting job:00 flows:1] => waiting(queued)
INFO - [1/foo waiting(queued) job:00 flows:1] => waiting
INFO - [1/foo waiting job:01 flows:1] => preparing
INFO - [1/foo preparing job:01 flows:1] submitted to localhost:background[7465]
INFO - [1/foo preparing job:01 flows:1] => submitted
...
INFO - [command] poll_tasks
INFO - Command actioned: poll_tasks(['1/foo'])
INFO - [1/foo submitted job:01 flows:1] (polled)submission failed
CRITICAL - [1/foo submitted job:01 flows:1] submission failed
INFO - [1/foo submitted job:01 flows:1] => submit-failed
WARNING - [1/foo submit-failed job:01 flows:1] did not complete required outputs:
    ['submitted']
ERROR - Incomplete tasks:
      * 1/foo did not complete required outputs: ['submitted']
CRITICAL - Workflow stalled
WARNING - PT1H stall timer starts NOW

Ok, I was unable to replicate the issue with a minimal test suite (same conda env and same Cylc version) on another cluster so it must be something else. Sorry!

Edit: and the fact the minimal test suite works on the alternate cluster and not on the original cluster further supports the theory that there’s something about the original cluster not playing nice with Cylc 8. I’ll try to double check that all the configs are the same.

Edit 2: Well, now it has started happening on the alternate cluster, but at least I was able to replicate it with my minimal test suite. It seems that when PBS throws the following error, Cylc doesn’t know what to do with it:

qsub: request rejected as filter hook 'main_hook' encountered an exception. Please inform Admin

Also, it doesn’t appear to be spawning/running downstream dependent tasks either, just the affected task disappears from the graph. I tried running with debugging enabled but didn’t get any other details at the job level. Relevant flow log below which shows Cylc is indeed detecting the submit failure but then it removes the task thinking it’s a proxy?

2023-11-09T18:51:09Z INFO - Command actioned: force_spawn_children(['20231110T0000Z/test_B'], outputs=['succeeded'], flow_num=None)
2023-11-09T18:51:09Z INFO - [20231110T0000Z/test_C waiting(queued) job:00 flows:none] => waiting
2023-11-09T18:51:09Z INFO - [20231110T0000Z/test_C waiting job:01 flows:none] => preparing
2023-11-09T18:51:09Z DEBUG - REMOTE INIT NOT REQUIRED for localhost
2023-11-09T18:51:09Z DEBUG - [20231110T0000Z/test_C preparing job:01 flows:none] host=localhost
2023-11-09T18:51:09Z DEBUG - ['jobs-submit', '--debug', '--utc-mode', '--path=/bin', '--path=/usr/bin', '--path=/usr/local/bin', '--path=/sbin', '--path=/usr/sbin', '--path=/usr/local/sbin', '--', '$HOME/cylc-run/test_8/run6/log/job'] ... # will invoke in batches, sizes=[1]
2023-11-09T18:51:09Z DEBUG - 20231110T0000Z/test_C -triggered off ['20231110T0000Z/test_B'] in flow none
2023-11-09T18:51:09Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--utc-mode', '--path=/bin', '--path=/usr/bin', '--path=/usr/local/bin', '--path=/sbin', '--path=/usr/sbin', '--path=/usr/local/sbin', '--', '$HOME/cylc-run/test_8/run6/log/job', '20231110T0000Z/test_C/01']
2023-11-09T18:51:10Z DEBUG - [jobs-submit cmd] cylc jobs-submit --debug --utc-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/test_8/run6/log/job' 20231110T0000Z/test_C/01
    [jobs-submit ret_code] 0
    [jobs-submit out]
    [TASK JOB SUMMARY]2023-11-09T18:51:10Z|20231110T0000Z/test_C/01|32|None
    [TASK JOB COMMAND]2023-11-09T18:51:10Z|20231110T0000Z/test_C/01|[STDERR] qsub: request rejected as filter hook 'main_hook' encountered an exception. Please inform Admin
2023-11-09T18:51:10Z ERROR - [jobs-submit cmd] cylc jobs-submit --debug --utc-mode --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin -- '$HOME/cylc-run/test_8/run6/log/job' 20231110T0000Z/test_C/01
    [jobs-submit ret_code] 32
    [jobs-submit out] 2023-11-09T18:51:10Z|20231110T0000Z/test_C/01|32|None
2023-11-09T18:51:10Z DEBUG - [20231110T0000Z/test_C preparing job:01 flows:none] (internal)submission failed at 2023-11-09T18:51:10Z
2023-11-09T18:51:10Z CRITICAL - [20231110T0000Z/test_C preparing job:01 flows:none] submission failed
2023-11-09T18:51:10Z INFO - [20231110T0000Z/test_C preparing job:01 flows:none] => submit-failed
2023-11-09T18:51:10Z DEBUG - [20231110T0000Z/test_C submit-failed job:01 flows:none] task proxy removed (finished)

“then it removes the task thinking it’s a proxy”

That’s just terminology:

  • “task” - an abstract entity in the graph, representing something you want to run
  • “job” - what actually gets submitted to run, when the task’s dependencies are met
  • “task proxy” - an object in the scheduler that represents the task during its time in the active window of the graph

Well, in Cylc 8 downstream tasks are only spawned when the upstream outputs they depend on are generated. If your task fails to submit, and is then removed, it will have generated no outputs and so no downstream tasks will be spawned.

So the real question is, why did the submit-failed task get removed as finished, rather than retained as incomplete to stall the workflow?
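To check a scheduler log for this pattern, something like the following sketch might help. The function name is hypothetical, the log path is passed in as an argument, and the pattern is based only on the log lines quoted in this thread (the "task proxy removed" line is logged at DEBUG level there, so it may only appear when running in debug mode).

```shell
# Print scheduler log lines where a submit-failed task proxy was removed
# as finished. $1 is the path to the scheduler log file.
# The pattern is an assumption based on the log excerpts in this thread.
find_dropped_submit_failures() {
    grep -E 'submit-failed [^]]*\] task proxy removed \(finished\)' "$1"
}
```

No output (and a non-zero exit status from grep) means no such line was found.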

Can you confirm that you are indeed running 8.2.2 here? And does your graph have any optional outputs (i.e., outputs marked with ?)?

BTW your log also reveals that the removed task does not belong to a flow - flows:none - which means it would not spawn children even if it did run and generate outputs - see “triggering a flow-independent task” in Concurrent Flows — Cylc 8.2.2 documentation

[Note that in the current release cylc set-outputs spawns tasks with flow=none by default, so I expect you have done that. The next minor release 8.3 makes this consistent with cylc trigger in terms of default flow assignment].

If I contrive to get a submit failure at 8.2.2, even with a flow=none task, it does not get removed:

ERROR - [jobs-submit cmd] cylc jobs-submit --debug --path=/bin --path=/usr/bin --path=/usr/local/bin --path=/sbin --path=/usr/sbin --path=/usr/local/sbin
    -- '$HOME/cylc-run/gub/run15/log/job' 1/bar/01
    [jobs-submit ret_code] 1
    [jobs-submit out] 2023-11-10T14:40:24+13:00|1/bar/01|1|
DEBUG - [1/bar preparing job:01 flows:none] (internal)submission failed
CRITICAL - [1/bar preparing job:01 flows:none] submission failed
INFO - [1/bar preparing job:01 flows:none] => submit-failed
WARNING - [1/bar submit-failed job:01 flows:none] did not complete required outputs: ['submitted']
ERROR - Incomplete tasks:
      * 1/bar did not complete required outputs: ['submitted']
CRITICAL - Workflow stalled

Yeah:

$ cylc version
8.2.2

I was using the GUI to do things, but here’s how I started up the suite:

cylc install test_8
cylc play --pause test_8

Then I use the GUI to hold the following task pattern:
*/*
Then I resume the workflow and trigger test_C. Does “trigger” use “set-outputs”?

No, trigger doesn’t use set-outputs. It won’t trigger a task with flow=none unless you ask for it (with cylc trigger --flow=none).

From the top of your scheduler log:

Command actioned: force_spawn_children(
    ['20231110T0000Z/test_B'], outputs=['succeeded'], flow_num=None)

That means you must have used the set-outputs command manually, not trigger. And at 8.2.2 that command still defaults to flow=none.

(To explain the function name, what set-outputs does is spawn the children of the outputs that you “set” just as if the outputs had been generated naturally)

So that would seem to explain why you have flows:none, and hence why tasks downstream of test_C would not be spawned if test_C actually ran.
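If it helps to see which tasks ended up outside any flow, here is a rough sketch that pulls them out of a scheduler log. The function name is hypothetical, and the pattern assumes the "[cycle/task state job:NN flows:none]" prefix seen in the log lines above.

```shell
# List the distinct task IDs that appear with "flows:none" in a scheduler log.
# $1 is the path to the log file. The bracketed prefix format is an
# assumption taken from the log excerpts in this thread.
list_flowless_tasks() {
    grep -oE '\[[^] ]+ [^]]*flows:none\]' "$1" \
        | sed -E 's/^\[([^ ]+) .*/\1/' \
        | sort -u
}
```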

But unfortunately it does not explain why test_C would be removed on submit-failure. That’s still a mystery! It should look like my log above, where the submit-failed task is retained as an incomplete task, which will cause the workflow to stall and await intervention.

Ok, I think I finally understand the role of flow. I’ll experiment more with it. I’m thinking this might be specific to our clusters, but it would certainly be catastrophic if things just started moving on doing effectively nothing so I’m curious what’s happening that’s preventing Cylc from handling that PBS failure.

I’m using the latest Mambaforge installed to $HOME and the latest cylc-flow from conda-forge…

Great. Basically, a “flow” is a self-consistent run through the graph, starting from some task or tasks.

If you haven’t already done so, read this: Concurrent Flows — Cylc 8.2.2 documentation - and let us know if you think anything needs to be clarified.

Also, the upcoming 8.3.0 release will have several major enhancements on the manual intervention front (triggering tasks, setting prerequisites and outputs, and starting or not starting new flows). After that, we plan to add detailed documentation on how to handle various manual interventions - so you’re a little ahead of the game!

A-ha!!! @russbnavy - you’re not running in Cylc 7 back-compat mode are you? (i.e., with a suite.rc config file, not flow.cylc). The 8.2.3 release announcement reminded me that back-compat mode had a more severe version of the submit-failed handling bug.

Running my test case in back-compat mode does indeed result in this:

CRITICAL - [1/D submitted job:01 flows:none] submission failed
INFO - [1/D submitted job:01 flows:none] => submit-failed
DEBUG - [1/D submit-failed job:01 flows:none] task proxy removed (finished)  # !!!

Hopefully that explains it. If so, there’s nothing wrong with your system. You just need to upgrade to 8.2.3, or upgrade your workflow config to Cylc 8.
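When many workflows are still being migrated, a quick check like this sketch could flag the ones that would run in back-compat mode. The function name is hypothetical, and it takes "back-compat" to mean simply that the source directory has a suite.rc but no flow.cylc, as described above.

```shell
# Succeeds if the workflow source directory $1 looks like it would run in
# Cylc 7 back-compat mode, taken here to mean: suite.rc present and
# flow.cylc absent.
uses_back_compat() {
    [ -f "$1/suite.rc" ] && [ ! -f "$1/flow.cylc" ]
}
```

For example, uses_back_compat test_8 && echo "still on suite.rc" would report a source directory that has not been upgraded yet.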


Yup, that’s it. Seeing as how I’m single-handedly migrating our 38 Cylc 7 suites to a new cluster and supporting six different clusters it’s going to take me a while to get everything to full Cylc 8. I haven’t even really gotten Cylc 8 through all our red tape… But I will say I’m digging the new web interface and using it via SSH tunneling has been working great, aside from some QoL items that I’m still learning about (like viewing job logs for completed jobs no longer visible in the flow).
