Task wrangling issues

Hi.
I’m trying to create a new cylc8 suite based on my cylc7 NWP suite (nothing as simple as helloworld because I’m not that sensible).

My first task kept failing because it took me a while to get all the correct environment variables defined for the task. However, my workflow changes were not being picked up in the job script. My work pattern was to modify my sources, run cylc reinstall and cylc reload (repeating a couple of times just in case), then cylc trigger … the task (sketched below). The new variables defined in the workflow did not appear in the job script, even though they did appear in the workflow definition according to the output of cylc config.
Why doesn’t reinstalling and reloading reliably update the job script with new changes? (A colleague reported a similar issue).
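
For concreteness, my edit/reinstall/reload/trigger cycle looks roughly like this (the workflow name nwp_suite here is just a stand-in for my real one):

# edit the source files, then:
cylc reinstall nwp_suite
cylc reload nwp_suite
# re-run the failed task
cylc trigger nwp_suite//20220321T0600Z/fcm_make_um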

My second task was failing because it couldn’t find cylc. (Don’t ask me why my first task didn’t have this issue…) I eventually worked out it was because I had the cylc path in global.cylc pointing to someone else’s home directory, and fixed this. However, because the job didn’t get far enough, cylc doesn’t realise it failed; cylc poll returns a submit-failed state.
I did manage to retrigger the task once after a cylc poll, because the poll changed the state back to waiting (though I still hadn’t fixed the cylc path at that point). Thereafter, however, I was not able to resubmit the task: it remains in the submitted state even when it polls as submit-failed. How do I get the task back into a state where I can submit it again?

Log outputs

2022-08-11T05:51:52Z INFO - [command] poll_tasks
2022-08-11T05:51:52Z INFO - Processing 1 queued command(s)
2022-08-11T05:51:52Z INFO - Command succeeded: poll_tasks(['20220321T0600Z/fcm_make_um'])
2022-08-11T05:51:53Z INFO - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] (polled)submission failed at 2022-08-11T05:15:28Z
2022-08-11T05:51:53Z CRITICAL - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] submission failed
2022-08-11T06:00:31Z INFO - [20220321T0600Z/fcm_make_um submitted job:03 flows:1] poll now, (next in PT15M (after 2022-08-11T06:15:31Z))
2022-08-11T06:00:33Z INFO - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] (polled)submission failed at 2022-08-11T05:15:28Z
2022-08-11T06:00:33Z CRITICAL - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] submission failed
2022-08-11T06:01:19Z INFO - [command] poll_tasks
2022-08-11T06:01:19Z INFO - Processing 1 queued command(s)
2022-08-11T06:01:19Z INFO - Command succeeded: poll_tasks(['20220321T0600Z/fcm_make_um'])
2022-08-11T06:01:20Z INFO - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] (polled)submission failed at 2022-08-11T05:15:28Z
2022-08-11T06:01:20Z CRITICAL - [20220321T0600Z/fcm_make_um submit-failed job:03 flows:1] submission failed
2022-08-11T06:03:22Z INFO - Processing 1 queued command(s)
2022-08-11T06:03:22Z WARNING - [20220321T0600Z/fcm_make_um submitted job:03 flows:1] ignoring trigger - already active
2022-08-11T06:03:22Z INFO - Command succeeded: force_trigger_tasks(['20220321T0600Z/fcm_make_um'], flow=['all'], flow_wait=False, flow_descr=None)

I can’t find a command reference in the documentation; is there one? I am working from the command line, with no GUI and only limited use of the TUI to check what is happening.

I ended up killing the suite. I couldn’t use cylc stop because the task was “submitted”. cylc stop --kill got stuck in an endless loop of failing to kill the “submitted” task that had actually failed. cylc stop --now --now stopped the suite. But then I couldn’t cylc play to restart it, because the task I had removed from the graph hadn’t been manually removed from the workflow. So I did a cylc clean and started over.

Now I’ve fixed the problem with the path to cylc, and my fcm_make_um task is failing, so I’m back to the problem of my changes to the configuration not being propagated into the job script after multiple reinstalls and reloads.

BTW, I’m using version 8.0rc3, which is not the latest.


The reload-related bug in 8.0rc3 only affected the preparing state, and in fact I can’t seem to replicate your reload problem here with rc3.

Can you try it with the following minimal example:

[scheduling]
    [[graph]]
        R1 = "a"
[runtime]
    [[a]]
        script = "echo X is $X; false"
        [[[environment]]]
            X = x

The one task, a, will fail every time it runs, so you can reinstall → reload → retrigger at will after changing the value of X in the config file.
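
For example, something along these lines (the name reload-test is just a placeholder):

cylc install ./reload-test
cylc play reload-test
# edit X in the source flow.cylc, then:
cylc validate ./reload-test
cylc reinstall reload-test
cylc reload reload-test
cylc trigger reload-test//1/a
# and check the value of X in the new job script:
less ~/cylc-run/reload-test/runN/log/job/1/a/NN/job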

Yes, we removed that. It’s not really “standard practice” to put command help listings in a user guide. If you find it helpful you can generate the whole lot yourself like this:

for COMMAND in $(cylc help all | awk '{print $1}'); do
   cylc $COMMAND --help
done > cylc-help.txt 

The cylc stop default is to wait for active tasks to complete before shutting down, but cylc stop --now overrides that.
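
Roughly, the escalation goes like this (see cylc stop --help for the full details):

cylc stop WORKFLOW              # wait for active tasks to finish, then shut down
cylc stop --kill WORKFLOW       # kill active tasks, then shut down
cylc stop --now WORKFLOW        # shut down now, but wait for event handlers
cylc stop --now --now WORKFLOW  # shut down immediately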

I’ll come back to your other issues in a while (gotta go for now)…

Posting a minimal example (such as mine just above) makes it easier for us to understand and replicate your problems!

Other system details may be important too, for issues involving (in)correct tracking of job status, such as: is your job platform remote, does it see the same filesystem as the scheduler host, what is the job runner (PBS, Slurm, …?), what’s the task communications method (TCP, polling, or ssh)?
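
For reference, those details normally come from the platform definition in your global.cylc, along these lines (the platform name and values here are only illustrative):

[platforms]
    [[hpc_pbs]]
        hosts = localhost
        install target = localhost
        job runner = pbs
        communication method = zmq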

Minimal examples are good for diagnosing issues, but I never found them much use as a learning tool.

I’m running on the HPC, so there’s no remote system; tasks either run in the background or are submitted via PBS. Not sure about the task communication method. I can probably find out.

I have determined that I can reinstall and reload to my heart’s content, but if there is a syntax error I won’t notice that the reload didn’t work unless I run cylc validate or check the workflow log. Are reinstall and reload (and cylc config) supposed to also validate, or do I need to do this myself every time before I reinstall? cylc config appears to show the new changes even when my submitted job did not, so I suppose it doesn’t care whether the config is valid?
So that is probably the cause of some of my woes. I have been looking at my log files though, and I did a lot of reloading that did not appear to have an error associated with it. So I am not sure that 100% of the issues are due to syntax errors.

I can reproduce the issue of being unable to restart a workflow with a task removed.
e.g. add R1 = "b" to the graph, and define it the same as "a". Let it fail. Then delete it from the graph, (validate,) reinstall and reload. "b" is still there (e.g. in tui) because Cylc doesn’t automatically remove tasks (I read that in the docs!). So I cylc stop. Then I cylc play again. The workflow fails to restart.
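For reference, the pre-deletion config for my test (the hiworld workflow in the output below) was roughly:

[scheduling]
    [[graph]]
        R1 = """
            a
            b
        """
[runtime]
    [[a, b]]
        script = "echo X is $X; false"
        [[[environment]]]
            X = x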

(productive) gadi-login-04:dp9 run1 $ cylc stop hiworld
Done
(productive) gadi-login-04:dp9 run1 $ cylc play hiworld

2022-08-12T04:02:43Z INFO - Extracting job.sh to /home/548/sjr548/cylc-run/hiworld/run1/.service/etc/job.sh
2022-08-12T04:02:43Z INFO - Workflow: hiworld/run1
2022-08-12T04:02:43Z INFO - LOADING workflow parameters
2022-08-12T04:02:43Z INFO - + workflow UUID = 11d32f74-c6cc-4598-be5a-ebbf2d5c4133
2022-08-12T04:02:43Z INFO - + UTC mode = False
2022-08-12T04:02:43Z INFO - + cycle point time zone = Z
2022-08-12T04:02:44Z INFO - LOADING task proxies
2022-08-12T04:02:44Z INFO - + 1/a failed
2022-08-12T04:02:44Z INFO - + 1/b failed
2022-08-12T04:02:44Z ERROR - '~sjr548/hiworld/run1//$namespace|b'
    Traceback (most recent call last):
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/data_store_mgr.py", line 818, in generate_ghost_task
        task_def = self.data[self.workflow_id][TASKS][t_id]
    KeyError: '~sjr548/hiworld/run1//$namespace|b'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 661, in start
        await self.configure()
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 489, in configure
        self._load_pool_from_db()
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/scheduler.py", line 732, in _load_pool_from_db
        self.workflow_db_mgr.pri_dao.select_task_pool_for_restart(
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/rundb.py", line 842, in select_task_pool_for_restart
        callback(row_idx, list(row))
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/task_pool.py", line 456, in load_db_task_pool_for_restart
        self.add_to_pool(itask, is_new=False)
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/task_pool.py", line 199, in add_to_pool
        self.create_data_store_elements(itask)
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/task_pool.py", line 221, in create_data_store_elements
        self.data_store_mgr.increment_graph_window(itask)
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/data_store_mgr.py", line 654, in increment_graph_window
        is_orphan = self.generate_ghost_task(s_tokens.id, itask, is_parent)
      File "/home/548/jt4085/.conda/envs/productive/lib/python3.9/site-packages/cylc/flow/data_store_mgr.py", line 820, in generate_ghost_task
        task_def = self.added[TASKS][t_id]
    KeyError: '~sjr548/hiworld/run1//$namespace|b'
2022-08-12T04:02:44Z CRITICAL - Workflow shutting down - '~sjr548/hiworld/run1//$namespace|b'
2022-08-12T04:02:44Z WARNING - Incomplete tasks:
* 1/a did not complete required outputs: ['succeeded']
2022-08-12T04:02:44Z CRITICAL - Workflow stalled

Fair enough, but personally I’d rather avoid the complexities of managing an enormous NWP workflow (or whatever it might be) when trying to figure out how some detail of scheduling or job control works in Cylc.

Ah, that could do it. Have you actually confirmed that reinstall + reload has the desired effect in your environment if you didn’t introduce a syntax error?

No, the job of cylc reinstall is just to reinstall files from source to an existing run directory. You should validate again after making any change to the workflow configuration.

The cylc reload command tells the running scheduler to reload the (reinstalled) config file. The scheduler then, in effect, validates the file, because it won’t parse correctly with a syntax error. However, you will have to look at the scheduler log to see those errors, so it’s much better to validate first.
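
In practice that means something like this after every source change (the workflow name is just an example):

cylc validate ./my-workflow      # catch parse errors before touching the scheduler
cylc reinstall my-workflow       # copy the changes into the run directory
cylc reload my-workflow          # tell the running scheduler to pick them up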

That should not be the case. cylc config parses the config file and prints the parsed result, so it should fail if there’s a syntax error. Can you do a targeted test to confirm that?
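
Something like this would do as a targeted test (my-workflow is a placeholder):

# introduce a deliberate syntax error in the source, then:
cylc config ./my-workflow
echo "exit status: $?"   # should be non-zero, with a parse error printed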

Reproduced - I can confirm that’s a bug. Thanks for reporting!

The reload bit is a red herring, it’s just restart after removing a task from the config that’s a problem.

Okay, I have not been able to reproduce the issue of cylc config showing me a different task environment to what is in the job script. cylc config is now giving me parsing errors if the workflow is invalid (I don’t recall if that’s what I saw the other day). I’m not really sure how I got to that state. But I will have a better idea of what to look for in future in case it happens again.

I can reproduce it in my complex suite.

(productive) gadi-login-04:dp9 run1 $ cylc reinstall
REINSTALLED suite_test/run1 from /g/data/dp9/sjr548/suite_test
(productive) gadi-login-04:dp9 run1 $ cylc reload suite_test
Done
(productive) gadi-login-04:dp9 run1 $ cylc validate .
Valid for cylc-8.0rc3
(productive) gadi-login-04:dp9 run1 $ cylc trigger suite_test //20220321T0600Z/pfx_um_recon_cold
Done
(productive) gadi-login-04:dp9 run1 $ less log/job/20220321T0600Z/pfx_um_recon_cold/NN/job
(productive) gadi-login-04:dp9 run1 $ cylc config suite_test > parsed_conf
(productive) gadi-login-04:dp9 run1 $ less parsed_conf

contents of job script

# Job submit method: background

contents of parsed_conf

[[pfx_um_recon_cold]]
script = rose task-run --verbose --verbose --path="site/nci/bin"
platform = PBS

[[[directives]]]
-W umask = 0022
-P = dp9
-l storage = gdata/dp9+gdata/access+scratch/access+gdata/ig2
-q = normal
-l ncpus = 16
-l mem = 64G

And then I modified the -l storage directives, reinstalled and reloaded. cylc config shows the updated detail, but if I retrigger the task, it is still submitting to the background. (I had forgotten to inherit the HPC family to set the platform originally.) So it is persistently wrong. I have managed to run via PBS for another task, so the platform itself should be correctly configured.
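
The fix was along these lines (the HPC family and its settings are from my own suite, shown only roughly):

[runtime]
    [[HPC]]
        platform = PBS
        # ... PBS directives ...
    [[pfx_um_recon_cold]]
        inherit = HPC    # this line was missing originally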

Now that I think about it, the previous case also involved modifications to inheritance. Could that somehow be a factor in the task config getting stuck in the past?

Stopping and restarting the suite got the task config to update, so it now runs on PBS.

Is there a way to preview a job script, so I can check changes before triggering the task?

Jobscripts are generated at the last moment before job submission, and Cylc 8 doesn’t provide a way to “preview” one because it complicates the handling of job submission number and workflow provenance.

You can manually trigger the task and look at the resulting job script then. That’s another good reason to use a quick dummy workflow to figure out issues like this, as Cylc does not care whether the task runs "sleep 10" or a massive scientific model.

By the way, I was unable to reproduce this problem the other day (forgot to post another reply, sorry). If it works in a dummy workflow it should work in your real one, if the platform and job runner are the same. If you confirm again that the dummy case works, try looking for more information in the logs of the real case (job.out, job.err, job.activity, and the scheduler log).
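
Everything relevant lives under the run directory, for example (cycle and task names as in your output above):

cd ~/cylc-run/suite_test/run1
less log/job/20220321T0600Z/pfx_um_recon_cold/NN/job       # generated job script
less log/job/20220321T0600Z/pfx_um_recon_cold/NN/job.out   # job stdout
less log/job/20220321T0600Z/pfx_um_recon_cold/NN/job.err   # job stderr
# the scheduler log is also under log/ in the same run directory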

@srennie - we might have run into your reload bug; can you confirm you were running in backward compatibility mode above? (which is switched on by the old suite.rc config filename).

No, I was working in a pure cylc8 suite, no legacy files at all.

OK, so much for that idea - thanks!