I am porting my old Cylc 7 suites to Cylc 8.
Here I am describing an issue I faced when my workflow stalled and never progressed past the initial cycle point.
I’m starting the suite with the following configuration.
Extract from rose-suite.conf:
[template variables]
START_CYCLE="20231101T0000Z"
...
Extract from flow.cylc:
[scheduling]
initial cycle point constraints = T00, T12
initial cycle point = {{ START_CYCLE }}
runahead limit = P1D
...
So the first cycle point is 20231101T0000Z.
When I ran the suite it never got beyond that initial cycle point (other than the runahead) and eventually stalled. To investigate I revived it with cylc play and ran some checks.
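Reviving the stopped scheduler was just the following (the run name run2 is taken from my log paths below; yours may differ):
$ cylc play downloader_suite_cl4/run2
Then the checks: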
$ cylc dump downloader_suite_cl4 | grep cycle
newest active cycle point=20231103T0000Z
oldest active cycle point=20231101T0000Z
So the oldest active cycle point hadn’t moved; it was stuck. I inspected all tasks at that cycle point with:
$ cylc show downloader_suite_cl4 "//20231101T0000Z/*" | grep state
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
state: succeeded
So all tasks that had been instantiated at that cycle point had been completed.
I looked at the status of those tasks in more detail:
$ cylc dump -t downloader_suite_cl4 | grep 20231101T0000Z
crop_gpm_precip, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_med_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_med_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_medwam_analysis_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_medwam_forecast_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_mfwam_analysis_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_mfwam_forecast_00, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_psy4_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_cmems_psy4_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_ecmwf_era5, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
get_gpm_precip, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
housekeep, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_med_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_med_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_psy4_analysis, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
run_uscc_psy4_forecast, 20231101T0000Z, succeeded, not-held, not-queued, not-runahead
Still nothing. Everything that needs to run has succeeded.
In my suites the housekeep task is intended to be the last task that runs in each cycle point. So I inspected this task:
$ cylc show downloader_suite_cl4 "//20231101T0000Z/housekeep"
title: (not given)
description: (not given)
URL: (not given)
state: succeeded
prerequisites: (n/a for past tasks)
outputs: ('-': not completed)
- 20231101T0000Z/housekeep expired
+ 20231101T0000Z/housekeep submitted
- 20231101T0000Z/housekeep submit-failed
+ 20231101T0000Z/housekeep started
+ 20231101T0000Z/housekeep succeeded
- 20231101T0000Z/housekeep failed
Again, the housekeep task had succeeded, so the suite should have moved on.
So I inspected the scheduler log:
$ grep 'is waiting on' /home/cylc/cylc-run/downloader_suite_cl4/run2/log/scheduler/log | grep 20231101T0000Z
* 20231101T0000Z/upd_convert_gfs is waiting on ['20231101T0000Z/upd_get_ncep_gfs:succeeded']
* 20231101T0000Z/upd_convert_gfs is waiting on ['20231101T0000Z/upd_get_ncep_gfs:succeeded']
So that tells me the cycle point is stuck because 20231101T0000Z/upd_convert_gfs is waiting for 20231101T0000Z/upd_get_ncep_gfs. That task has a clock-trigger of upd_get_ncep_gfs(+PT3H30M) and a clock-expire of upd_get_ncep_gfs(P9D), so it should have expired by now. So why is upd_convert_gfs still waiting for upd_get_ncep_gfs?
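For reference, the clock settings for that task in my flow.cylc look roughly like this (paraphrased from my suite; the surrounding section layout may differ slightly):
[scheduling]
    [[special tasks]]
        clock-trigger = upd_get_ncep_gfs(+PT3H30M)
        clock-expire = upd_get_ncep_gfs(P9D)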
The relevant section of the graph looks like this:
[[[T00]]]
graph = """
upd_get_ncep_gfs? => upd_convert_gfs
(sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired) => upd_convert_gfs
(upd_get_ncep_gfs:expired | upd_convert_gfs) => housekeep
"""
The intention is for upd_convert_gfs to use the outputs of sst_get_ncep_gfs and upd_get_ncep_gfs, where the sst data is optional because it expires sooner. The convert task may run with gfs+sst data or with gfs data alone.
But upd_convert_gfs should definitely not run if neither task completed, so it should never be instantiated if both have expired.
And I think this is where my problem was: something had caused the upd_convert_gfs task to be instantiated, and it was now sitting there, waiting, and holding everything up.
I then realised that the problem must be in this line:
(sst_get_ncep_gfs[+PT18H] | sst_get_ncep_gfs[+PT18H]:expired) => upd_convert_gfs
and I changed the graph to be like this:
[[[T00]]]
graph = """
upd_get_ncep_gfs? => upd_convert_gfs
(sst_get_ncep_gfs[+PT18H]? | sst_get_ncep_gfs[+PT18H]:expired?) => upd_convert_gfs
(upd_get_ncep_gfs:expired | upd_convert_gfs) => housekeep
"""
I added the question mark to all outputs that could spawn an upd_convert_gfs task.
And now the suite runs without stalling.
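For anyone following along, applying a graph change like this to an already-installed workflow is roughly the following (run cylc validate from the workflow source directory; I’m assuming the standard cylc install layout and the run name from my log paths):
$ cylc validate .
$ cylc reinstall downloader_suite_cl4/run2
$ cylc reload downloader_suite_cl4/run2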
I hope this helps someone.
I also have two questions for the support team:
- Why did my command lines using cylc show ... | grep state not show the dangling upd_convert_gfs task? Those commands showed that all tasks at that cycle point had completed, but the dangling upd_convert_gfs task which was holding things up wasn’t shown as waiting. Why?
- After running the suite for a while longer the workflow advanced beyond the runahead of P1D, but cylc dump still showed oldest active cycle point=20231101T0000Z. Why? Shouldn’t that cycle point be garbage collected at some point? Currently the workflow is already on newest active cycle point=20231105T0000Z, but the oldest active cycle point is still shown as 20231101T0000Z. How does that get garbage collected?
Thanks