Dependencies during bootstrapping (R1 tasks)

fredw · November 2, 2023, 8:03pm

I am seeing odd behaviour in a suite I’m porting from cylc7 to cylc8 concerning the bootstrapping section.
The logic I’m trying to achieve is explained in an earlier question [Controlling which R1 tasks run upon bootstrapping]
but here’s a short summary

    [[dependencies]]
        [[[R1]]]    # Run once at initial cycle
            graph = """
                make_grid => extract_grid_dimension => install_sources => build
            """ 
        [[[R1]]]    # Run once at initial cycle
            graph = """
                make_grid[^] => make_grib_forcing
                make_grid[^] => make_ocn_forcing
....
           """

The idea is that the first cycle runs all sorts of tasks that build the model config and compile the code.

The make_grid task builds the model grid file, and all subsequent tasks that need the grid file have to wait until this is complete.

That’s what the second R1 block does. The first time the forcing is built it has to wait for the make_grid task to finish (so we have the grid file on disk).

This used to work fine in cylc7, but with cylc8 I get failed tasks because the forcing tasks don’t wait for make_grid to finish.

Has this changed in cylc8?

I tried to google this, but searching for “R1” is tricky, as the search term is too short.

Thanks
Fred

hilary.j.oliver · November 2, 2023, 11:31pm

Hi @fredw -

I’m not sure I understand the problem. I just ran your graph, exactly as you’ve written it, and it does exactly what you want - the forcing tasks wait for make_grid to finish.

BTW, it runs the same, but note that (a) you don’t need the explicit initial point dependence in make_grid[^] because that graph string only runs at the initial point anyway (R1 is short for R1/^); and (b) there’s no need for two separate R1 sections either. In Cylc 8 syntax, the equivalent (as a complete runnable example) is:

[scheduler]
   allow implicit tasks = True
[scheduling]
   [[graph]]
      R1 = """
            make_grid => extract_grid_dimension => install_sources => build
            make_grid => make_grib_forcing & make_ocn_forcing
      """
[runtime]
   [[root]]
      script = sleep 10

Proof that it works:

$  cylc log fre | grep -E '=> (submitted|succeeded)' 
[1/make_grid preparing job:01 flows:1] => submitted
[1/make_grid running job:01 flows:1] => succeeded
[1/extract_grid_dimension preparing job:01 flows:1] => submitted
[1/make_grib_forcing preparing job:01 flows:1] => submitted
[1/make_ocn_forcing preparing job:01 flows:1] => submitted
[1/extract_grid_dimension running job:01 flows:1] => succeeded
[1/make_grib_forcing running job:01 flows:1] => succeeded
[1/make_ocn_forcing running job:01 flows:1] => succeeded
[1/install_sources preparing job:01 flows:1] => submitted
[1/install_sources running job:01 flows:1] => succeeded
[1/build preparing job:01 flows:1] => submitted
[1/build running job:01 flows:1] => succeeded

hilary.j.oliver · November 3, 2023, 1:44am

(Perhaps you mean there is a cycling graph after this, which you haven’t shown, that is not waiting for make_grid[^] to finish?)

fredw · November 3, 2023, 8:49pm

Thanks for looking into it.

The full graph is

    [[dependencies]]
        [[[R1]]]    # Run once at initial cycle
            graph = """
                make_grid => extract_grid_dimension => install_croco_sources => build_mpp_optimiz
                install_croco_sources => build_croco
                download_crocotools_datasets => make_grid
                (make_grid & build_mpp_optimiz) => mpi_noland_preprocessing => croco_run_set_directives
                mpi_noland_preprocessing => build_croco
                make_grid => make_mesh => create_regrid_meshes => create_interp_weights
                build_interp_tools => create_interp_weights
                build_uscc_engine
            """

        [[[R1]]]    # Run once at initial cycle
            graph = """
                build_uscc_engine[^] => upd_regrid_run_uscc

                make_grid[^] => make_GFS_from_grib
                make_grid[^] => make_OGCM_frcst_1_download
                install_ext_restarts => upd_poll_restarts

                (build_croco[^] & croco_run_set_directives[^]) => upd_croco_run
                download_crocotools_datasets[^] => make_tides
                create_interp_weights[^] => upd_croco_regrid

                {% if FORECAST_DAYS|int > 0 %}
                (build_croco[^] & croco_run_set_directives[^]) => fcst_croco_run
                create_interp_weights[^] => fcst_croco_regrid
                build_uscc_engine[^] => fcst_regrid_run_uscc
                {% endif %}  {# FORECAST_DAYS>0 #}
            """
        # Clock triggered start tasks
        [[[ T00 ]]]
            graph = "upd_suite_start_00 => poll_GFS_files_grib & poll_motu_mercator"
        [[[ T00 ]]]
            graph = """
                poll_GFS_files_grib => make_GFS_from_grib => make_tides
                poll_motu_mercator => make_OGCM_frcst_1_download => make_OGCM_frcst_2_convert => make_OGCM_frcst_3_interp
                make_tides => upd_croco_run
                make_OGCM_frcst_3_interp => upd_croco_run
                upd_croco_run[-{{CYCLING_INTVAL}}] => upd_poll_restarts => upd_croco_run
                upd_croco_run => upd_raw_archive => housekeep
                upd_croco_run => upd_croco_regrid => upd_regrid_run_uscc => upd_croco_archive => housekeep
            """
        # Forecast: additional model run to continue from end of update run into the future
        [[[ T00 ]]]
            graph = """
                upd_croco_run => fcst_poll_restarts
                fcst_poll_restarts => fcst_croco_run
                fcst_poll_restarts:succeeded => !fcst_expire_tasks
                fcst_poll_restarts:expired => fcst_expire_tasks
                make_tides => fcst_croco_run
                make_OGCM_frcst_3_interp => fcst_croco_run
                fcst_croco_run => fcst_raw_archive
                (fcst_raw_archive | fcst_raw_archive:expired) => housekeep
                fcst_croco_run => fcst_croco_regrid => fcst_regrid_run_uscc => fcst_croco_archive
                (fcst_croco_archive | fcst_croco_archive:expired) => housekeep
            """

I tried to show just a cut-down version to illustrate the problem, but as you said, in the cut-down version there is no problem.

With the full graph, what I’m seeing is that make_GFS_from_grib and make_OGCM_frcst_1_download run before make_grid has finished, which leads to code running with incomplete or non-existent files and failing.

I think what’s happening here is that, on the new system I’m porting this to, the R1 bootstrapping tasks run much faster (new hardware), and thus it runs tasks from the second cycle point before the prep tasks in R1 are finished. That never used to happen on the old (slower) system.

So maybe I need to sequence these tasks. Basically apply the [^] rule in R1 also to the second and third cycle point, so those tasks also wait for make_grid in R1

Is that possible?

fredw · November 3, 2023, 9:29pm

So I looked into the timings of when these ran after the suite started up for the first time.

so here’s make_grid:

CYLC_JOB_INIT_TIME=2023-11-02T20:03:01Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2023-11-02T20:03:06Z

and here are the 3 cycle points of make_OGCM_frcst_1_download that ran before the suite stalled due to errors:

::::::::::::::
20231020T0000Z/make_OGCM_frcst_1_download/01/job.status
::::::::::::::
CYLC_JOB_INIT_TIME=2023-11-02T20:03:10Z
CYLC_JOB_EXIT=TERM
CYLC_JOB_EXIT_TIME=2023-11-02T20:09:36Z

::::::::::::::
20231021T0000Z/make_OGCM_frcst_1_download/01/job.status
::::::::::::::
CYLC_JOB_INIT_TIME=2023-11-02T19:49:57Z
CYLC_JOB_EXIT=ERR
CYLC_JOB_EXIT_TIME=2023-11-02T19:50:01Z

::::::::::::::
20231022T0000Z/make_OGCM_frcst_1_download/01/job.status
::::::::::::::
CYLC_JOB_INIT_TIME=2023-11-02T19:49:58Z
CYLC_JOB_EXIT=ERR
CYLC_JOB_EXIT_TIME=2023-11-02T19:50:02Z

So the first run of 20231020T0000Z/make_OGCM_frcst_1_download did wait for make_grid to finish.

So that seems to confirm my hypothesis that the order was wrong in which these tasks were run

I then added those two tasks to the sequential keyword, and the suite is running fine so far:

    [[special tasks]]
        sequential =  housekeep, make_OGCM_frcst_1_download, make_GFS_from_grib

Now my question is: how can I avoid running those tasks in sequence, while still making each future instance of make_OGCM_frcst_1_download and make_GFS_from_grib wait until the R1 run of make_grid has finished?

In its original version the task 20231020T0000Z/make_OGCM_frcst_1_download waited for the R1 run of make_grid (NOTE that this is the initial cycle point), but the subsequent task 20231021T0000Z/make_OGCM_frcst_1_download didn’t wait.

How can I express that in the graph?

Thank you very much
Fred

hilary.j.oliver · November 3, 2023, 11:23pm

Yep, your diagnosis is correct. Although you were probably just lucky that it worked before on slower hardware.

Here’s what cylc graph -c shows for the first three cycle points (note I guessed your two Jinja2 variables as CYCLING_INTVAL = "P1D" and FORECAST_DAYS = 0):

I have added red dots to parentless tasks - remember Cylc has “no barrier between cycles” so all of those can start running at once, at start-up.

As you can see (in yellow) - the tasks you want to wait on make_grid are only forced to do so by dependencies in the first cycle point.

To fix it, all you need to do is add the same initial point dependencies to those tasks in subsequent cycle points (i.e., in your repeat daily at T00 sections):

[[[T00]]]
    graph = """
        make_grid[^] => make_GFS_from_grib
        make_grid[^] => make_OGCM_frcst_1_download
    """
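
To see why the R1 dependencies alone are not enough, here is a toy model in Python. It is a sketch only, not Cylc code: the functions expand and waits_for, the integer cycle numbers 1 to 3, and the tuple layout are all made up for illustration, and only one of the real tasks is modelled. A "pinned" upstream stands for a task written as task[^].

```python
from collections import defaultdict

def expand(sections, n_cycles):
    """Expand graph sections into parent links between (cycle, task) instances.

    sections: list of (recurrence, upstream, downstream, pinned) tuples.
    recurrence is "R1" (initial point only) or "T00" (every point in this toy);
    pinned=True models an upstream written as task[^].
    """
    parents = defaultdict(set)
    for cycle in range(1, n_cycles + 1):
        for recurrence, up, down, pinned in sections:
            if recurrence == "R1" and cycle != 1:
                continue  # R1 sections exist only at the initial point
            up_cycle = 1 if pinned else cycle
            parents[(cycle, down)].add((up_cycle, up))
    return parents

def waits_for(parents, node, target):
    """Return True if target is an ancestor of node (depth-first search)."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False

# The graph as posted: make_grid[^] appears only in an R1 section.
original = [
    ("R1", "make_grid", "make_GFS_from_grib", True),
    ("T00", "poll_GFS_files_grib", "make_GFS_from_grib", False),
]
# The suggested fix: repeat the initial-point dependency in the T00 section.
fixed = original + [("T00", "make_grid", "make_GFS_from_grib", True)]

def report(sections):
    parents = expand(sections, 3)
    return [
        waits_for(parents, (cycle, "make_GFS_from_grib"), (1, "make_grid"))
        for cycle in (1, 2, 3)
    ]

before = report(original)
after = report(fixed)
print(before)  # [True, False, False]
print(after)   # [True, True, True]
```

In the toy model only the first instance of make_GFS_from_grib has make_grid as an ancestor until the dependency is repeated in the recurring section, which matches the behaviour described above.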

[As an aside, while this kind of graph structure works perfectly well, it is arguably a bit inelegant in that maintaining initial point dependence forever seems like overkill. To make this sort of thing simpler a future release will allow distinct start-up and shutdown graphs that are separated from the main graph by a hard barrier - largely implemented already but trumped by more urgent development for the moment].

fredw · November 6, 2023, 8:20pm

That makes so much sense. Thank you for pointing out this simple change!

Looking at your graph it looks like you’re using the old interface to plot that. That’s what the graph should look like.
But in the new web-based cylc8 GUI I can’t see this at all. The graph is hacked up into small bits all over the place (see below) and nothing is connected up. I cannot group the items into boxes per cycle point either. And the suite doesn’t really run properly - it constantly stalls. I have to trigger everything manually, so this suite is completely broken in cylc8. Do you have any ideas why this might be?

Screenshot of the graph view in the new cylc8 GUI:

zoomed in a bit:

hilary.j.oliver · November 6, 2023, 10:14pm

No, the cylc graph command, for static graph visualization, is part of Cylc 8 too.

The Cylc 8 web UI is designed for efficient monitoring and control of running workflows, which has quite different requirements compared to graph visualization.

- Graph layout algorithms are computationally expensive, so large workflows could overwhelm your browser if the full graph is displayed by default.
- The Cylc 8 scheduler only needs to keep track of “active tasks” as they move along the (potentially infinite) graph; how much you see around those active tasks is now just a visualization choice.

By default (for efficiency and de-cluttering reasons) you see the active tasks plus n=1 graph edges (arrows) out from them. That can result in distinct chunks because task activity does not have to be contiguous in the graph.

To see a more joined-up graph (with more waiting and completed tasks in between the active ones) use “Set Graph Window Extent” in the workflow drop-down menu in the web UI.
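
The effect of the window extent can be illustrated with a small breadth-first search. This is a sketch only, not Cylc’s implementation: the function name graph_window, the toy chain of tasks a to f, and the assumption that the window extends along edges in both directions are all illustrative.

```python
from collections import deque

def graph_window(edges, active, n):
    """Return the set of tasks within n graph edges of any active task."""
    neighbours = {}
    for up, down in edges:
        neighbours.setdefault(up, set()).add(down)
        neighbours.setdefault(down, set()).add(up)
    distance = {task: 0 for task in active}
    queue = deque(active)
    while queue:
        task = queue.popleft()
        if distance[task] == n:
            continue  # do not look beyond the window edge
        for other in neighbours.get(task, ()):
            if other not in distance:
                distance[other] = distance[task] + 1
                queue.append(other)
    return set(distance)

def visible_edges(edges, window):
    """Keep only edges whose two ends are both inside the window."""
    return [(up, down) for up, down in edges if up in window and down in window]

# A toy chain a => b => c => d => e => f with activity at both ends.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
active = ["a", "f"]

window_1 = graph_window(edges, active, 1)
window_2 = graph_window(edges, active, 2)
edges_1 = visible_edges(edges, window_1)
edges_2 = visible_edges(edges, window_2)

print(sorted(window_1))  # ['a', 'b', 'e', 'f']
print(edges_1)           # [('a', 'b'), ('e', 'f')]
print(len(edges_2))      # 5
```

With n=1 the two active tasks produce two disconnected chunks; widening the window to n=2 joins them into a single chain.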

“And the suite doesn’t really run properly - it constantly stalls.”

That has nothing to do with graph visualization. A “stall” means the scheduler can’t run anything because something (normally, one of your tasks, not Cylc) has gone wrong. The scheduler log will report the presence of incomplete tasks (e.g. failed tasks that were required to succeed) and/or tasks that can’t run because they have partially (not fully) satisfied prerequisites.

What does your scheduler log say?
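
When the log lists many partially satisfied prerequisites, it can help to tally which outputs are missing most often. The following is a minimal sketch, assuming log lines of the form `* <cycle>/<task> is waiting on ['<cycle>/<task>:<output>', ...]`; the function name tally_unsatisfied and the two sample lines are illustrative, and a wrapped or truncated line is simply counted as far as it goes.

```python
import re
from collections import Counter

# Matches "* <cycle>/<task> is waiting on [ ..." and captures the list text.
WAITING = re.compile(r"\*\s+(\S+) is waiting on \[(.*)")

def tally_unsatisfied(log_lines):
    """Count how often each task:output appears as an unsatisfied prerequisite."""
    counts = Counter()
    for line in log_lines:
        match = WAITING.search(line)
        if not match:
            continue
        # Pull out every quoted item rather than parsing the list strictly,
        # so that a truncated line still contributes what it contains.
        for item in re.findall(r"'([^']+)'", match.group(2)):
            _cycle, _, task_output = item.partition("/")
            counts[task_output] += 1
    return counts

sample = [
    "      * 20231022T1900Z/housekeep is waiting on "
    "['20231022T1900Z/get_seviri:expired', "
    "'20231022T1900Z/get_seviri:succeeded', "
    "'20231022T1800Z/housekeep:succeeded']",
    "      * 20231022T2000Z/housekeep is waiting on "
    "['20231022T2000Z/get_seviri:expired', "
    "'20231022T2000Z/get_seviri:succeeded', "
    "'20231022T1900Z/housekeep:succeeded']",
]

counts = tally_unsatisfied(sample)
for task_output, number in counts.most_common():
    print(number, task_output)
```

Outputs that show up in every stalled cycle point are usually the first place to look.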

fredw · November 13, 2023, 8:45pm

Hello again,

I’ve investigated the stalling suite again. I found this in the workflow log:

2023-11-13T18:22:51Z ERROR - Incomplete tasks:
      * 20231024T0400Z/crop_gpm_precip did not complete required outputs: ['succeeded']
2023-11-13T18:22:51Z WARNING - Partially satisfied prerequisites:
      * 20231012T0000Z/housekeep is waiting on ['20231012T0000Z/get_modis_chla:expired', '20231012T0000Z/crop_modis_chla:succeeded', '20231012T0000Z/upd_convert_gfs:succeeded', '20231012T0000Z/upd_get_ncep_gfs:expired', '20231012T0000Z/get_modis_kd490:expired', '20231012T0000Z/crop_modis_kd490:succeeded', '20231012T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231012T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231012T0000Z/get_seviri:expired', '20231012T0000Z/get_seviri:succeeded']
      * 20231013T0000Z/housekeep is waiting on ['20231013T0000Z/get_modis_chla:expired', '20231013T0000Z/crop_modis_chla:succeeded', '20231013T0000Z/upd_convert_gfs:succeeded', '20231013T0000Z/upd_get_ncep_gfs:expired', '20231013T0000Z/get_modis_kd490:expired', '20231013T0000Z/crop_modis_kd490:succeeded', '20231013T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231013T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231013T0000Z/get_seviri:expired', '20231013T0000Z/get_seviri:succeeded', '20231012T2300Z/housekeep:succeeded']
      * 20231014T0000Z/housekeep is waiting on ['20231014T0000Z/get_modis_chla:expired', '20231014T0000Z/crop_modis_chla:succeeded', '20231014T0000Z/upd_convert_gfs:succeeded', '20231014T0000Z/upd_get_ncep_gfs:expired', '20231014T0000Z/get_modis_kd490:expired', '20231014T0000Z/crop_modis_kd490:succeeded', '20231014T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231014T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231014T0000Z/get_seviri:expired', '20231014T0000Z/get_seviri:succeeded', '20231013T2300Z/housekeep:succeeded']
      * 20231012T1200Z/housekeep is waiting on ['20231012T1200Z/fcst07_get_ncep_gfs:expired', '20231012T1200Z/fcst07_convert_gfs:succeeded', '20231012T1200Z/fcst02_get_ncep_gfs:expired', '20231012T1200Z/fcst02_convert_gfs:succeeded', '20231012T1200Z/fcst04_get_ncep_gfs:expired', '20231012T1200Z/fcst04_convert_gfs:succeeded', '20231012T1200Z/fcst10_convert_gfs:succeeded', '20231012T1200Z/fcst10_get_ncep_gfs:expired', '20231012T1200Z/fcst01_convert_gfs:succeeded', '20231012T1200Z/fcst01_get_ncep_gfs:expired', '20231012T1200Z/fcst11_convert_gfs:succeeded', '20231012T1200Z/fcst11_get_ncep_gfs:expired', '20231012T1200Z/fcst15_convert_gfs:succeeded', '20231012T1200Z/fcst15_get_ncep_gfs:expired', '20231012T1200Z/fcst14_get_ncep_gfs:expired', '20231012T1200Z/fcst14_convert_gfs:succeeded', '20231012T1200Z/fcst06_convert_gfs:succeeded', '20231012T1200Z/fcst06_get_ncep_gfs:expired', '20231012T1200Z/fcst09_get_ncep_gfs:expired', '20231012T1200Z/fcst09_convert_gfs:succeeded',
    '20231012T1200Z/fcst13_convert_gfs:succeeded', '20231012T1200Z/fcst13_get_ncep_gfs:expired', '20231012T1200Z/upd_convert_gfs:succeeded', '20231012T1200Z/upd_get_ncep_gfs:expired', '20231012T1200Z/fcst00_convert_gfs:succeeded', '20231012T1200Z/fcst00_get_ncep_gfs:expired', '20231012T1200Z/fcst05_convert_gfs:succeeded', '20231012T1200Z/fcst05_get_ncep_gfs:expired', '20231012T1200Z/fcst03_convert_gfs:succeeded', '20231012T1200Z/fcst03_get_ncep_gfs:expired', '20231012T1200Z/fcst08_get_ncep_gfs:expired', '20231012T1200Z/fcst08_convert_gfs:succeeded', '20231012T1200Z/fcst12_get_ncep_gfs:expired', '20231012T1200Z/fcst12_convert_gfs:succeeded', '20231012T1200Z/wrfda_get_ncep_gdas_obs:expired', '20231012T1200Z/wrfda_get_ncep_gdas_obs:succeeded', '20231012T1200Z/get_seviri:expired', '20231012T1200Z/get_seviri:succeeded', '20231012T1100Z/housekeep:succeeded']
      * 20231015T0000Z/housekeep is waiting on ['20231015T0000Z/get_modis_chla:expired', '20231015T0000Z/crop_modis_chla:succeeded', '20231015T0000Z/upd_convert_gfs:succeeded', '20231015T0000Z/upd_get_ncep_gfs:expired', '20231015T0000Z/get_modis_kd490:expired', '20231015T0000Z/crop_modis_kd490:succeeded', '20231015T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231015T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231015T0000Z/get_seviri:expired', '20231015T0000Z/get_seviri:succeeded', '20231014T2300Z/housekeep:succeeded']
      * 20231013T1200Z/housekeep is waiting on ['20231013T1200Z/fcst07_get_ncep_gfs:expired', '20231013T1200Z/fcst07_convert_gfs:succeeded', '20231013T1200Z/fcst02_get_ncep_gfs:expired', '20231013T1200Z/fcst02_convert_gfs:succeeded', '20231013T1200Z/fcst04_get_ncep_gfs:expired', '20231013T1200Z/fcst04_convert_gfs:succeeded', '20231013T1200Z/fcst10_convert_gfs:succeeded', '20231013T1200Z/fcst10_get_ncep_gfs:expired', '20231013T1200Z/fcst01_convert_gfs:succeeded', '20231013T1200Z/fcst01_get_ncep_gfs:expired', '20231013T1200Z/fcst11_convert_gfs:succeeded', '20231013T1200Z/fcst11_get_ncep_gfs:expired', '20231013T1200Z/fcst15_convert_gfs:succeeded', '20231013T1200Z/fcst15_get_ncep_gfs:expired', '20231013T1200Z/fcst14_get_ncep_gfs:expired', '20231013T1200Z/fcst14_convert_gfs:succeeded', '20231013T1200Z/fcst06_convert_gfs:succeeded', '20231013T1200Z/fcst06_get_ncep_gfs:expired', '20231013T1200Z/fcst09_get_ncep_gfs:expired', '20231013T1200Z/fcst09_convert_gfs:succeeded',
    '20231013T1200Z/fcst13_convert_gfs:succeeded', '20231013T1200Z/fcst13_get_ncep_gfs:expired', '20231013T1200Z/upd_convert_gfs:succeeded', '20231013T1200Z/upd_get_ncep_gfs:expired', '20231013T1200Z/fcst00_convert_gfs:succeeded', '20231013T1200Z/fcst00_get_ncep_gfs:expired', '20231013T1200Z/fcst05_convert_gfs:succeeded', '20231013T1200Z/fcst05_get_ncep_gfs:expired', '20231013T1200Z/fcst03_convert_gfs:succeeded', '20231013T1200Z/fcst03_get_ncep_gfs:expired', '20231013T1200Z/fcst08_get_ncep_gfs:expired', '20231013T1200Z/fcst08_convert_gfs:succeeded', '20231013T1200Z/fcst12_get_ncep_gfs:expired', '20231013T1200Z/fcst12_convert_gfs:succeeded', '20231013T1200Z/wrfda_get_ncep_gdas_obs:expired', '20231013T1200Z/wrfda_get_ncep_gdas_obs:succeeded', '20231013T1200Z/get_seviri:expired', '20231013T1200Z/get_seviri:succeeded', '20231013T1100Z/housekeep:succeeded']
      * 20231016T0000Z/housekeep is waiting on ['20231016T0000Z/get_modis_chla:expired', '20231016T0000Z/crop_modis_chla:succeeded', '20231016T0000Z/upd_convert_gfs:succeeded', '20231016T0000Z/upd_get_ncep_gfs:expired', '20231016T0000Z/get_modis_kd490:expired', '20231016T0000Z/crop_modis_kd490:succeeded', '20231016T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231016T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231016T0000Z/get_seviri:expired', '20231016T0000Z/get_seviri:succeeded', '20231015T2300Z/housekeep:succeeded']
      * 20231017T0000Z/housekeep is waiting on ['20231017T0000Z/get_modis_chla:expired', '20231017T0000Z/crop_modis_chla:succeeded', '20231017T0000Z/upd_convert_gfs:succeeded', '20231017T0000Z/upd_get_ncep_gfs:expired', '20231017T0000Z/get_modis_kd490:expired', '20231017T0000Z/crop_modis_kd490:succeeded', '20231017T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231017T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231017T0000Z/get_seviri:expired', '20231017T0000Z/get_seviri:succeeded', '20231016T2300Z/housekeep:succeeded']
      * 20231018T0000Z/housekeep is waiting on ['20231018T0000Z/get_modis_chla:expired', '20231018T0000Z/crop_modis_chla:succeeded', '20231018T0000Z/upd_convert_gfs:succeeded', '20231018T0000Z/upd_get_ncep_gfs:expired', '20231018T0000Z/get_modis_kd490:expired', '20231018T0000Z/crop_modis_kd490:succeeded', '20231018T0000Z/wrfda_get_ncep_gdas_obs:expired', '20231018T0000Z/wrfda_get_ncep_gdas_obs:succeeded', '20231018T0000Z/get_seviri:expired', '20231018T0000Z/get_seviri:succeeded', '20231017T2300Z/housekeep:succeeded']
      * 20231014T1200Z/housekeep is waiting on ['20231014T1200Z/fcst07_get_ncep_gfs:expired', '20231014T1200Z/fcst07_convert_gfs:succeeded', '20231014T1200Z/fcst02_get_ncep_gfs:expired', '20231014T1200Z/fcst02_convert_gfs:succeeded', '20231014T1200Z/fcst04_get_ncep_gfs:expired', '20231014T1200Z/fcst04_convert_gfs:succeeded', '20231014T1200Z/fcst10_convert_gfs:succeeded', '20231014T1200Z/fcst10_get_ncep_gfs:expired', '20231014T1200Z/fcst01_convert_gfs:succeeded', '20231014T1200Z/fcst01_get_ncep_gfs:expired', '20231014T1200Z/fcst11_convert_gfs:succeeded', '20231014T1200Z/fcst11_get_ncep_gfs:expired', '20231014T1200Z/fcst15_convert_gfs:succeeded', '20231014T1200Z/fcst15_get_ncep_gfs:expired', '20231014T1200Z/fcst14_get_ncep_gfs:expired', '20231014T1200Z/fcst14_convert_gfs:succeeded', '20231014T1200Z/fcst06_convert_gfs:succeeded', '20231014T1200Z/fcst06_get_ncep_gfs:expired', '20231014T1200Z/fcst09_get_ncep_gfs:expired', '20231014T1200Z/fcst09_convert_gfs:succeeded',

and further down:

      * 20231022T1800Z/housekeep is waiting on ['20231022T1800Z/upd_convert_gfs:succeeded', '20231022T1800Z/upd_get_ncep_gfs:expired', '20231022T1800Z/wrfda_get_ncep_gdas_obs:expired', '20231022T1800Z/wrfda_get_ncep_gdas_obs:succeeded', '20231022T1800Z/get_seviri:expired', '20231022T1800Z/get_seviri:succeeded', '20231022T1700Z/housekeep:succeeded']
      * 20231022T1900Z/housekeep is waiting on ['20231022T1900Z/get_seviri:expired', '20231022T1900Z/get_seviri:succeeded', '20231022T1800Z/housekeep:succeeded']
      * 20231022T2000Z/housekeep is waiting on ['20231022T2000Z/get_seviri:expired', '20231022T2000Z/get_seviri:succeeded', '20231022T1900Z/housekeep:succeeded']
      * 20231022T2100Z/housekeep is waiting on ['20231022T2100Z/get_seviri:expired', '20231022T2100Z/get_seviri:succeeded', '20231022T2000Z/housekeep:succeeded']

and at the end

2023-11-13T18:22:51Z CRITICAL - Workflow stalled
2023-11-13T18:22:51Z WARNING - PT1H stall timer starts NOW
2023-11-13T19:22:51Z WARNING - stall timer timed out after PT1H
2023-11-13T19:22:51Z ERROR - Workflow shutting down - "abort on stall timeout" is set
2023-11-13T19:22:51Z INFO - platform: cluster-slurm - remote tidy (on metoc-cl4)
2023-11-13T19:22:52Z INFO - DONE

I find this strange because the graph explicitly says that :expired should also lead to housekeep.

Here’s the section from the graph in question:

        {% if DOWNLOAD_SEVIRI is sameas true %}
        [[[T-00]]] # every hour at zero minutes past (every hour on the hour). 
            # Task is clock triggered at <CYCLE> + 3:25 hours
            graph = "(get_seviri | get_seviri:expired) => housekeep"
        {% endif %}  # DOWNLOAD_SEVIRI

There are more sections like this in the graph, but they all follow a similar pattern.

This syntax used to work in cylc7, am I getting something wrong in cylc8?

A slight variation of the pattern using the expired status is this:

        [[[T18]]]
            # The upd_get_ncep_gfs task is clock triggered at <CYCLE> + 3 hours, 30 minutes
            # The sst_get_ncep_gfs task is clock triggered at 00Z + 22 hours, 35 minutes (which is <CYCLE> + 4hours, 35 minutes)
            graph = """
                upd_get_ncep_gfs => upd_convert_gfs
                (sst_get_ncep_gfs? | sst_get_ncep_gfs:expired?) => upd_convert_gfs
                (upd_get_ncep_gfs:expired? | upd_convert_gfs) => housekeep
            """

and this part of the graph

        {% if DOWNLOAD_NASA_GPM is sameas true %}
        [[[T-00]]] # every hour at zero minutes past (every hour on the hour). Note that the - character takes the place of the hour digits as we may n$
            # Task is clock triggered at <CYCLE> + 15 hours 50 minutes (e.g. the T00 is triggered at 19:50 GST)
            graph = """
                get_gpm_precip? => crop_gpm_precip
                (get_gpm_precip:expired | crop_gpm_precip) => housekeep
            """
        {% endif %}  # DOWNLOAD_NASA_GPM

and I seem to remember you saying that the ?-notation won’t work properly until v8.3. So what does this do in v8.2? I can see the get_gpm_precip and crop_gpm_precip tasks are also holding up the suite and causing it to stall. I’m starting to think this might be the wrong syntax for what I’m trying to do (in cylc7 I did this with suicide triggers).

This suite, which schedules the download of forcing data for various models, makes heavy use of the :expired mechanism. Many download sites do not store data indefinitely; they give you a 3-day or 7-day window within which the download is live, and afterwards the data is no longer available. Hence I expire those tasks that have no chance of ever getting the data once it’s too late.

Thank you for all of your replies so far, and I’m grateful for any help!

Thank you, Fred

hilary.j.oliver · November 13, 2023, 9:47pm

Hi @fredw

OK, the expire trigger is currently broken, sorry. It has already been fixed as part of the manual intervention enhancements to be released soon in 8.3.0.

Expire triggers are pretty niche at my site, but your case for extensive use makes good sense. If you have a desperate need for it, before 8.3.0 comes out, I could tell you how to patch your current version.

Here’s a standalone test workflow to show the problem:

 $ cat flow.cylc
[scheduling]
    initial cycle point = previous(T00)
    [[special tasks]]
        clock-expire = foo(PT0S)
    [[graph]]
        R1 = """
           foo => bar
           foo:expired? => baz
        """
[runtime]
    [[foo, bar, baz]]

The task foo expires immediately, which should trigger baz, but at the moment it does not:

[I] (venv) ~/c/exp $ cylc vip --no-detach --no-timestamp
$ cylc validate /home/oliverh/cylc-src/exp
Valid for cylc-8.2.3
$ cylc install /home/oliverh/cylc-src/exp
INSTALLED exp/run7 from /home/oliverh/cylc-src/exp
$ cylc play --no-detach --no-timestamp exp/run7

 ▪ ■  Cylc Workflow Engine 8.2.3
 ██   Copyright (C) 2008-2023 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/oliverh/cylc-run/exp/run7/.service/etc/job.sh
INFO - Workflow: exp/run7
INFO - Scheduler: url=tcp://NIWA-1022450.niwa.local:43090 pid=25973
INFO - Workflow publisher: url=tcp://NIWA-1022450.niwa.local:43023
INFO - Run: (re)start number=1, log rollover=1
INFO - Cylc version: 8.2.3
INFO - Run mode: live
INFO - Initial point: 20231113T0000Z
INFO - Final point: None
INFO - Cold start from 20231113T0000Z
INFO - New flow: 1 (original flow from 20231113T0000Z) 2023-11-14 10:46:07
INFO - [20231113T0000Z/foo waiting(runahead) job:00 flows:1] => waiting
INFO - [20231113T0000Z/foo waiting job:00 flows:1] => waiting(queued)
WARNING - [20231113T0000Z/foo waiting(queued) job:00 flows:1] Task expired (skipping job).
INFO - [20231113T0000Z/foo waiting(queued) job:00 flows:1] => expired(queued)
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

fredw · November 14, 2023, 11:06am

Hi @hilary.j.oliver
Thanks for confirming this issue with v8.2. What’s your suggestion for getting round this? Is there an 8.3 pre-release branch I can try out?

My project to port the suites to cylc8 is now 3 months late due to various delays, and I need to make a decision whether it’s even feasible to deliver now. A data downloader workflow without expiry triggers is just not possible (it would regularly run into failure statuses after the retries have run out).

Thanks, Fred

oliver.sanders · November 14, 2023, 4:44pm

Apologies for the inconvenience, expire triggers have proven a little tricky to re-implement in the new scheduling algorithm. The 8.3.0 branch is taking shape so shouldn’t be too far away. The work you’re waiting on is under development in pull request #5658 on cylc/cylc-flow (Implement "cylc set" command, by hjoliver). If you’re feeling brave, you could install from this branch for development.

Otherwise, here’s a workaround. It uses a clock trigger (which you would provide with any expiry offset you need) to trigger a task which checks whether the data_downloader task has submitted and, if it hasn’t, triggers the tasks which would have been run by the expiry trigger:

[scheduling]
    initial cycle point = 2020
    [[xtriggers]]
        wall_clock_expire=wall_clock()
    [[graph]]
        R1 = """
            # simulate a workflow getting behind by jamming in a delay
            delay => data_downloader
        """
        P1Y = """
            some_task[-P1Y] => data_downloader

            # data_downloader:expire => recovery_task
            @wall_clock_expire => manual_expire  # temporary workaround

            data_downloader | recovery_task => some_task
        """

[runtime]
    [[manual_expire]]
        script = """
            # check if data_downloader has submitted
            if ! cylc workflow-state \
                "$CYLC_WORKFLOW_ID" \
                --point "$CYLC_TASK_CYCLE_POINT" \
                --task "data_downloader" \
                -S submitted \
                --max-polls=5
            then
                # if it hasn't, then remove it and trigger the recovery_task
                cylc remove "$CYLC_WORKFLOW_ID//$CYLC_TASK_CYCLE_POINT/data_downloader"
                cylc trigger "$CYLC_WORKFLOW_ID//$CYLC_TASK_CYCLE_POINT/recovery_task"
            fi
        """
    [[delay]]
        script = sleep 60
    [[some_task, data_downloader, recovery_task]]

It’s a little ugly but it should do the job.

Note, the max-polls option configures retries so this approach is pretty robust.

fredw · November 14, 2023, 7:42pm

Thanks for the workaround! I just feel I have so many places where I use the :expire trigger that I would explode the complexity of my suites and never really regain any clarity or readability after that.

I checked out the branch mentioned in the PR, and I have installed it into a conda env, and I’m currently testing whether I can get those suites (where I use the :expire trigger) to work

hilary.j.oliver · November 14, 2023, 10:49pm

@fredw -

It’s great that you’ve evidently got that feature branch installed. However, note that it will have other consequences that have not been documented yet (specifically, if you need to use cylc set-outputs at all, that gets replaced by cylc set for manually completing prerequisites and outputs, and more).

But if you want a simpler solution, fixing expire triggers is just a tiny part of that branch. Relative to the 8.2.2 or 8.2.3 releases you only need to add a couple of lines of code to a couple of files:

(Note you will need to do this to task_events_mgr.py and task_pool.py inside your conda or pip virtual environment).

$ git diff
diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index 9a8a19ba1..de785b45e 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -684,6 +684,9 @@ class TaskEventsManager():
                 self.data_store_mgr.delta_job_attr(
                     job_tokens, 'job_id', itask.summary['submit_method_id'])

+        elif message == "expired":
+            self.spawn_func(itask, message)
+
         elif message.startswith(FAIL_MESSAGE_PREFIX):
             # Task received signal.
             if (
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index 6ddc62d39..8f27d7d9d 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -1829,6 +1829,8 @@ class TaskPool:
             if itask.state_reset(TASK_STATUS_EXPIRED, is_held=False):
                 self.data_store_mgr.delta_task_state(itask)
                 self.data_store_mgr.delta_task_held(itask)
+            self.task_events_mgr.process_message(
+                itask, logging.WARNING, "expired")
             self.remove(itask, 'expired')
             return True
         return False

I’ve tested this with the small example above, and the expire triggers do work.

hilary.j.oliver · November 14, 2023, 11:02pm

For the record, I think what Oliver is talking about there is not so much the (re)implementation - which as you can see just above is trivial - but discussions about when we can detect that a task has expired.

In Cylc 8, the more efficient event-driven scheduling algorithm does not become aware of tasks until they are “on demand”, as it were, which means when the first prerequisite gets satisfied. And that is when it will begin to check for task expiration.

Cylc 7 could detect expiration earlier, because tasks got pre-spawned into the scheduler’s awareness before they were needed (about one cycle point ahead, roughly speaking).

However, the reasons for time-based task expiration, in order of importance, are:

1. a task should not run if it expired
2. we might want to trigger other tasks, and event handlers, off of task expiration

In Cylc 8 both of these (will) still work exactly as advertised, but 2. won’t happen as early as it might have in Cylc 7, unless you add a dummy dependency to spawn the expiring task earlier.

(And it’s worth noting that the timing of expiry detection in Cylc 7, while typically earlier than Cylc 8, was also moderated by the scheduling algorithm, not just the clock time).
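To make the timing difference concrete, here is a minimal Python sketch of the idea described above. It is a toy model, not Cylc code: the function name, the integer “clock”, and the prerequisite names are all made up for illustration. The only behaviour it encodes is the one stated in this post, namely that expiry can only be noticed once the task is in the scheduler’s awareness, and that this happens when its first prerequisite is satisfied.

```python
def first_expiry_detection(expire_at, prereq_satisfied_at, ticks):
    """Return the first tick at which a toy scheduler notices expiry.

    expire_at: clock time after which the task counts as expired.
    prereq_satisfied_at: {prerequisite name: time it becomes satisfied}.
    ticks: iterable of clock times at which the scheduler checks its tasks.

    Toy assumption: the task is spawned (becomes visible to the scheduler)
    when its FIRST prerequisite is satisfied, and is only checked after that.
    """
    spawn_at = min(prereq_satisfied_at.values())
    for now in ticks:
        if now >= spawn_at and now >= expire_at:
            return now
    return None


ticks = range(0, 30)

# Only a slow upstream task: expiry (due at t=5) is not noticed until t=20.
print(first_expiry_detection(5, {"upstream": 20}, ticks))  # -> 20

# Extra dummy prerequisite satisfied at t=1: the task is spawned early,
# so expiry is noticed as soon as it is due.
print(first_expiry_detection(5, {"upstream": 20, "dummy": 1}, ticks))  # -> 5
```

In this toy model the extra “dummy” entry plays the role of the dummy dependency mentioned above: it does not change whether the task expires, only how early the scheduler is in a position to notice.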

fredw · November 15, 2023, 8:10pm

Thanks for the patch, I have applied it to 8.2.3 and installed it.

fredw · November 15, 2023, 9:46pm

Hello Hilary and Oliver,

Now that I have applied the patch, can I just confirm that this should be working?

        [[[T00]]]
            graph = """
                get_modis_chla? => fix_modis_nctime_chla => crop_modis_chla
                (get_modis_chla:expired? | crop_modis_chla) => housekeep
            """

I have a feeling it doesn’t, as the suite gets stuck all the time. To check why, I tend to just look at the housekeep task in a given cycle point, as it tells me which tasks are holding it up.

Here’s an example output of: cylc show downloader_suite //.../housekeep

  - (1 | 0)
  -     0 = 20231025T0000Z/crop_modis_chla succeeded
  -     1 = 20231025T0000Z/get_modis_chla expired

I then inspected the 20231025T0000Z/crop_modis_chla task which hasn’t succeeded and I was told the task doesn’t exist. I can’t work out why it wasn’t instantiated, neither was fix_modis_nctime_chla at that cycle point, nor was get_modis_chla an active task at that cycle point.

$ cylc show downloader_suite_cl4 //20231025T0000Z/crop_modis_chla
No matching active tasks found: 20231025T0000Z/crop_modis_chla
$ cylc show downloader_suite_cl4 //20231025T0000Z/fix_modis_nctime_chla
No matching active tasks found: 20231025T0000Z/fix_modis_nctime_chla
$ cylc show downloader_suite_cl4 //20231025T0000Z/get_modis_chla
No matching active tasks found: 20231025T0000Z/get_modis_chla
$ cylc show downloader_suite_cl4 //20231025T0000Z/housekeep
...
  - (1 | 0)
  -     0 = 20231025T0000Z/crop_modis_chla succeeded
  -     1 = 20231025T0000Z/get_modis_chla expired
...
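One way to read the prerequisite block above: the numbers in (1 | 0) are labels for the conditions listed beneath it, and housekeep can run once the whole expression is true. The sketch below is plain Python, not Cylc’s own implementation. evaluate_prerequisite is a hypothetical helper, and it assumes the expression contains only numeric labels, parentheses, | and &.

```python
import re


def evaluate_prerequisite(expression, satisfied):
    """Evaluate an expression such as "(1 | 0)".

    satisfied maps each numeric label to True/False, for example
    {0: False, 1: True} if only condition 1 has been met.
    """
    # Refuse anything other than labels, whitespace, parentheses, | and &.
    if not re.fullmatch(r"[\d\s()|&]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    # Substitute each label with its truth value, then evaluate the result.
    python_expr = re.sub(
        r"\d+",
        lambda m: str(bool(satisfied[int(m.group())])),
        expression,
    )
    return bool(eval(python_expr, {"__builtins__": {}}, {}))


# Neither crop_modis_chla:succeeded (0) nor get_modis_chla:expired (1) met.
print(evaluate_prerequisite("(1 | 0)", {0: False, 1: False}))  # -> False

# The expire output alone is enough to satisfy the "or".
print(evaluate_prerequisite("(1 | 0)", {0: False, 1: True}))  # -> True
```

Read this way, the output says that neither of the two conditions had been recorded as satisfied for that cycle point at the time of the query.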

So I’m clueless as to how the suite got into that state. Looking at the graph, surely one of the conditions that lead to housekeep must have been satisfied at some point.

Thanks for any help!

hilary.j.oliver · November 15, 2023, 9:51pm

Yes, it should be working. I presume you applied both code blocks, not just the one you quoted in your response. To confirm, can you try running my small example above (Dependencies during boostrapping (R1 tasks) - #10 by hilary.j.oliver) to avoid the extra complications of your real workflow?

hilary.j.oliver · November 15, 2023, 9:54pm

Note the error message says “No matching active tasks found”. The cylc show command queries tasks in the scheduler’s “active window”. It can’t see future tasks that have not entered the window yet, or past tasks that successfully completed and so were able to leave the window.

To see if a task ran in the past, check the scheduler log (on disk, or use cylc log <workflow-ID>).

The scheduler log records events as the workflow runs, in order, especially task state changes that occur as outputs are generated and downstream tasks trigger. You should be able to see what happened throughout the run (if it wasn’t obvious during real time monitoring).
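As a rough aid for reading a long scheduler log, the following Python sketch pulls out the state changes of a single task. It is not part of Cylc. The regular expression is an assumption based on the state-transition lines that Cylc 8.2.3 writes (of the form "[point/task old-state job:NN flows:N] => new-state"), and the names TRANSITION, task_transitions and SAMPLE_LOG are made up for this example.

```python
import re

# Assumed line format (Cylc 8.2.3 scheduler log), for example:
# 2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting job:00 flows:1] => waiting(queued)
TRANSITION = re.compile(
    r"^(?P<time>\S+) (?P<level>[A-Z]+) - "
    r"\[(?P<point>[^/\]]+)/(?P<task>\S+) (?P<old>\S+) job:\d+ flows:[^\]]*\] "
    r"=> (?P<new>\S+)$"
)


def task_transitions(log_lines, task_id):
    """Yield (timestamp, old_state, new_state) for one task, in log order.

    task_id has the form "<cycle point>/<task name>".
    Lines that are not state transitions are ignored.
    """
    for line in log_lines:
        match = TRANSITION.match(line.strip())
        if match and f"{match['point']}/{match['task']}" == task_id:
            yield match["time"], match["old"], match["new"]


SAMPLE_LOG = [
    "2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting(runahead) job:00 flows:1] => waiting",
    "2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting job:00 flows:1] => waiting(queued)",
    "2023-11-16T23:52:58+04:00 WARNING - [20231116T0000Z/foo waiting(queued) job:00 flows:1] Task expired (skipping job).",
    "2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting(queued) job:00 flows:1] => expired(queued)",
    "2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/baz waiting(runahead) job:00 flows:1] => waiting",
]

for when, old, new in task_transitions(SAMPLE_LOG, "20231116T0000Z/foo"):
    print(when, old, "=>", new)
```

To use it on a real run, pass an open file object instead of SAMPLE_LOG. In a default Cylc 8 installation the scheduler log is usually found under ~/cylc-run/<workflow-ID>/log/scheduler/log, but check the location on your own system. If a task never appears in the output at all, it was never spawned at that cycle point.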

fredw · November 16, 2023, 7:57pm

I ran the example you posted, which should trigger the task baz:

[scheduling]
    initial cycle point = previous(T00)
    [[special tasks]]
        clock-expire = foo(PT0S)
    [[graph]]
        R1 = """
           foo => bar
           foo:expired? => baz
        """
[runtime]
    [[foo, bar, baz]]

It results in this scheduler log:

2023-11-16T23:52:58+04:00 INFO - Workflow: test_expire/run1
2023-11-16T23:52:58+04:00 INFO - Scheduler: url=tcp://cylcsrvr2:43015 pid=781858
2023-11-16T23:52:58+04:00 INFO - Workflow publisher: url=tcp://cylcsrvr2:43089
2023-11-16T23:52:58+04:00 INFO - Run: (re)start number=1, log rollover=1
2023-11-16T23:52:58+04:00 INFO - Cylc version: 8.2.3
2023-11-16T23:52:58+04:00 INFO - Run mode: live
2023-11-16T23:52:58+04:00 INFO - Initial point: 20231116T0000Z
2023-11-16T23:52:58+04:00 INFO - Final point: None
2023-11-16T23:52:58+04:00 INFO - Cold start from 20231116T0000Z
2023-11-16T23:52:58+04:00 INFO - New flow: 1 (original flow from 20231116T0000Z) 2023-11-16 23:52:58
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting(runahead) job:00 flows:1] => waiting
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting job:00 flows:1] => waiting(queued)
2023-11-16T23:52:58+04:00 WARNING - [20231116T0000Z/foo waiting(queued) job:00 flows:1] Task expired (skipping job).
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/foo waiting(queued) job:00 flows:1] => expired(queued)
2023-11-16T23:52:58+04:00 WARNING - [20231116T0000Z/foo expired(queued) job:00 flows:1] (internal)expired at 2023-11-16T23:52:58+04:00
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/baz waiting(runahead) job:00 flows:1] => waiting
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/baz waiting job:00 flows:1] => waiting(queued)
2023-11-16T23:52:58+04:00 WARNING - [20231116T0000Z/foo expired(queued) job:00 flows:1] did not complete required outputs: ['submitted', 'succeeded']
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/baz waiting(queued) job:00 flows:1] => waiting
2023-11-16T23:52:58+04:00 INFO - [20231116T0000Z/baz waiting job:01 flows:1] => preparing
2023-11-16T23:53:00+04:00 INFO - [20231116T0000Z/baz preparing job:01 flows:1] submitted to localhost:background[781866]
2023-11-16T23:53:00+04:00 INFO - [20231116T0000Z/baz preparing job:01 flows:1] => submitted
2023-11-16T23:53:00+04:00 INFO - [20231116T0000Z/baz submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-11-16T23:53:03+04:00 INFO - [20231116T0000Z/baz submitted job:01 flows:1] => running
2023-11-16T23:53:03+04:00 INFO - [20231116T0000Z/baz running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-11-16T23:53:03+04:00 INFO - [20231116T0000Z/baz running job:01 flows:1] => succeeded
2023-11-16T23:53:03+04:00 INFO - Workflow shutting down - AUTOMATIC
2023-11-16T23:53:03+04:00 INFO - DONE

So I can see that baz was triggered by the expiry, which gives me hope that I’ll be able to get my bigger suites to work as well.

Thanks for those patches and examples!
