Ensembles - moving on with failed runs

I am running an ensemble and using the retry mechanism to make adjustments and restart ensemble members that fail prior to completion. However, these adjustments don’t always work, and sometimes members still fail after retries are exhausted. Currently, when this happens the cylc suite stops. I want to just send an email letting the user know that the ensemble member has failed and have the suite continue to the next step. How do I do that?

sometimes members still fail after retries are exhausted, currently when this happens the cylc suite stops.

By that, I presume you mean the workflow stalls because downstream tasks depend on all ensemble members succeeding:

ensemble:succeed-all => postprocess

If you don’t care how many members succeed or fail, you can just do this:

ensemble:finish-all => postprocess

then postprocess will trigger once all ensemble members have finished (either succeeded or failed).

If you do care how many members fail, we can’t yet encode in the graph that (say) downstream tasks should trigger only if 60% or more succeed, but you can do that easily with an intermediate task:

ensemble:finish-all => check-ensemble => postprocess

Here the check-ensemble task could query the suite server program to compute what fraction of ensemble members succeeded or failed, and then either issue an alert and fail (to halt the workflow) or succeed (to let the workflow carry on).
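As a rough sketch of how that intermediate task might be wired up, assuming Cylc 7 `suite.rc` syntax and a non-cycling graph: `check-ensemble-fraction` is a hypothetical script that you would write yourself (it is not a Cylc command) and place in the suite `bin/` directory, and the 0.6 threshold is only an example value.

```ini
[scheduling]
    [[dependencies]]
        graph = "ensemble:finish-all => check-ensemble => postprocess"
[runtime]
    [[check-ensemble]]
        # Hypothetical user-written script (an assumption, not part of Cylc):
        # it should exit non-zero if fewer than 60% of members succeeded.
        script = check-ensemble-fraction --min-success 0.6
```

Because a job whose script exits non-zero is marked as failed, postprocess would not trigger when the check fails.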

To send emails on task failure just attach mail handlers to task fail events. This works fine with retries, because a task that fails does not actually enter the failed state (and hence trigger the failed event) until it has run out of retries.
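For example, a minimal sketch in Cylc 7 `suite.rc` syntax, where the retry delays and the email address are placeholder values and not taken from the original suite:

```ini
[runtime]
    [[run<member>]]
        [[[job]]]
            # Example only: retry up to three times, five minutes apart.
            execution retry delays = 3*PT5M
        [[[events]]]
            # Email only on the failed event, i.e. once retries are exhausted.
            mail events = failed
            # Placeholder address (an assumption).
            mail to = user@example.com
```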

Interesting, your response makes me think that there is another way to write my suite. Currently I have:

    [cylc]
        [[parameters]]
            member = 0..10
    [scheduling]
        initial cycle point = 19990802T0000Z
        final cycle point = 19990816T0000Z
        [[dependencies]]
            [[[R1]]]
                graph = "get_data => run<member>:finish => st_archive<member>"
            [[[R/P1W]]] # Weekly Cycling
                graph = """
                    st_archive<member>[-P1W] => get_data => run<member>
                    run<member>:finish => st_archive<member>
                """

I tried to use the finish-all as you suggested, but got an error:
ERROR, family trigger on non-family namespace run_member04:finish-all

How would I recast my suite.rc in terms of a family?

Hi Jim,

By the way, if you enclose your code snippets in triple backquotes - ``` - the formatting will be retained (I edited your post to insert the quotes).

Also, you don’t need the R1 graph section above, as the dependencies in it are the same as in the general R/P1W section (which starts at the initial cycle point anyway).

Family triggers like succeed-all and finish-all can only be used on family names, not on individual tasks or family members (and run<member> is an individual task, for any given value of member).

You can easily put all of your ensemble members in a family:

    [runtime]
        [[RUN]]
        [[run<member>]]
            # ...
            inherit = RUN

If your ensemble is generated by parameter expansion you don’t really gain much by using family triggers in the graph though. These two lines are equivalent, and pretty much equally concise:

RUN:finish-all => postproc
run<member>:finish => postproc 

There’s also a complication in your workflow: run<member> => st_archive<member> is a one-to-one dependence (run_1 => st_archive_1, run_2 => st_archive_2, etc.). You can’t make an ARCHIVE family and do RUN:finish-all => ARCHIVE (say), because that makes every member of ARCHIVE depend on every member of RUN finishing, which is all-to-all dependence. Families might still be useful for inheritance of runtime settings and for visualization (collapse/expand in the GUI and cylc graph), but you might want to stick to individual member triggers in the graph, or perhaps use some family triggers and some member triggers (the latter for the one-to-one case above).
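A sketch of what mixing the two kinds of trigger could look like, assuming Cylc 7 `suite.rc` syntax and the `member` parameter defined as in your suite. The `postproc` task is a hypothetical downstream task included only to show where a family trigger fits:

```ini
[scheduling]
    [[dependencies]]
        [[[R/P1W]]]
            graph = """
                st_archive<member>[-P1W] => get_data
                # Family on the right: get_data triggers every member of RUN.
                get_data => RUN
                # One-to-one: each archive task waits only for its own member.
                run<member>:finish => st_archive<member>
                # All-to-one: hypothetical postproc waits for the whole family.
                RUN:finish-all => postproc
            """
[runtime]
    [[RUN]]
    [[run<member>]]
        inherit = RUN
```

Here the member trigger keeps the archive step one-to-one, while the family trigger is used only where all-to-one dependence is actually wanted.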