Variable length task parameterisation

I have used cylc for a couple of years now to run variations of a ground motion simulation workflow, in which we simulate a number of events, each with a number of realisations. Realisations of the same event are generally independent but share, in particular, a common velocity model. Ideally the workflow graph would look something like:

generate_velocity_model<event> => stochastic_part<event, realisation> & simulate<event, realisation> => combine<event, realisation> => upload<event, realisation> => cleanup<event>

If, for example, we have 3 events each with 10 realisations, this workflow works beautifully using task parameters. The trouble is that we often have cases where one event has 10 realisations and another has just one, and there the naive task parameter approach falls over. I have worked around this by using Jinja2 templates to generate the entire workflow graph in one go, which works but destroys readability and is inelegant. Is there an elegant way to implement this sort of workflow in cylc? I imagine it could be achieved without Jinja2 by breaking the workflow up into several separate workflows, but from an operational standpoint that is a little annoying (e.g. we would lose the ability to cylc stop a single workflow to stop all events related to a particular scientific outcome).
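For context, the naive task-parameter version that works when every event has the same number of realisations looks roughly like this (a minimal sketch using the task names from the graph above):

```
[task parameters]
    event = a, b, c
    realisation = 1..10
[scheduler]
    allow implicit tasks = True
[scheduling]
    [[graph]]
        R1 = """
            generate_velocity_model<event> =>
                stochastic_part<event, realisation> & simulate<event, realisation> =>
                combine<event, realisation> =>
                upload<event, realisation> =>
                cleanup<event>
        """
```

This expands every `<event, realisation>` combination uniformly, which is exactly why it breaks down when different events need different realisation counts.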

If I’ve understood you correctly, I think you could use Jinja2 to create a new parameterized graph for each set of events with the same number of realizations (i.e. use Jinja2 to generate the native task parameters, rather than the whole graph). That’s probably more elegant than doing it all with Jinja2.

Example for 2 events (a, b) with 3 realisations, and 1 event (c) with 1 realization:

#!Jinja2

{# dictionary {n_realisations: [list of event names] } #}
{% set cfg = {
        3: ["a", "b"],
        1: ["c"],
    }
%}

[scheduler]
    allow implicit tasks = True
[task parameters]
{% for k, v in cfg.items() %}
    event_{{k}} = {{', '.join(v)}}
    realisation_{{k}} = 1..{{k}}
{% endfor %}

[scheduling]
    [[graph]]
        R1 = """
{% for k in cfg.keys() %}
            # graph for all events with {{k}} realisations:
            generate_velocity_model<event_{{k}}> =>
                stochastic_part<event_{{k}}, realisation_{{k}}> & simulate<event_{{k}}, realisation_{{k}}> =>
                combine<event_{{k}}, realisation_{{k}}> =>
                upload<event_{{k}}, realisation_{{k}}> =>
                cleanup<event_{{k}}>
{% endfor %}
        """
[runtime]

I also added this to make the task names shorter:

[task parameters]
    [[templates]]
{% for k in cfg.keys() %}
        event_{{k}} = _e%(event_{{k}})s
        realisation_{{k}} = _r%(realisation_{{k}})s
{% endfor %}

Result: (screenshot of the expanded workflow graph omitted)
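With those templates, the parameterised tasks should expand to names like the following (my reading of the template strings, not taken from the original post):

```
generate_velocity_model_ea, generate_velocity_model_eb, generate_velocity_model_ec
stochastic_part_ea_r1 ... stochastic_part_ea_r3
stochastic_part_eb_1 ... stochastic_part_eb_r3  (and similarly for the other tasks)
stochastic_part_ec_r1
```

i.e. the `_e%(...)s` and `_r%(...)s` templates replace the default, much longer `_event_3a`-style suffixes.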

Ah interesting, I hadn’t thought about modelling it as a dictionary of level sets - thanks! Two further questions on which I would appreciate your input:

  1. On some clusters we deploy cylc to, I have artificially introduced a runahead limit to reduce storage usage on the cluster, something like: cleanup<event-10> => generate_velocity_model<event>. This way the cleanup scripts can free space for the next batch of simulations. Because the events now exist as level sets, I gather this idea no longer works?
  2. In our largest workflows, looking at the New Zealand national seismic hazard model, we expect tens of thousands of realisations in total, with some high-hazard events having about a hundred realisations each. We haven’t used cylc for this (the last time we ran it, we hadn’t yet fully adopted cylc), but I anticipate doing so later this year. At that scale, should I consider something like your subworkflows idea, or try to turn the above into a cyclic workflow, or is cylc perfectly happy managing several hundred thousand tasks in a single workflow? My main concern with that approach is being able to control how many jobs are queued at any given time.

Question 1:

I don’t quite understand this question. Runahead limit has a specific meaning in Cylc: in a cycling workflow, it restricts how far (in terms of cycle points) the “fastest” tasks (e.g. data retrieval tasks at the top of a cycle) can get ahead of the “slowest” tasks.

Your original post doesn’t mention cycling, and your example here is parameter (i.e. task name) based with no cycling evident either.

So perhaps you could clarify what you mean by “runahead limit”, and explain how cycling is used (or not) in your workflow?

Question 2:

This really depends on what is meant by “several hundred thousand tasks in a single workflow”.

Cylc can handle an arbitrarily large number of task instances (cycle-point/task-name instances) in a cycling workflow, but so many individual task names and definitions would likely not fly (whether generated by parameter expansion or not). I’m not aware of anyone trying anything that big, and I’m guessing it’s way too much.

Cycling is dynamic, so you can run as many tasks as you want by running as many cycles as you want (and in Cylc, you can run as many cycles as you want concurrently - up to the configurable runahead limit) if the dependencies allow that.

So instead of using parameters to generate ALL the tasks for the entire run in a single non-cycling graph (if that is indeed what you are doing?), you could reformulate it as a cycling problem, using integer cycles to process the events, say.
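For example, a sketch of that reformulation with one integer cycle per event (the realisation count, cycle range, and the mapping from cycle point to event are assumptions here, not from the original workflow):

```
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    final cycle point = 100    # e.g. 100 events, one per cycle
    runahead limit = P5        # at most ~5 events "in flight" at once
    [[graph]]
        P1 = """
            generate_velocity_model =>
                stochastic_part<realisation> & simulate<realisation> =>
                combine<realisation> =>
                upload<realisation> =>
                cleanup
        """
[task parameters]
    realisation = 1..10
```

The runahead limit then does natively what the artificial cleanup<event-10> dependency was doing by hand: it caps how many events can be processed concurrently, so cleanup can free disk space before later cycles start.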

Tens of thousands of realizations in total also sounds pretty extreme, so it may be a good case for multiple workflows, I think: e.g. a single workflow for each event, within which you cycle over the realizations (running as many cycles concurrently as your system can handle).

Interesting use case!

More thoughts:

  • It looks like you’re putting a lot of disconnected graphs into a single workflow. These could just as well be separate workflows, especially if the uber workflow becomes unwieldy.
  • The primary case for sub-workflows is dynamic graph structure (if a task at run time determines the structure of a downstream sub-graph, and that structure can’t be known at start-up, make the sub-graph a sub-workflow). It doesn’t look like you have this need.
  • Monitoring and controlling many workflows at once is not a reason to use sub-workflows - the Cylc web UI shows all your workflows on the left, with a summary of activity in each, so you can see them all and click into them as needed. (And it’s easy to start and stop many workflows at once, e.g. by scripting the CLI).
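For instance, stopping a whole family of related workflows from the CLI might look something like this (a sketch; the `nshm_` name prefix is a hypothetical naming convention, and I believe `cylc scan --format=name` is the right incantation in Cylc 8):

```shell
# Stop every running workflow whose name starts with "nshm_"
# (hypothetical prefix for all workflows in one scientific campaign).
for wf in $(cylc scan --states=running --format=name | grep '^nshm_'); do
    cylc stop "$wf"
done
```

So a consistent workflow naming convention gives you back the "stop everything related to one scientific outcome" operation you mentioned losing.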