How to process a large number of datasets fast with Cylc

Cylc’s cycling capabilities aren’t just good for datetime cycling systems.

Say you have N datasets that need to be processed as quickly as possible (or you need to compute something for N different river catchments, or whatever).

(A) parameterized tasks

You could use task parameters to duplicate the sub-graph for each dataset (or catchment, or whatever):

#!Jinja2
[scheduler]
   allow implicit tasks = True
[task parameters]
   m = 1..{{N}}
[scheduling]
   [[graph]]
      R1 = "prep => process<m> => products<m> => upload<m> & archive<m>"

Here’s the graph for N=3. Each parameter value makes a distinct sub-graph (each with cycle point 1).

[graph image: parameterized workflow, N=3]
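
Since N is a Jinja2 template variable, you can supply it on the command line of any Cylc command that parses the workflow, e.g. (a usage sketch with Cylc 8; myflow is a placeholder workflow ID):

cylc graph myflow -s N=3      # plot the graph
cylc validate myflow -s N=3   # check the configuration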

(B) integer cycling

Or you could cycle over the datasets (or the river catchments, or whatever):

#!Jinja2
[scheduler]
   allow implicit tasks = True
[scheduling]
   cycling mode = integer
   final cycle point = {{N}}
   [[graph]]
      R1 = "prep"
      P1 = "prep[^] => process => products => upload & archive"

Here’s the graph for N=3. Now there’s only one set of tasks, but they’re repeated over N cycle points:

[graph image: cycling workflow, N=3]

Recommendation: if N is large, USE CYCLING!

Both approaches work, and if N is not too large then either one will do.

But if N is large, cycling is much more efficient.

(In fact, the same cycling workflow can process an arbitrarily large number of datasets at no extra cost.)

E.g. for N=2500:

  • The parameterized case is a massive graph of 10,000 distinct tasks. When prep finishes, a huge number of tasks become ready at once; the scheduler has to manage them all, and the UI has to display them all (and more), even though most of them won’t run any time soon thanks to internal queue limits and external resource constraints.

  • The cycling graph has only 4 distinct tasks per cycle, and the scheduler extends the graph dynamically to future cycles at run time. It (and the UI) only needs to manage the active cycle points. And with no dependencies between cycles it runs many cycles concurrently, out to the configurable runahead limit - so cycling need not restrict job throughput.

(Addendum: of course, if you set the runahead limit to 2500 cycles then the two methods become more or less equivalent. But if you don’t have an entire HPC to yourself then cycling, with an appropriate runahead limit, will make your life much, much easier.)
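
For reference, the runahead limit is just another scheduling setting (a minimal sketch; P10, i.e. letting cycles run up to 10 points ahead of the oldest active cycle, is an arbitrary choice):

[scheduling]
   cycling mode = integer
   runahead limit = P10  # cycles can run up to 10 points ahead of the oldest active cycle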


The difference between the two approaches is apparent whenever you parse the workflow config (validation, cylc list, cylc graph, etc.), not just at runtime. The parameterized graph is massive; the cycling one is a concise pattern for a graph that grows dynamically at run time.

I chose a small N=3 above so that I could show all the cycles and make the graphs look the same.
For N=2500 you can still plot (and even run) just the first 3 cycles of the cycling case, but the parameterized graph will be so big that Graphviz may run out of gas doing the layout; even if it doesn’t, you’ll need a very big screen to see it all.
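
For instance, cylc graph takes optional start and stop points, so you can plot just the first few cycles of the cycling case (myflow is a placeholder workflow ID):

cylc graph myflow 1 3 -s N=2500   # plot cycle points 1 through 3 only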

This view illustrates how the cycling graph gets extended dynamically (out to 3 cycles … 2497 to go!):

[graph image: cycling workflow extended out to 3 cycles]

My team considered Cylc for some workflows we have that fit what you are describing here. We finally chose to look elsewhere for now, one of the main reasons being the awkwardness of integer cycling when you are actually cycling over a list of datasets. We are still implementing many Cylc workflows where datetime cycling makes more sense.

Do you think it could be a feasible feature to add a third, more generic kind of cycling? For example, I could see something like “list cycling”. A basic version would be cycling through an ordered list of strings, like the task parameters example, but without the large-N problems. Example:

[scheduling]
   cycling mode = list
   points = riverA, riverB, riverC, riverD

Or even fancier, an “object” cycler, where elements are declared as a list of dictionaries.
Example:

[scheduling]
   cycling mode = objects
   points = cycling.json

# cycling.json
[
   {"id": "riverA", "basin": "Basin1"},
   {"id": "riverB", "basin": "Basin1"},
   {"id": "riverC", "basin": "Basin2"}
]

The environment variables CYLC_TASK_CYCLE_POINT_id and CYLC_TASK_CYCLE_POINT_basin would then be injected at runtime.

That’s not particularly awkward; you just need a mapping between integer cycle point and dataset ID.

Most obviously, perhaps, cycle from points 1 to N, where N is the number of datasets, and use the cycle point to index into the list of datasets.
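
For example, something like this (a minimal sketch; datasets.txt is a hypothetical mapping file with one dataset ID per line, line number corresponding to cycle point):

[runtime]
   [[process]]
      script = """
         # Hypothetical mapping: line M of datasets.txt is the dataset for cycle point M.
         dataset=$(sed -n "${CYLC_TASK_CYCLE_POINT}p" "${CYLC_WORKFLOW_RUN_DIR}/datasets.txt")
         echo "Processing dataset: $dataset"
      """

The same pattern works for the other tasks in the cycle (products, upload, archive).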

Absolutely feasible, and not difficult to implement. In fact we’ve long had this in mind, but haven’t got around to it (because doing it with integer cycling is easy enough). How to express inter-cycle dependence might be an issue, but for this kind of workflow there’s typically no need for that anyway.
