Revisiting sequential tasks

I’m revisiting sequential tasks in the context of setting up some cron-like suites to handle a few tasks. Specifically, I have a data processing application that runs at a particular cycling frequency, but when things get backed up, it may run for longer than the cycling interval (e.g. cycles PT5M but may take up to PT15M to run). Because each instance processes whatever data is present, the intervening cycles don’t need to run, so I expire them. This leads to a suite that looks like:

[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = now
    [[special tasks]]
        sequential = foo
        clock-trigger = foo(PT0M)
        clock-expire = foo(PT1M)
    [[dependencies]]
        [[[PT1M]]]
            graph = """ foo"""
[runtime]
    [[foo]]
        script = """ sleep 150 """

On its face, this seems to do exactly what I need but when executed, this doesn’t actually work. Task foo at the initial cycle point runs properly, downstream tasks expire properly, but once a task is expired every task thereafter ends up expired because the sequential line ends up translating to a dependency of foo[-PT1M]:succeeded => foo. It appears that the “sequential” logic does not handle intervening task expirations. This led me to the next “I only want one of these running at once” solution: an internal queue of size one.

[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = now
    [[queues]]
        [[[limit_1]]]
            members = foo
             limit = 1
    [[special tasks]]
        clock-trigger = foo(PT0M)
        clock-expire = foo(PT1M)
    [[dependencies]]
        [[[PT1M]]]
            graph = """ foo"""
[runtime]
    [[foo]]
        script = """ sleep 150 """

This “works”, except that in the real case (not this contrived example), I don’t have just one of these tasks, and the queue semantics means that I have to have a separate, single-member length-1 queue for each task.

Is there a better solution for this? Ideally the “sequential” configuration would handle this, but as I started piecing it out I couldn’t see an easy/fast solution to make that efficiently work. Perhaps someone here has better ideas.

The other option I considered would be an xtrigger that would check the suite to see if an instance of that task was already running:

[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = now
    [[xtriggers]]
        sequential = sequential(name=%(name)s)
    [[special tasks]]
        clock-trigger = foo(PT0M)
        clock-expire = foo(PT1M)
    [[dependencies]]
        [[[PT1M]]]
            graph = """ @sequential => foo"""
[runtime]
    [[foo]]
        script = """ sleep 150 """

where the “sequential” xtrigger function polls the current suite for all instances of task %(name)s and returns True if none of them are in a running state (e.g. submitted, running, retrying).

Hi Tim,

As you’ve discovered, the (ancient and deprecated) “sequential” setting is exactly equivalent to a previous-instance trigger in the graph. In fact the user guide says so, and recommends using the graph instead: https://cylc.github.io/doc/built-sphinx/suite-config.html?highlight=sequential#special-sequential-tasks. So that’s why sequential and clock-expire won’t do what you need (dependencies don’t magically transfer to the next task in line, on clock expiration).

Given that “each instance processes whatever data is present” the simplest solution might be integer cycling with previous-instance dependence. Each task job should just spin its wheels for a few minutes until sufficient data is available, then process the data and exit with success to make way for the next task. You could even use an xtrigger to poll for the data, to avoid the polling task. And the task or xtrigger function could even wait until at least the next clean 5 minute boundary, before checking for data, to pretty much replicate your 5 minute date-time cycling idea. Would that work?

Hilary

Thanks, Hilary. I’ve done some data processing before with integer cycles, but typically driven by another datetime cycling suite to handle the timing aspects.

In this case, I don’t know if that’s necessarily appropriate. I’m not sure I understand where the timing is coming from in an integer cycling suite, unless I implement cycling logic myself (which seems redundant given that “implementing cycling logic” is exactly what Cylc does!) inside the tasks.

Additionally, it’s not just a single task running on an interval. There’s typically a group of tasks, each running with their own interval (e.g. some at PT5M, some PT15M, some PT1H, etc.).

I’m a bit surprised that the sequential use case doesn’t come up more often (i.e. make sure subsequent task executions don’t conflict with their previous instances) - sometimes this is a matter of using cycle-point specific directories, but not always.

Finally, I agree that data-driven processing is the “correct” way to handle this - no need for timing at all. However, the infrastructure isn’t quite ready to support that yet so this is an intermediate step.