Best practice for on demand systems?

Hi all,

I thought it would be good to get some advice from around the world on this topic. How do other people do on-demand models in Cylc, where an external something (message? file?) triggers the suite to run?

First, I should say I (so far) think of two types of on-demand systems:

  1. Completely on demand, can run anytime anyday, and it is not linked to any particular cycle point
  2. On demand within a window, and tied to a cycle point/basetime

Type 1

I can see several ways of handling this, depending on how the infrastructure and security are set up:

  1. Cycle that runs every X minutes. First task checks for a file in a remote location. If file exists, cycle runs, otherwise it suicides the rest of the tasks. I presume this needs to be a datetime cycling suite, as I don’t think integer cycling suites can have time triggers?
    pro: Good visibility that it is doing something. Uses a cyclepoint which is fairly standard for all other suites we use
    con: more overhead, arbitrarily tied the suite to a cyclepoint which has no meaning
  2. Integer cycling suite. Xtrigger which polls for a file or event from a message broker. Same xtrigger will just continue indefinitely until one is found. This could take days, weeks, or months to be satisfied. I’m working on the assumption xtriggers don’t expire (it would be nice if you could make an xtrigger expire by the way, but the implementation of that is a bit awkward to do well - I certainly don’t have a vision of how to do that in a generic way).
    pro: less overhead in the suite
    con: limited visibility to people monitoring that the suite is doing anything
  3. Integer cycling suite. Launches a task somewhere which can check the status of something else, fail/retry approach.
    pro: I can’t really think of any
    con: I don’t think you can fail/retry indefinitely. I personally really dislike the fail/retry mechanism for polling as it just makes the suite always look like it’s in a state of failure

Are there other approaches I haven’t considered? Have people tried different approaches like above and have thoughts on which worked best? I have used approach 1 before, but I do find it clunky.

Type 2

Slight variations on the above. All instances assume a datetime cyclepoint suite.

  1. One cyclepoint at each basetime that the suite is linked to. Long lived polling task looking for file/event.
    pro: Really easy to implement
    con: Sysadmins tend to hate long lived polling tasks
  2. One cyclepoint at each basetime that the suite is linked to. Fail/retry mechanic to poll for files. If you alert upstream about no triggering file being sent (in case they forgot), add in a task with a wallclock which gets removed from the suite if it isn’t needed.
    pro: Simple to implement. Won’t annoy sysadmins too much
    con: I don’t like fail/retry mechanics for polling
  3. One cyclepoint at each basetime that the suite is linked to. Create an xtrigger to poll for the event/file. Add in an expiry offset to the xtrigger, which will succeed after a certain time with a dictionary construction that the next task will check to determine whether to run anything in the cycle or not. Add an alerting task as above.
    pro: Polling gets masked behind the scenes. Shouldn’t annoy sysadmins too much.
    con: I can’t think of too many off of the top of my head. Assumes access from the Cylc server to the message or file. A bit more awkwardness in the xtrigger, but not much.
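
To make option 3 concrete, here is a minimal sketch of an xtrigger function with an expiry. It assumes the usual xtrigger contract of returning a (satisfied, results) tuple. The function name, the arguments, the "found" key, and the cycle point format are all my own assumptions for illustration, not anything Cylc provides. The expiry is derived from the cycle point passed in, since the function keeps no state between calls.

```python
import os
from datetime import datetime, timedelta, timezone


def file_or_expire(path, point, expiry_hours, now=None):
    """Hypothetical xtrigger: satisfied when `path` exists OR the expiry passes.

    `point` is assumed to be a cycle point string such as "20200101T0000Z".
    `now` exists only so the function can be tested with a fixed clock.
    """
    now = now or datetime.now(timezone.utc)
    base = datetime.strptime(point, "%Y%m%dT%H%MZ").replace(tzinfo=timezone.utc)
    deadline = base + timedelta(hours=expiry_hours)

    if os.path.exists(path):
        # The triggering file arrived: downstream tasks should run.
        return True, {"found": True, "path": path}
    if now >= deadline:
        # Expired: still report "satisfied" so the cycle can move on, but
        # flag that nothing arrived so the next task can skip the real work.
        return True, {"found": False, "path": path}
    # Neither condition met yet: the scheduler will call again later.
    return False, {}
```

The next task in the cycle would then check the "found" value to decide whether to run anything or just let the cycle complete.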

So, from the ideas I’ve got above, 3 makes the most sense to me, then the other two are a bit of a toss up.

Ok, that is long enough I think. Any advice that people have from their experience would be greatly appreciated.

  • Tom

Hi Tom,

There’s a lot to comment on and reply to there, I’ll have a crack at it as soon as I have some spare minutes.

For now, one major thing to note: we will have better ways of doing this sort of thing in Cylc 8 thanks to Python 3’s asyncio (we can ditch the cylc 7 xtrigger execution model of polling via sub-processes).

Hilary

Hi Tom,

Type 1 (any time any day)

  1. Cycle that runs every X minutes …

I don’t like this approach. For one thing, removing all the tasks by suicide trigger most of the time is nasty. For another thing, you’re essentially pretending that the date-time cycle is a real-time schedule when it isn’t. If it gets behind the clock for any reason (e.g. the host VM goes down) Cylc will automatically run cycles as quickly as possible to catch up again, during which time you’re checking for the remote file too frequently. See the previous post on how to hold Cylc back from cycle interleaving, and whether or not we should support a simple real-time scheduling mode. Perhaps a better way to do this sort of thing would be to use cron to launch a non-cycling Cylc suite at regular intervals.

  2. Integer cycle with xtrigger that returns when the file appears…

To me, this is a good approach. Integer cycling for arbitrary repetitive processes, and the xtrigger implements dependence on the outside world. However I agree that “visibility” in suite monitoring is a problem in Cylc 7. In Cylc 8 we are going to make xtriggers visible even in non-graph views. Yes we should allow xtriggers to abort (/expire), but that’s not implemented yet: https://github.com/cylc/cylc-flow/issues/3232#issuecomment-514899912
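
For readers who haven’t written one, a file-watching xtrigger function can be very small. This is only a sketch: the function name and the contents of the results dictionary are illustrative assumptions, and the only thing relied on is that an xtrigger function returns a (satisfied, results) tuple.

```python
import os


def file_ready(path):
    """Hypothetical xtrigger function: satisfied once `path` is a regular file."""
    if os.path.isfile(path):
        # The results dictionary is arbitrary; here it carries the path and
        # size so that a downstream task could use them.
        return True, {"path": path, "size": os.path.getsize(path)}
    # Not there yet: the scheduler will call the function again later.
    return False, {}
```

Each call is a single cheap check; the repetition is left to the scheduler rather than to a loop inside the function.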

  3. Integer cycle with a polling task…

This was the recommended way to trigger tasks off of the external world before xtriggers. Get a task to poll for the external condition, and then trigger others off of that task’s success or failure. You can either have a single long-running task that polls repeatedly until the file is found, or have Cylc do the polling by repeated job fail-and-retry.

Contrary to your “cons” comment, you can have unlimited retries, and it’s not really true that “the suite always looks like it is in a state of failure” because tasks don’t go to the failed state unless they’ve run out of retries.

xtriggers are meant to be a more elegant approach, or at least more in keeping with the idea of a “trigger” being something that should be managed by the scheduler and not manually implemented in task scripting.

Are there other approaches I haven’t considered?

As suggested above, you could run a non-cycling Cylc suite repeatedly on a cron schedule? (The launch script could register it by a different time-stamped name each run so previous run directories don’t get clobbered).

Another idea is a long-running xtrigger function that does not return at all unless the condition is satisfied. For this you have to be aware of the scheduler sub-process pool timeout (default 10 minutes, I think) after which still-running xtrigger functions will get killed, but the timeout is configurable.
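
A rough sketch of that idea follows, with one deliberate variation: rather than blocking forever, the function gives up shortly before an assumed timeout and reports “not satisfied”, so it is never killed mid-wait and simply gets called again. The function name, the 540-second budget, and the 1-second inner poll are assumptions chosen for illustration; they are not Cylc defaults.

```python
import os
import time


def wait_for_file(path, max_wait=540.0, poll=1.0):
    """Hypothetical long-running xtrigger: block until `path` exists.

    Returns unsatisfied after `max_wait` seconds so that the call ends on
    its own terms, before an (assumed) 10 minute sub-process pool timeout.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if os.path.exists(path):
            return True, {"path": path}
        if time.monotonic() >= deadline:
            return False, {}
        time.sleep(poll)
```

If the pool timeout is reconfigured, `max_wait` would need to be kept below the new value.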

So, there’s lots of options that work. None of them are exactly pretty, but Cylc 8 should fix that!

Type 2

All of your listed options seem more or less fine to me, for this case. Some of your cons are subjective (“sysadmins tend to hate long-lived polling tasks” - well, so what, so long as those tasks don’t use much resource and the sysadmins don’t actually kill them on you; “I don’t like fail/retry mechanics” - users just need to know that “retrying” does not imply an outright failure, it’s a failure that is being handled).

So, those are my opinions. Others who actually work with these kinds of workflows might have more to say about it.

Hilary

Sorry, that’s a bit misleading. There’s no limit to the number of retries you can configure, but you do have to specify a number.

Thanks Hilary. I thought there would be a limit of 99 retries (100 tries) given the log directories are zero padded to two places. I guess part of my concern here is thinking about the extreme case for a suite that is very time critical, but only runs a few times a year. It could be polling for months before it runs and cleans up old log files.

I think xtriggers will be the best option if our cylc VMs can have access to everything they need to poll. I don’t think xtriggers fill up the suite logs by default do they?

And yes, I agree many of my cons are subjective. Sometimes I just felt like I needed to list something. The sysadmin comment applies more to anything that needs to poll for data from within the HPC environment, i.e. something that submits a job via PBS and then sits there taking up a CPU on the HPC. It doesn’t apply to tasks running on the cylc VM itself.

Hi Tom, Hilary,

Are there other approaches I haven’t considered?

What about Old-Style External Triggers? I know the docs currently say these are deprecated but Hilary assured me this is not actually the case, and that an asyncio based re-implementation will be added soon (to Cylc 8?) – @hilary.j.oliver I trust this is still the case?

It depends on whether you have the ability to modify the system creating the external event to be triggered off. But if you can then to me having the external system simply send a message to the suite saying that the waited on event has occurred is the cleanest solution. Using an xtrigger to repeatedly check for the occurrence of the event (essentially polling) represents an (undesirable in my opinion) inversion of control. And the current sub-process based approach seems quite heavy (as I understand it).

The original external trigger feature also has the advantage of guaranteeing the lowest latency – the suite will be triggered the next time the scheduler passes through the main event loop after the remote trigger message is received (< 1 sec). Using xtriggers to poll for external events will always involve a trade-off between response time and acceptable system overhead (current default xtrigger poll rate is 10 secs) – although hopefully the new asyncio based implementation will be a bit leaner here.
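
To put rough numbers on that trade-off, here is a back-of-envelope sketch. It assumes each check is instantaneous and that the event is equally likely to arrive at any moment within a polling interval; the function name is just for illustration.

```python
def polling_cost(interval_s, wait_s):
    """Estimate latency and number of checks for simple fixed-rate polling."""
    return {
        # The event can arrive just after a check, so it may wait a full interval.
        "worst_case_latency_s": interval_s,
        # On average the event arrives halfway between two checks.
        "mean_latency_s": interval_s / 2,
        # Number of checks made while waiting `wait_s` seconds for the event.
        "polls": wait_s // interval_s,
    }
```

For example, `polling_cost(10, 86400)` gives 8640 checks for one day of waiting at the 10 second rate, with a mean latency of 5 seconds.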

A further issue with polling arises if the waited for event is the arrival of some new input data file, since for non-trivial file sizes it can be hard to determine when the file transfer has completed and the data is ready to be processed. Of course there are common tricks for this – renaming at the end of transfer, touching an empty file once the transfer completes, writing a hash file, etc – but most require co-operation from the sending end. And if the sender is prepared to co-operate in this way perhaps they can be persuaded to send an external trigger message? (Of course there may be security implications here if the sender is external to your organization, but there are probably workarounds…)
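
Two of those tricks, sketched from the receiving side. The “.done” and “.sha256” suffixes and the function names are assumptions; the real convention has to be agreed with whoever sends the data.

```python
import hashlib
import os


def transfer_complete(data_path, marker_suffix=".done"):
    """True once both the data file and its empty marker file exist."""
    return os.path.isfile(data_path) and os.path.isfile(data_path + marker_suffix)


def checksum_matches(data_path, hash_suffix=".sha256"):
    """True if the data file's SHA-256 matches the digest in the sidecar file."""
    hash_path = data_path + hash_suffix
    if not (os.path.isfile(data_path) and os.path.isfile(hash_path)):
        return False
    with open(hash_path) as f:
        tokens = f.read().split()
    if not tokens:
        return False  # sidecar file exists but is still empty
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == tokens[0].lower()
```

Either check could sit inside a polling task or an xtrigger function in place of a bare “does the file exist” test.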

I did recently experiment with a long-lived xtrigger function that waited for messages from a rabbitMQ exchange, the messages being sent by the external system (for new data arriving from our satellite receivers in this case). But concluded that all I was really doing was re-implementing the functionality already provided by the original external trigger mechanism, so I abandoned that approach in favour of advocating for the retention of the original feature – which turned out to be the plan anyway :wink:

(Yes, there would be some decoupling advantages by using a message broker like this – messages are sent to the broker, not specific Cylc suites, and multiple suites can wait on the same event without the producer having to know about them – but for now I decided this wasn’t worth the additional complexity. Of course I might change my mind in the future…)

Simon

Hilary,

As suggested above, you could run a non-cycling Cylc suite repeatedly on a cron schedule?

I’d be wary of suggesting a cron-based solution in an HPC environment. On our machines where a node reboot is essentially an Ansible redeployment, crontabs do not persist through reboots and so require manual intervention to restore (as I know to my cost :wink: ). Last I heard our HPC sysadmins didn’t have a solution for this other than “please don’t use cron”.

Simon

My feeling exactly. That’s why I keep lobbying for better support for a PUB-SUB model for task triggering. The good news is that with the work going into UI-cylc server communication this could very well be possible.

Thanks Tomek,

Useful to see the previous discussions. I’d also be very keen to see native support for pub/sub style triggering in Cylc, especially for asynchronous dataflow based workflows, which is what I mostly work with.

S

Thanks for the response. In my case, there are some instances where we have control of the upstream suite/process/data delivery. But, in many cases it’s not possible. The only way we can tell is by the existence of a file, so polling in some way is the only option I really have.

If I’m reading your first paragraph correctly, are you suggesting having the upstream system know about its downstream dependencies? I’ve dealt with this approach before, and I find it can lead to some annoyances itself. But, that is perhaps more related to a convoluted approval and deployment procedure such that even small changes usually take a few days to get done. As such, I am more inclined towards upstream systems not knowing about any downstream users. This also works with the approach we are tending towards of using message brokers for sending information between suites.

Hi all,

For the record, I agree with @slw’s points on the downsides of polling solutions, and we have in fact un-deprecated Cylc’s old-style non-polling external trigger mechanism, at least until it is replaced with better things that are now (more easily) possible in Cylc 8 thanks to Python 3.

However, as a matter of practical reality, almost no one used the old-style external triggers because of the fact that the upstream system has to be modified to send trigger messages to the specific downstream system, like for @TomC’s case, and consumers of data from upstream workflows (typically researchers using operational data, for instance) very often have no control over the upstream system.
Polling, though inelegant and less efficient, can be achieved by the downstream system alone, and for most use cases the relative inefficiency is of absolutely no practical consequence. That is why almost everyone used custom “polling tasks” for external triggering in Cylc for ages, and now the newer xtrigger mechanism that internalizes that sort of interaction with the external world. (However, as I said, we will continue to support the old non-polling external triggers until replaced by something better in Cylc 8).

Hilary

(I feel obliged to point out, for users out there who may not be aware of the debate, that their use of polling external triggers is not going to cause any kind of disaster!).

Hi Tom,

Yes, there can be downsides to downstream coupling (especially where there are multiple downstream systems depending on the data, or where the systems concerned are not closely related in other ways), and I’d tend to agree that if direct coupling cannot be avoided then upstream coupling is often preferable (knowing where your input data comes from seems somehow less bad than being expected to know about all the systems that are using your outputs). But in the case of processing asynchronous, real-time events where the event can occur at effectively random times and must then cause processing to start quickly with minimal delay, I would argue that having the upstream system signal the downstream one that there is something to do is preferable to having the downstream system poll the upstream system (possibly for many hours at a time).

I guess I’m also assuming here that the upstream system is not itself another cylc suite and so upstream suite-state triggering is not an option. But I think that was implied by your original question.

I’m also only thinking of this as a good solution for closely related systems – in my case it is effectively our direct broadcast satellite receivers which are triggering processing suites for the satellite data at the end of each satellite overpass (well it’s not the receivers directly, there’s an ingest process in between that receives the data directly from the receivers, but I’m sure you get the idea).

Reducing or removing direct coupling altogether through the use of a message broker and the pub/sub pattern would definitely be cleaner, especially for data driven workflows, but it does add complexity – you’ve now got a message broker in your system and you have to implement the required producer and consumer code to interface to it (as an xtrigger). But maybe this will be easier with Cylc 8’s asyncio based event loop.
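
To illustrate the decoupling (not the real plumbing), here is a toy in-process sketch of the pattern. The Broker class and check_for_message function are hypothetical stand-ins; a real setup would use a broker’s client library instead, and none of this is Cylc or RabbitMQ API.

```python
from collections import defaultdict
from queue import Empty, Queue


class Broker:
    """Toy message broker: producers publish to topics, not to consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new consumer on `topic` and return its private queue."""
        q = Queue()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic, message):
        """Deliver `message` to every consumer currently subscribed to `topic`."""
        for q in self._subscribers[topic]:
            q.put(message)


def check_for_message(q):
    """xtrigger-style consumer check: (satisfied, results) without blocking."""
    try:
        return True, {"message": q.get_nowait()}
    except Empty:
        return False, {}
```

The producer only names a topic, so any number of suites can subscribe to the same event without the producer knowing they exist.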

S