Having trouble understanding the mechanics behind the "late" event

Hi all,

As we migrate more of our suites to Cylc, it’s becoming increasingly difficult to keep an eye on them and ensure they start when they should. This means the only way we can tell something is wrong is if the cycle points are behind the wall clock (depending on the suite). With dozens of suites, each with 10+ cycle points loaded, the gscan interface gets choppy and long (i.e., difficult to read).

A potential solution was to send an email on late events, but what we’re finding is that the “late offset” setting doesn’t seem to be respected.

I have it set to PT12H because most of our suites cycle every 12 hours (T00/T12 graphs), but tasks will naturally run sometime in that 12-hour window depending on what triggers them. With a “late offset” of PT12H, I would expect tasks not to trigger the late event until the next 12-hour cycle point, or am I misunderstanding the mechanic?

Right now, I have tasks in 20210121T1200Z kicking off late events but I wouldn’t expect them to do that unless they hadn’t run yet by 20210122T0000Z…

Does that make sense?

Thanks!

I think you’re understanding it correctly.
We need to try to find out what’s going wrong.
I suggest you have a look in the log/suite/log file.
For each late event you should see something like:

2021-01-22T09:30:00Z WARNING - [late.20210122T0929Z] -late (late-time=2021-01-22T09:30:00Z)

This confirms which “late-time” is being used (1 minute after the cycle point in this case).
You can use cylc get-config to confirm what is configured for that task:

$ cylc get-config --sparse -i '[runtime][late]' "$(basename "$PWD")"
script = true
[[[events]]]
    late offset = PT1M
    mail events = late

It all seems to be working as expected in my simple test case.
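To make the arithmetic explicit, here is a minimal Python sketch of how I understand the late-time to be derived: the cycle point plus the “late offset”. This is not Cylc's own code. The helper names (`parse_cycle_point`, `late_time`, `is_late`) are hypothetical, the offset is passed as a `timedelta` rather than parsed from an ISO 8601 string like PT12H, and the strict `>` comparison is an assumption.

```python
from datetime import datetime, timedelta, timezone


def parse_cycle_point(point: str) -> datetime:
    """Parse a cycle point such as '20210121T1200Z' into a UTC datetime."""
    return datetime.strptime(point, "%Y%m%dT%H%MZ").replace(tzinfo=timezone.utc)


def late_time(point: str, late_offset: timedelta) -> datetime:
    """Late-time = cycle point + late offset (illustrative helper, not a Cylc API)."""
    return parse_cycle_point(point) + late_offset


def is_late(point: str, late_offset: timedelta, now: datetime) -> bool:
    """True once the wall clock has passed the late-time (strict '>' is an assumption)."""
    return now > late_time(point, late_offset)


# The case from the original question: a T12 cycle point with a 12-hour offset.
print(late_time("20210121T1200Z", timedelta(hours=12)).isoformat())
```

Under these assumptions, a task at 20210121T1200Z with a PT12H offset would not be considered late before 2021-01-22T00:00Z, which matches the expectation in the original post. If the late-time in your log differs from this, the configured offset is probably not the one being applied.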

Hi @russbnavy

In Cylc 7 the “lateness” of a task is determined by checking the wall clock time against the cycle point of the waiting task object in the scheduler. This works because Cylc 7 spawns waiting tasks ahead before they are needed (to track satisfaction of their own prerequisites).

Cylc 8 (we’ll be making the first beta release soon) has a more efficient cycling scheduling algorithm that does not spawn waiting tasks before they are needed. This is 99% good news on many fronts BUT it does mean that late events as currently implemented only trigger once the late task actually starts running (Cylc 7: I should have started running by now, but I haven’t; Cylc 8: I’ve started running later than expected … which is less useful, particularly if you’re concerned that a late task might never run because of missing prerequisites).
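The difference between the two behaviours can be sketched roughly in Python as below. This only illustrates the description above and is not taken from either scheduler's code; the function names and the `started_at` argument are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def late_while_waiting(now: datetime, late_time: datetime,
                       started_at: Optional[datetime]) -> bool:
    """Cylc 7 style, as described above: a waiting task that has not
    started is late once the wall clock passes its late-time."""
    return started_at is None and now > late_time


def late_on_start(late_time: datetime, started_at: Optional[datetime]) -> bool:
    """Cylc 8 style, as described above: lateness can only be noticed
    at the moment the task actually starts running."""
    return started_at is not None and started_at > late_time


late_at = datetime(2021, 1, 22, 0, 0, tzinfo=timezone.utc)
now = datetime(2021, 1, 22, 6, 0, tzinfo=timezone.utc)

# A task that never starts (e.g. missing prerequisites):
print(late_while_waiting(now, late_at, started_at=None))  # flagged while still waiting
print(late_on_start(late_at, started_at=None))            # never flagged
```

The second check never fires for a task that never starts, which is exactly the case where a late alert would be most useful.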

So, it remains to be seen whether or not we can make useful late events in Cylc 8. (We just realized this problem when considering your post above; late events are not widely used so they didn’t feature very prominently on our radar :grimacing:).

Something else to consider is that built-in late events can only work if the scheduler is up and running - an important task could be late because its parent suite was killed. For that reason (I think) NIWA operations doesn’t use Cylc late events. Instead we configure an IT infrastructure monitoring tool (Icinga2) with the expected start times of important tasks, and it alerts operators if it doesn’t receive timely messages sent on task started events. From our operations wizard (also on the Cylc team) @dwsutherland :

We use Icinga2 for our alerting (alongside Cylc’s immediate alerts) via event hooks… I made a script to interrogate the workflow, work out the interval between runs, dependencies, and then spit out the configuration for Icinga… We use a heart beat sent on success, and if the heart beat isn’t received by the threshold (interval+buffer), we get some level of alert
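The threshold rule in that quote (alert if no heartbeat arrives within interval + buffer) can be sketched in Python as follows. This is only an illustration of the rule, not Icinga2 configuration and not the script mentioned above; `heartbeat_overdue` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_heartbeat: datetime, interval: timedelta,
                      buffer: timedelta, now: datetime) -> bool:
    """True if more than interval + buffer has elapsed since the last heartbeat."""
    return now - last_heartbeat > interval + buffer


# Example: a task that runs every 12 hours, with a 1-hour buffer (assumed values).
last = datetime(2021, 1, 21, 12, 30, tzinfo=timezone.utc)
now = datetime(2021, 1, 22, 2, 0, tzinfo=timezone.utc)
print(heartbeat_overdue(last, timedelta(hours=12), timedelta(hours=1), now))
```

Because the check runs in the external monitor, it still fires when the scheduler itself is down, which is the point of doing it outside Cylc.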
