execution time limit = 15m
How do we grab that execution time limit and pass it on to the other script for a conditional check against age and time?
i.e. ${CYLC_TIME…}
Thanks.
Pass it to which “other script”? The job script that the limit applies to?
Shell script. It can be any script, but the main scope is a shell script, i.e. a .sh file.
I need to grab the execution time limit from flow.cylc and have it available in another script to compare against the time.
You can use the ‘cylc config’ command to extract any config item from the workflow config.
Example of Hilary’s answer for a case where taskB wants the execution time limit for taskA
ETL_taskA=$(cylc config -i '[runtime][taskA]execution time limit' ${CYLC_WORKFLOW_ID})
You could also see if your scheduler (pbs, slurm, etc) exposes the requested time limit, extract it from the scheduler metadata, or see if your admins can make it a variable via hooks or grab it from the job script with a grep.
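To sketch the "grab it from the job script with a grep" idea: the wallclock request ends up as a scheduler directive in the job script, so it can be parsed back out. A minimal sketch assuming a PBS-style walltime directive (a Slurm script would carry `#SBATCH --time=...` instead):

```python
import re

def walltime_from_job_script(text):
    """Parse a '#PBS -l walltime=HH:MM:SS' directive out of a job
    script and return the requested limit in seconds (None if absent).
    The directive format is an assumption; adjust for your scheduler."""
    match = re.search(r'walltime=(\d+):(\d{2}):(\d{2})', text)
    if not match:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

script = '#!/bin/bash\n#PBS -l walltime=00:15:00\necho hello\n'
print(walltime_from_job_script(script))  # 900
```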
On a related note though, we have done similar, but just got it from the qstat output I think, so it might be nice if Cylc did have it as a variable, in units of seconds.
You can use isodatetime to parse an ISO 8601 duration as a number of seconds, e.g.,
$ isodatetime PT1H --as-total S
3600.0
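For comparison, if the conversion is happening inside an already-running Python job, a rough hand-rolled parser avoids shelling out to the CLI. This is only a sketch covering the simple day/time durations Cylc typically writes, not full ISO 8601:

```python
import re

def iso_duration_to_seconds(duration):
    """Convert a simple ISO 8601 duration (e.g. PT1H30M, P1DT30S) to
    seconds. Date components beyond days (years, months) are
    intentionally unsupported - they have no fixed length."""
    match = re.fullmatch(
        r'P(?:(?P<days>\d+)D)?'
        r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?'
        r'(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?',
        duration,
    )
    if not match or not any(match.groupdict().values()):
        raise ValueError(f'unsupported duration: {duration}')
    parts = {k: float(v or 0) for k, v in match.groupdict().items()}
    return (
        parts['days'] * 86400
        + parts['hours'] * 3600
        + parts['minutes'] * 60
        + parts['seconds']
    )

print(iso_duration_to_seconds('PT1H'))  # 3600.0
```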
Yes, but then you have the overhead of python startup with lots of libs being loaded for something that cylc already had converted to seconds to put in the job script. If cylc were to expose it, it’s better to expose it in seconds.
@TomC - by “have it as a variable”, where do you mean exactly?
In the job script that the limit applies to?
I asked that question of @hollabigj above and they said “It can be any script” - in which case it’s not something that Cylc can do.
It already gets written to the job script (that it applies to) as (e.g.) a PBS directive in seconds.
Do you want Cylc to set it as a shell variable in that job script as well? If so, what’s the use case for that?
I read that as: they want a shell script, launched via Cylc, to be able to know the time requested.
How I perceive the potential workflow of using the number of seconds requested for a job:
I’m not saying the above design is ideal or optimal, but if it is a design pattern used (I do know of one system that does this), then having a CYLC_REQUESTED_EXECUTION_TIME_LIMIT (or similar) variable defined in the shell script would be useful for this limited use case.
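To make the use case concrete, here's a sketch of a job budgeting its own work against such a variable. CYLC_REQUESTED_EXECUTION_TIME_LIMIT is hypothetical (it does not exist in Cylc today), and the 60-second fallback and 5-second cleanup margin are made up for illustration:

```python
import os
import time

# Hypothetical variable - not provided by Cylc today.
LIMIT = float(os.environ.get('CYLC_REQUESTED_EXECUTION_TIME_LIMIT', '60'))
MARGIN = 5.0  # seconds reserved for clean shutdown
START = time.monotonic()

done = 0
for unit in range(1000):
    if time.monotonic() - START > LIMIT - MARGIN:
        print('Stopping early to finish cleanly within the limit')
        break
    done += 1  # stand-in for one unit of real work
print(f'Completed {done} units')
```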
Thanks @TomC - fair enough, I’ve never seen a request for that sort of thing before, but I guess it’s plausible.
Here are two solutions (both of which work right now):
Template The Variable
Simplest approach, make the ETL a Jinja2 variable and do with it what you want:
[runtime]
{% set ETL = 'PT30S' %}
    [[my_task]]
        execution time limit = {{ ETL }}
        [[[environment]]]
            REQUESTED_EXECUTION_TIME_LIMIT = {{ ETL | duration_as('s') }}
See also the duration_as filter.
Extract It From CGroups
EDIT: I might be wrong about this one, haven’t checked if ETL is implemented via CGroups in PBS / Slurm.
PBS, Slurm, etc use CGroups to manage job resources.
CGroup information including requested limits is available via the CGroup virtual filesystem (/sys/fs/cgroup). I don’t know which file/field off the top of my head, but it’s in there apparently.
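For what it's worth, here's what reading one of those cgroup v2 files could look like. Big caveat repeated: `cpu.max` holds the CPU bandwidth quota, not a wallclock limit, and whether PBS/Slurm expose the wallclock request via cgroups at all is unverified:

```python
import tempfile
from pathlib import Path

def read_cpu_quota(path):
    """Sketch: parse a cgroup v2 'cpu.max' file, which holds a
    'quota period' pair in microseconds. Returns the quota as a
    number of CPUs' worth of bandwidth, or None if unlimited."""
    quota, period = Path(path).read_text().split()
    if quota == 'max':
        return None  # no quota set
    return int(quota) / int(period)

# Demo against a fake file, since /sys/fs/cgroup needs a real cgroup:
with tempfile.NamedTemporaryFile('w', suffix='-cpu.max', delete=False) as f:
    f.write('200000 100000\n')
print(read_cpu_quota(f.name))  # 2.0
```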
We’re looking to extract this information in the near future for the purpose of resource allocation optimisation within Cylc (i.e., comparing requested wallclock and memory to utilised values).
I’ve never seen a request for that sort of thing before, but I guess it’s plausible.
I don’t think we’ve seen a request for this either.
I need to think over it more, I’m not sure if “execution time limit aware” jobs are a pattern or anti-pattern.
We do have a few workflows where the execution time limit is calculated (i.e., estimated) in the workflow definition or in tasks (via broadcast). This solves the problem the other way around: it's somewhat nicer on the batch system and saves messing around with the job, but you have to configure the calculation yourself. In the extreme case, we have a workflow where task runtime is very hard to predict; they solved this by highballing the ETL, then using an end-of-cycle task which monitors task runtime and reduces the ETL (via broadcasts) over subsequent cycles.
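The "monitor and reduce" step in that last example could be as simple as the sketch below; the safety factor and floor are illustrative tuning knobs, not values from the real workflow:

```python
def next_time_limit(observed_runtimes, safety_factor=1.5, floor=60):
    """Pick the next cycle's execution time limit (in seconds) from
    the runtimes observed so far, padded by a safety factor and never
    dropping below a floor."""
    if not observed_runtimes:
        return None  # nothing observed yet; keep the highballed ETL
    return max(floor, max(observed_runtimes) * safety_factor)

print(next_time_limit([120, 150, 135]))  # 225.0
```

The result would then be applied to the task via `cylc broadcast` for the following cycles.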
But anyway, here’s an alternative way to approach the problem, tested with background, PBS and Slurm job submissions.
Rather than setting a timer inside of your job, this listens for the XCPU signal and uses it for clean shutdown:
flow.cylc:
[scheduling]
    [[graph]]
        R1 = sleepy
[runtime]
    [[sleepy]]
        script = sleepy
        execution time limit = PT5S
bin/sleepy:
#!/usr/bin/env python
from time import sleep
import signal

CONTINUE = True

def handle_signal(*args):
    global CONTINUE
    print('Caught XCPU')
    CONTINUE = False

for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGXCPU):
    signal.signal(sig, handle_signal)

def run(job):
    sleep(1)

for job in range(1, 1000):
    if CONTINUE:
        print(f'Run job #{job}')
        run(job)
    else:
        print('Exit cleanly')
        break
Implemented in Python, but obvs can be done in other languages too. A benefit of this is that it’s the standard approach for clean shutdown and can be abstracted to cover other signals (TERM, INT, etc).
But perhaps more generally, at least in Python, this can be reduced to:
dirty = False
try:
    for _ in range(1, 1000):
        dirty = True
        do_thing()
        dirty = False
finally:
    if dirty:
        tidy_thing()
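A runnable illustration of that reduced pattern, with do_thing/tidy_thing as placeholders and a simulated failure part-way through:

```python
items = []

def do_thing(i):
    items.append(i)
    if i == 3:
        raise RuntimeError('simulated interruption')

def tidy_thing():
    items.pop()  # roll back the half-done unit of work

dirty = False
try:
    for i in range(1, 1000):
        dirty = True
        do_thing(i)
        dirty = False
except RuntimeError:
    pass  # stand-in for the signal/limit being hit
finally:
    if dirty:
        tidy_thing()

print(items)  # [1, 2]
```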
Thanks @oliver.sanders
The templating approach is so easy that IMO we don’t need to consider adding a new environment variable to all jobs when the vast majority of them will never need it.
And the XCPU signal interception example is very nice. At first glance it may seem, to many users, more difficult than a shell-scripted timing loop, but it's cleaner and really not that difficult.
Maybe we should document both of these, under “Handling variable run-length jobs”…