Hi!
I have a long-running task that often fails with random read errors. I have to fix this, but in the meantime I simply made it retry up to 3 times, and in 99% of cases it works on the second try.
However, this task can also fail for a genuine reason, namely when the input data is invalid. I set up a specific exit code for that case and an err-script that detects it and sends a custom message, which in turn triggers an email to me.
The issue is that the email is only sent on the 4th run, which wastes resources and time. How can I make the job go straight into the failed state, without any retries, when this error code/message is triggered?
Here’s a simplified version of the runtime section:
[[long_task]]
    script = """
        # this script might fail with random read errors (exit code 1),
        # or hang for no apparent reason (hence the time limit),
        # or fail on invalid data (exit code 66)
        python -m the.long.running.script
    """
    err-script = """
        if [ "${CYLC_TASK_USER_SCRIPT_EXITCODE}" == 66 ]; then
            cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" 'Invalid Data'
        fi
    """
    execution retry delays = 3*PT30M
    execution time limit = PT5H
    [[[events]]]
        mail events = invalid
    [[[outputs]]]
        invalid = Invalid Data
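
To make the behaviour I am after concrete, here is a minimal plain-bash sketch, outside of Cylc's own retry mechanism: retry on generic failures, but stop immediately on exit code 66. The helper name `run_with_retries` and the variable `INVALID_DATA_EXIT` are hypothetical, and the 30-minute delay between tries is left out for brevity. This is only an illustration of the logic, not a Cylc feature.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry a command on generic failures, but give up
# at once when it exits with the "invalid data" code.
INVALID_DATA_EXIT=66

run_with_retries() {
    # $1 = maximum number of tries, remaining arguments = command to run
    local max_tries="$1"
    shift
    local attempt=1
    local rc=1
    while [ "$attempt" -le "$max_tries" ]; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        if [ "$rc" -eq "$INVALID_DATA_EXIT" ]; then
            # A real failure: retrying cannot help, so stop here.
            echo "try $attempt: invalid data (exit $rc), not retrying" >&2
            return "$rc"
        fi
        echo "try $attempt of $max_tries failed (exit $rc)" >&2
        attempt=$((attempt + 1))
    done
    # All tries used up: report the last exit code.
    return "$rc"
}

# Example call (assumed, mirrors the task above: 1 run + 3 retries):
#   run_with_retries 4 python -m the.long.running.script
```

With something like this inside `script`, the Cylc-level `execution retry delays` would no longer be needed for the read errors, but I would rather keep the retries in Cylc if there is a supported way to skip them for one exit code.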
Thanks!