Use of set -euo pipefail in Cylc-generated job scripts

Hi,

In Cylc 8.1.0, we have just noticed that set -euo pipefail is added behind the scenes to each generated job script. While we are generally in favour of these options for operational workflows, some consider these options a pitfall (and there is a follow-up story from the same author). We have just hit an issue caused by conda activate scripts: many activate scripts, for example gdal's, assume set -eu is not on and fail if it is. We will need to disable these options in our env-script at the earliest, or at least just before we call conda activate.
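For concreteness, the kind of wrapper we expect to need in our env-script (or immediately before activation) would be something like the sketch below - the miniconda path and environment name are placeholders, not our real setup:

set +eu                                   # relax the options the activate scripts can't cope with
source ~/miniconda3/bin/activate myenv    # may reference unset variables internally
set -eu                                   # restore the defaults afterwards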

Would it break anything to not force them on everyone straight away, or make it configurable? My preference is to leave it to the user to set options that are appropriate for their scripts.

I tend to agree with you there. We might need to revisit this and discuss in the team.

The problem is few users are aware of these options, and we were trying to make Cylc as robust as possible against unnoticed job failures - i.e. tasks that appear to succeed but actually failed to do what was intended.

The approach we’ve taken is (as you’ve noted) to apply those “safety” options by default. But users can always unset them in their own task scripting, if running scripts that reference unset variables or whatever.


I thought I should add that we do recognise it is documented behaviour - although a typo exists which was initially confusing.

Oops, sorry about the typo!

Ok, turning off set -euo pipefail in the env-script causes Cylc to always mark the task as succeeded. I’m guessing Cylc is relying on it to detect function failures? I would consider this at least a bug, because a user can override those settings, and may have good reason to.

[scheduler]
  allow implicit tasks = True

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  final cycle point = 3
  [[graph]]
    P1 = """
      b[-P1] => b
    """

[runtime]
  [[root]]
    platform = localhost
    env-script = "set +euo pipefail"
    script = """
      return 199
    """

I expect that should cause Cylc to list the task as failed.

2023-03-23T13:45:37+11:00 INFO - [3/b preparing job:01 flows:1] => submitted
2023-03-23T13:45:37+11:00 INFO - [3/b submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-03-23T13:45:44+11:00 INFO - [3/b submitted job:01 flows:1] => running
2023-03-23T13:45:44+11:00 INFO - [3/b running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-03-23T13:45:50+11:00 INFO - [3/b running job:01 flows:1] => succeeded

Hi @TomC

It’s only the set +e that’s relevant in your example. A slightly simplified version:

[scheduling]
  [[graph]]
    R1 = "b"
[runtime]
  [[b]]
    script = "return 199"  # job fails - "return 199" indicates an error
    # OR
    script = "set +e; return 199"  # job succeeds - "set +e" means "ignore errors"!

I should really have said above that set -u (fail on referencing undefined variables) could arguably be left up to the user. But not using set -e would be really dangerous. By default, and with set +e, shell scripts just carry on if a subprocess or function returns a non-zero exit status. By putting set +e in your env-script above you are literally telling the job script to ignore internal errors and try to carry on as if nothing happened. So I don’t think this is a bug in Cylc!
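To make that concrete, here is a tiny plain-bash illustration (nothing Cylc-specific) of what set +e versus set -e does with a failing command:

set +e
false                   # returns 1, but the script just carries on
echo "carried on"       # this line is reached

set -e
false                   # now the script aborts here, with status 1
echo "never reached"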

I would consider it a bug. The final return statement is non-zero. Cylc should be able to handle a non-zero return status with set +e and treat it appropriately as an error.

Users can of course use set +e at their own risk. I can imagine you might occasionally need to run some executable that is known to return a non-zero code even though it “succeeded”:

script = """
...
set +e
run-executable-with-weird-return-code
set -e
# now carry on safely
...
"""

OK I understand where you’re coming from, but the problem is, without set -e almost every real job script will exit with apparent final success no matter what errors are made internally.

A contrived example, but realistic I think:

set +e
run-my-huge-atmos-model  # aborts with error status
touch "something.done"  # succeeds!

Congratulations, your task has just succeeded :tada: … but you won’t find out [that it really failed!] till some later point in the workflow.

I would not have expected Cylc to fail on return 199, because the call to return is an error in this situation (and causes an error that is then squashed by the set +e), which is different from the job exiting with a non-zero value. Explicitly exiting with a non-zero value causes job failure, with or without set +e, as you’d hope.

[runtime]
  [[root]]
    platform = localhost
    env-script = "set +e"
    script = """
      exit 199
    """

(Ah yes, good point @jarich ) …

… although, I guess you’re still coming down on the side of not using set -e by default?

… in which case I’d still make the claim that literally zero real job scripts bother to check for success (e.g. that the output files that should have been generated, were generated) at the end, and then explicitly do the appropriate exit N.

I understand and agree with your claim that “real job scripts [don’t] bother to check for success”. I have been a proponent of adding defensive programming measures everywhere I can influence for many years. That’s not always adding set -e though.

I have suggested that the responsible teams should insert post-phase testing in all sorts of systems throughout my career, especially because in the areas I’ve worked in “some” failures have always been tolerable so long as “some” < “too many”.

Just recently we had an issue where “too many” “optional” observations were missing (for external-to-workflow reasons), and yet this did not cause a failure until well into a model run. Post-phase testing could have caught this. If we had assessed “do we have enough observations to continue?” after observations were gathered (post observation-collecting phase), or before the model really kicks off (pre-computation phase), we would have seen the failure sooner and known where to look sooner. I think most of our models do actually perform this check after the observation-collecting phase, and it was a surprise this one did not.

Equivalently, some of our workflows are tolerant of generating “enough if not all” of their outputs. So again, it might not be a problem if 5/50 output jobs don’t actually succeed (but get a silent pass), so long as there are not more than 5 failures. As far as I know we still have some workflows which don’t check these outputs, and sometimes it’s not until the products are used that missing products are discovered. Post-product-creation phase testing here should check that the output files that should have been generated were generated.

In both of these examples of post-phase testing (opportunities) it’s not the “soft” failures of the earlier scripts that are the problem (assuming no overt coding failures). It’s the total number of “soft” failures. Adding set -e might cause some of the “soft” failures to become “hard” failures (unlikely because most of our scripts already use set -e voluntarily) but I don’t think it solves the problem of “did I collect/collate/generate enough”, which is what you’re raising as the justification.

I use set -e in almost all of my shell scripts, and probably 90% of the ones that don’t have it are that way because I forgot. At the same time, if code that “worked” while I was hand-testing it on the command line and manually submitting it failed as soon as I put it in a Cylc job, because of error flags I didn’t add, I would find that confusing.

TomC said (elsewhere) that returning with a non-zero value caused a job failure in Cylc 7. I haven’t been able to find the appropriate job.err files for my few tests of this (something I’ll need to sort out), but certainly, given that Cylc puts the script into a function, my earlier assertion that calling return (outside of a function) was invalid, and was thus being forgiven by set +e, is in error. So if returning from cylc__job__inst__script() with a non-zero value doesn’t look like a failure to Cylc 8, that probably isn’t right (and may imply oddities in the scoping of the error flags).
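On the scoping point, one thing worth noting (a plain-bash sketch, not Cylc’s actual job script): shell options are not scoped to functions, so a set +e inside one script item persists for everything the job runs afterwards:

f() { set +e; }

set -e
f
false                   # does NOT abort, because f turned errexit off for the whole shell
echo "still running"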


IMO I really don’t think these options should even be configurable; it’s just bad practice that we wouldn’t want to encourage/support. You can explicitly override these safety features in Cylc scripts as desired, or use a standalone script (e.g. in the bin/ dir) if you want full control over the shell’s configuration.


I don’t think the discussion is about whether people should use set -euo pipefail, but whether Cylc should be forcing it on people the way it does. Almost all of our operational scripts have it on. Those that don’t, I think, have set -e but not set -u enabled, and that is intentional.

With how Cylc is currently set up in its job.sh, we can’t safely override the options though (well, “can’t override set -e” is more precise, as pointed out by @hilary.j.oliver). If we do set +e, then Cylc breaks, because it does not check the return code of its functions but relies on set -e being on. This was not the case in Cylc 7.

[scheduling]
    initial cycle point = 20200101
    final cycle point = 20200101
    [[dependencies]]
        [[[P1D]]]
        graph = hello[-P1D] => hello

[runtime]
    [[hello]]
        script = set +euo pipefail; return 1

....

2023-03-23T22:33:31Z INFO - [hello.20200101T0000Z] -submit-num=01, owner@host=scs-cylc1-dev.bom.gov.au
2023-03-23T22:33:32Z INFO - [hello.20200101T0000Z] status=ready: (internal)submitted at 2023-03-23T22:33:32Z for job(01)
2023-03-23T22:33:32Z INFO - [hello.20200101T0000Z] -health check settings: submission timeout=None
2023-03-23T22:33:33Z INFO - [hello.20200101T0000Z] status=submitted: (received)started at 2023-03-23T22:33:33Z for job(01)
2023-03-23T22:33:33Z INFO - [hello.20200101T0000Z] -health check settings: execution timeout=None
2023-03-23T22:33:34Z CRITICAL - [hello.20200101T0000Z] status=running: (received)failed/ERR at 2023-03-23T22:33:33Z for job(01)
2023-03-23T22:33:34Z CRITICAL - [hello.20200101T0000Z] -job(01) failed
2023-03-23T22:33:35Z WARNING - suite stalled

If you want to force the options on, that is your team’s decision. But people should be able to safely override them if they need to, and still know Cylc will register non-zero return statuses from its functions as failures. We are heavily using conda, and many conda packages will force a workflow to fail due to their activate scripts. I personally would prefer not to have to do set +eu; source ~/miniconda3/bin/activate myenv; set -eu (or have that as a function in the global init script, or something like that) to ensure Cylc will register failures.
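If we did go the helper-function route, it would presumably look something like the sketch below in a global init script (the function name and conda path are made up for illustration):

safe_conda_activate() {
    set +eu
    source ~/miniconda3/bin/activate "$1"
    set -eu
}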

Basically, what I’m suggesting is

# Run user scripts.
cylc__job__run_user_scripts() {
    typeset func_name=
    for func_name in 'env_script' 'user_env' 'pre_script' \
           'script' 'post_script'; do
        cylc__job__run_inst_func "${func_name}"
    done
} 

Gets changed to the below, to handle the case where set -e has been turned off (i.e. someone has done set +e).

# Run user scripts.
cylc__job__run_user_scripts() {
    typeset func_name=
    for func_name in 'env_script' 'user_env' 'pre_script' \
           'script' 'post_script'; do
        cylc__job__run_inst_func "${func_name}" || return $?
    done
} 

I think that would cover someone doing set +e?
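For what it’s worth, a tiny plain-bash illustration of why the explicit || check helps (a sketch, not the real job script):

step() { return 1; }

set +e
step                    # exit status 1 is silently discarded
step || exit $?         # the explicit check still propagates the failure
echo "never reached"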


@TomC

Back on this topic, does my reply to your other post make any difference to your view here?

Specifically, while you have looked under the hood, users are not supposed to know (it is not documented) that script items happen to be run as separate functions by the job script boilerplate - that’s just internal implementation.

Given that, why would users put return 1 in their script fragments, as in your example, as opposed to inside their own functions - whose return values they can check if they like?
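For example, a user who wants a checkable non-zero status can handle it inside their own script item - a sketch only, with made-up names:

my_step() {
    # ... something that might legitimately report a problem ...
    return 1
}

if ! my_step; then
    echo "my_step reported a problem, handling it here" >&2
fi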

If I replace return 1 with exit 1 then Cylc 8 does still detect that as a job failure, even with set +e.

Hilary

When I wrote this above, I was thinking just of set -u.

Writing code that routinely references unset variables is lazy and bad, as a rule. But it is an unfortunate fact that we all sometimes have to use scripts like that - scripts we don’t control and which, despite the bad-practice internals, aren’t actually broken. Users here have encountered that with conda environment activation, I think.

The question is, what would cause the most trouble?

  • impose set -u and expect users to bracket occasional offender scripts in set +/-u
  • or don’t impose set -u and get untold billions of bugs when no one ever uses it in their workflows

You can probably tell by the wording which side we came down on in the development team.

IMO Cylc should force these defaults for built-in script settings; there are many good reasons for doing this.

Consider this example in the case of your operational scripts which don’t have set -u enabled:

rm -rf "${FOO}/"  # if FOO is undefined this means: rm -rf /

Situations like this simply should not be allowed to happen! That is why we turn on safety features in Bash which make it behave more like other programming languages. It’s better to find breakages sooner and fix them sooner. Cylc makes it easy to adjust the script settings on the fly with broadcasts.
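For instance, something along these lines would temporarily relax an option for a single task (the workflow and task names here are made up; check cylc broadcast --help for the exact options):

cylc broadcast -n my_task -s 'pre-script = set +u' my_workflow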

If you really want to disable these safety features there are currently two ways to do this:

  1. Make changes in your Cylc scripts. This does work; we have a few examples in our workflows where this approach is used to work around flaky scripts, like so:

    set +eu
    conda activate env
    set -eu
    
  2. Use a dedicated script file e.g. stored in the bin/ directory.

    # bin/my-script
    set -eo pipefail
    rm -rf "${FOO}/"  
    

I appreciate that adding an option to change a default might sound simple and harmless on the surface. But this would be a switch which we would have to maintain and support. IMO, this would lead to far more problems than it solves and complicate our support role greatly.

I do understand where the Cylc developers are coming from, and as I said, it is your decision; I just disagree with it :slight_smile:. I guess I see Cylc purely as a workflow engine. You are trying to make it a workflow engine which also enforces various shell settings for what you perceive as best practice (I don’t disagree with your views, but I have pointed out others who consider these options bad practice).

One issue I have is the potential for inconsistent behaviour if someone runs a script via Cylc versus without Cylc. If people develop a script using Cylc, then things will fail. If they then run the script outside of Cylc, it will not, because they have never had the need to add set -euo pipefail at the top of their script. They won’t learn the pros and cons of using the flags, and it may lead to problems outside of the Cylc ecosystem which may not have existed if they had learnt to write shell scripts safely.

I also take issue with Cylc not being defensive and protecting itself against people doing set +euo pipefail, which is well within the rights of anyone to do. A simple change like I described earlier would add some safety to Cylc. I can send that to you as a MR if you want it.
