Cylc-defined environment variable occasionally not found

Occasionally, but apparently unpredictably, the suite I’m developing fails during a run for an odd (to me) reason. The Cylc User Guide tells me that CYLC_SUITE_OWNER is a variable defined by Cylc and available to my suite task scripts. I access it in a Python task script with:

    CYLC_SUITE_OWNER = os.environ['CYLC_SUITE_OWNER']

But occasionally, this command fails: the environment variable is unknown, e.g.

$ more job.err
Traceback (most recent call last):
  File "~/cylc_work/NIMO/bin/fetch_dynamic_params.py", line 11, in <module>
    from NIMO_utils import writelog, debug_var_dump, cleanup_workspace, \
  File "~/cylc_work/NIMO/bin/NIMO_utils.py", line 11, in <module>
    from constants_cylc import *
  File "~/cylc_work/NIMO/bin/constants_cylc.py", line 29, in <module>
    CYLC_SUITE_OWNER = os.environ['CYLC_SUITE_OWNER']
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'CYLC_SUITE_OWNER'
2020-05-27T20:45:03Z CRITICAL - failed/EXIT
$

In maybe 49 out of 50 runs this doesn’t come up, but I just got bitten by the 1-in-50 again and decided to ask. Does anyone have any idea why this would happen? I’m using cylc-flow-7.8.3.

Thanks much!

Hi there - I had this variable put into Cylc to make it available to the suite’s environment for Jinja2 parsing… I’ll check the job environment.

Hi @funkapus,

Interesting, I haven’t seen that happen before (but then I suspect almost no one uses $CYLC_SUITE_OWNER in job scripts).

That variable, along with $CYLC_SUITE_HOST, is extracted from the suite’s .service/contact file on the job host, in the boilerplate job init code in lib/cylc/job.sh (in cylc-7):

    typeset contact="${CYLC_SUITE_RUN_DIR}/.service/contact"
    if [[ -f "${contact}" ]]; then
        export CYLC_SUITE_HOST="$(sed -n 's/^CYLC_SUITE_HOST=//p' "${contact}")"
        export CYLC_SUITE_OWNER="$(sed -n 's/^CYLC_SUITE_OWNER=//p' "${contact}")"
    fi

So: $CYLC_SUITE_OWNER should exist in your job environment so long as the contact file exists.
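
If you want to see what that does, here’s a rough Python equivalent of the extraction above - a sketch only, assuming the contact file holds plain KEY=VALUE lines (which is what the sed patterns imply):

    import os

    # Sketch of the job.sh logic above: pull CYLC_SUITE_HOST and
    # CYLC_SUITE_OWNER out of the contact file, if it exists.
    contact = os.path.join(
        os.environ['CYLC_SUITE_RUN_DIR'], '.service', 'contact')
    if os.path.isfile(contact):
        with open(contact) as handle:
            for line in handle:
                key, sep, value = line.rstrip('\n').partition('=')
                if sep and key in ('CYLC_SUITE_HOST', 'CYLC_SUITE_OWNER'):
                    os.environ[key] = value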

The contact file should exist whenever the suite is running - job status messaging uses it to find out where to send status updates home. It gets created the first time a job is sent to that job host, and it gets deleted when the suite shuts down cleanly.

I think the first job on a remote host won’t run until the host has been initialized (which includes creating the contact file). So, unless some external process is deleting your contact file (which seems unlikely), your job must be starting to execute after suite shutdown. That could happen if the job was queued (e.g. to PBS) when you did a shutdown with --now: the orphaned job would then be released (by PBS) to run after the shutdown. Is that what happened? If not, can you say anything more about when the problem occurs? (e.g. is it always the first job on that job host, in a new suite run?)

As a workaround you could set your own HOST and USER (/owner) variables in the job environment by extracting them from the suite host environment with Jinja2 at start up. But be aware that orphaned running jobs will then end up with the wrong values (if it matters) if you restart the suite on another host (e.g. via self-migration from a condemned host) … something which the contact file method handles nicely.

(I guess we should put default values in the job environment, in case users reference these variables when the contact file is not found - which necessarily prevents a job from reporting its status, but shouldn’t stop the job from running.)

(And I guess we should also document that CYLC_SUITE_HOST and CYLC_SUITE_OWNER in job environments will - quite correctly - not have valid values if the parent suite daemon is no longer running at the moment the job starts up).
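
In the meantime, a tolerant lookup on your side would avoid the hard failure. A minimal sketch - the '(unknown)' placeholder is just for illustration:

    import os

    # Fall back to a placeholder instead of raising KeyError when the
    # contact file (and hence the variable) is absent.
    CYLC_SUITE_OWNER = os.environ.get('CYLC_SUITE_OWNER', '(unknown)')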

The init-script runs beforehand (before the Cylc environment is exported).

Something running here? Or perhaps there’s a file system issue 1/50 times?

Yes, but that’s just the optional first piece of user-defined job scripting, it’s not the remote host init process that I was referring to (which creates the contact file on the job host).

Thanks for the fast reply!

First off, this is not an issue that’s critical to my suite. The only reason I use this particular Cylc environment variable at all is that I have debug levels internal to my suite; at the most detailed level, each task starts by dumping all the known Cylc- and suite-defined environment variables to a text file in the task work dir. So I’m only grabbing the value of that environment variable so that I can dump it, along with all the others; I don’t actually use it for anything else. If I were never able to solve this problem, I’d just remove the print() from the dump and remove the os.environ[] call. So there’s not really a need for a workaround.
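
For illustration, that kind of dump can be written so a missing variable can’t kill the task - a sketch rather than my exact code, with a made-up filename:

    import os

    # Dump every CYLC_* variable that is present; note expected ones that
    # are missing, rather than raising KeyError. Names are illustrative.
    expected = ('CYLC_SUITE_OWNER', 'CYLC_SUITE_HOST')
    with open('env_dump.txt', 'w') as handle:
        for key in sorted(os.environ):
            if key.startswith('CYLC_'):
                handle.write('%s=%s\n' % (key, os.environ[key]))
        for key in expected:
            if key not in os.environ:
                handle.write('%s=<not set>\n' % key)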

The task in question does not run on a remote host - in fact, it’s not even running through PBS. It’s a script that fetches some data over the network. The suite wasn’t shut down before execution (by me, anyway). The task is triggered by an external trigger (cycle point = wall clock time). When the suite starts, its initial cycle point is the next 15-minute mark. Upon startup, several other initialization/configuration tasks run. Typically they finish well before the 15-minute mark; if so, the suite sits idle until that time. When those startup tasks are complete and the wall clock prerequisite is met, this task gets triggered.

OK, thanks for the additional info. The thick plottens… we’ll have to think about how that could be possible (that the contact file is not found on the suite host when a local task starts executing). Good news that you’re not using the variable for anything important, at least. (It’s certainly not a useful variable in local jobs!)

This is very strange, because (on the suite host) the contact file is only written at start-up, and only removed at shut-down, but from your description the problem task job executes well between those two points.
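
One way to gather evidence would be a tiny diagnostic at the top of the failing task that records whether the contact file exists at the moment the job starts - a hypothetical sketch, with a made-up log filename:

    import datetime
    import os

    # Record whether the contact file exists when this task starts, to
    # correlate with the ~1-in-50 failures. The log filename is made up.
    contact = os.path.join(
        os.environ.get('CYLC_SUITE_RUN_DIR', ''), '.service', 'contact')
    stamp = datetime.datetime.utcnow().isoformat()
    with open('contact_check.log', 'a') as handle:
        handle.write('%s exists=%s\n' % (stamp, os.path.isfile(contact)))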

@funkapus, in your Cylc job configuration, are you executing your environment-print script via the normal script, pre-script, or post-script items, and not - referring to @dwsutherland’s comment above - via init-script, which runs first thing in the job script, before the CYLC_ environment is exported? If you were using init-script it would fail every time, though, not ~1/50 times - unless perhaps you background the process (init-script = print-env.py &) rather than running it serially (init-script = print-env.py).

Perhaps the problem is not directly related to Cylc? I doubt it’s some kind of filesystem issue because I think it’d turn up in some other way (it’s a busy system with several users). But if tasks communicate with the Cylc server via http/s by default (I haven’t changed any defaults on this machine), maybe it’s a symptom of some intermittent networking issue?

Unfortunately I nuked everything before realizing that the server log would be useful too.

One more question: can you clarify this statement?

but apparently unpredictably, the suite I’m developing fails during a run

“The suite fails” normally means the suite daemon aborted with an error (which would cause contact file deletion), but I’ve been assuming from the rest of your description that that did not happen. The only thing that failed was one of the jobs in the still-running suite. Is that right?