Cylc-defined environment variable occasionally not found

funkapus · May 27, 2020, 9:38pm

Occasionally, but apparently unpredictably, the suite I’m developing fails during a run for an odd (to me) reason. The Cylc Users Guide tells me that CYLC_SUITE_OWNER is a variable defined by Cylc and available to my suite task scripts. I access it in a Python task script with:

CYLC_SUITE_OWNER = os.environ['CYLC_SUITE_OWNER']

But occasionally, this command fails: the environment variable is unknown, e.g.

$ more job.err
Traceback (most recent call last):
  File "~/cylc_work/NIMO/bin/fetch_dynamic_params.py", line 11, in <module>
    from NIMO_utils import writelog, debug_var_dump, cleanup_workspace, \
  File "~/cylc_work/NIMO/bin/NIMO_utils.py", line 11, in <module>
    from constants_cylc import *
  File "~/cylc_work/NIMO/bin/constants_cylc.py", line 29, in <module>
    CYLC_SUITE_OWNER = os.environ['CYLC_SUITE_OWNER']
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'CYLC_SUITE_OWNER'
2020-05-27T20:45:03Z CRITICAL - failed/EXIT
$

Maybe 49 out of 50 runs, this doesn’t come up; but I just got bit by the 1/50 again, and decided to ask. Anyone have any idea why this would happen? Using cylc-flow-7.8.3.

Thanks much!

D_Sutherland · May 27, 2020, 10:27pm

Hi there - I had this put into Cylc to make it available to the suite’s environment for Jinja2 parsing… Will check the Job environment.

hilary.j.oliver · May 27, 2020, 10:49pm

Hi @funkapus,

Interesting, I haven’t seen that happen before (but then I suspect almost no one uses $CYLC_SUITE_OWNER in job scripts).

That variable, along with $CYLC_SUITE_HOST, is extracted from the suite’s .service/contact file on the job host, in the boiler-plate job init code in lib/cylc/job.sh (in cylc-7):

    typeset contact="${CYLC_SUITE_RUN_DIR}/.service/contact"
    if [[ -f "${contact}" ]]; then
        export CYLC_SUITE_HOST="$(sed -n 's/^CYLC_SUITE_HOST=//p' "${contact}")"
        export CYLC_SUITE_OWNER="$(sed -n 's/^CYLC_SUITE_OWNER=//p' "${contact}")"
    fi

So: $CYLC_SUITE_OWNER should exist in your job environment so long as the contact file exists.
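For readers more at home in Python than in shell, the lookup above can be sketched roughly as follows. This is only an illustration of the same idea, not Cylc code: the function name read_contact_value is hypothetical, and it assumes the contact file holds plain KEY=VALUE lines, as the sed expressions imply. Unlike the shell snippet, it returns None both when the file is missing and when the key is absent.

```python
from pathlib import Path
from typing import Optional


def read_contact_value(contact_path: str, key: str) -> Optional[str]:
    """Return the value for `key` from a KEY=VALUE file, or None.

    Hypothetical helper: roughly mirrors `sed -n 's/^KEY=//p'`, but
    returns only the first matching line.
    """
    path = Path(contact_path)
    if not path.is_file():
        # Same situation as the `[[ -f "${contact}" ]]` test failing:
        # nothing is read, so the variable would never be exported.
        return None
    prefix = key + "="
    for line in path.read_text().splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return None
```

With a missing contact file the function returns None, which corresponds to the job starting with no CYLC_SUITE_OWNER in its environment at all.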

The contact file should exist whenever the suite is running - job status messaging uses it to find out where to send status updates home. It gets created the first time a job is sent to that job host, and it gets deleted when the suite shuts down cleanly.

I think that the first job on a remote host won’t run until the host is initialized (including contact file creation). So unless some external process is deleting your contact file (which seems unlikely), your job must be starting to execute after suite shutdown. That could happen if it was queued (e.g. to PBS) when you did a shutdown with --now, so that the orphaned job was released (e.g. by PBS) to run after the shutdown. Is that what happened? If not, can you say anything more about when the problem occurs (e.g. is it always the first job on that job host, in a new suite run)?

As a workaround you could set your own HOST and USER (/owner) variables in the job environment by extracting them from the suite host environment with Jinja2 at start up. But be aware that orphaned running jobs will then end up with the wrong values (if it matters) if you restart the suite on another host (e.g. via self-migration from a condemned host) … something which the contact file method handles nicely.

hilary.j.oliver · May 27, 2020, 10:55pm

(I guess we should put default values in the job environment in case users reference these variables when the contact file is not found. A missing contact file necessarily prevents a job from reporting its status, but it shouldn’t stop a job from running).

(And I guess we should also document that CYLC_SUITE_HOST and CYLC_SUITE_OWNER in job environments will - quite correctly - not have valid values if the parent suite daemon is no longer running at the moment the job starts up).

D_Sutherland · May 27, 2020, 11:05pm

The init-script runs beforehand:

github.com

cylc/cylc-flow/blob/7.8.x/lib/cylc/job.sh

Something running here? Or perhaps there’s a file system issue 1/50 times?

hilary.j.oliver · May 27, 2020, 11:14pm

Yes, but that’s just the optional first piece of user-defined job scripting, it’s not the remote host init process that I was referring to (which creates the contact file on the job host).

funkapus · May 27, 2020, 11:19pm

Thanks for the fast reply!

First off, this is not an issue that’s critical to my suite. The only reason I use this particular Cylc environment variable at all is that I have debug levels internal to my suite; at the most detailed level, each task starts by dumping all the known Cylc- and suite-defined environment variables to a text file in the task work dir. So I’m only grabbing the value of that environment variable so that I can dump it, along with all the others; I don’t actually use it for anything else. If I were never able to solve this problem, I’d just remove the print() from the dump and remove the os.environ[] call. So there’s not really a need for a workaround.
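A debug dump like the one described here can be made tolerant of a missing variable by reading the environment with a default instead of indexing it. The following is a minimal sketch, not the poster's actual code: the function name dump_env and the "<unset>" placeholder are hypothetical.

```python
import os
from typing import Iterable, List, Mapping, Optional


def dump_env(names: Iterable[str],
             environ: Optional[Mapping[str, str]] = None,
             missing: str = "<unset>") -> List[str]:
    """Return NAME=VALUE lines, using a placeholder for unset names.

    Hypothetical helper: .get() never raises KeyError, so one missing
    variable cannot abort the whole dump.
    """
    if environ is None:
        environ = os.environ
    return [f"{name}={environ.get(name, missing)}" for name in names]


if __name__ == "__main__":
    for line in dump_env(["CYLC_SUITE_OWNER", "CYLC_SUITE_HOST"]):
        print(line)
```

This keeps the dump running in the rare failing case and records which variable was absent, which may also help narrow down when the problem occurs.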

The task in question does not run on a remote host – in fact, it’s not even running through PBS. It’s a script that fetches some data over the network. The suite wasn’t shut down before execution (by me, anyway). It is a task that’s triggered by an external trigger (cycle point = wall clock time). When the suite starts, its initial cycle point is the next 15 minute mark. Upon startup, several other initialization/configuration tasks run. Typically they finish well before the 15 minute mark occurs; if so, the suite sits idle until that time. When those startup tasks are complete and the wall clock prerequisite is met, this task gets triggered.

hilary.j.oliver · May 27, 2020, 11:33pm

OK, thanks for the additional info. The thick plottens… we’ll have to think about how that could be possible (that the contact file is not found on the suite host, when a local task starts executing). Good news that you’re not using the variable for anything important at least. (It’s certainly not a useful variable in local jobs!).

hilary.j.oliver · May 27, 2020, 11:50pm

This is very strange, because (on the suite host) the contact file is only written at start-up, and only removed at shut-down, but from your description the problem task job executes well between those two points.

hilary.j.oliver · May 28, 2020, 12:04am

@funkapus, in your Cylc job configuration, are you executing your environment-print script via the normal script, pre-script, or post-script items, and not (referring to @dwsutherland’s comment above) via init-script, which runs first thing in the job script, before the CYLC_ environment is exported? If you were using init-script it would fail every time, though, not ~1/50 times, unless perhaps you background the process (init-script = print-env.py &) rather than running it serially (init-script = print-env.py).

funkapus · May 28, 2020, 12:07am

Perhaps the problem is not directly related to Cylc? I doubt it’s some kind of filesystem issue because I think it’d turn up in some other way (it’s a busy system with several users). But if tasks communicate with the Cylc server via http/s by default (I haven’t changed any defaults on this machine), maybe it’s a symptom of some intermittent networking issue?

Unfortunately I nuked everything before realizing that the server log would be useful too.

hilary.j.oliver · May 28, 2020, 12:08am

One more question, can you clarify this statement:

but apparently unpredictably, the suite I’m developing fails during a run

“The suite fails” normally means the suite daemon aborted with an error (which would cause contact file deletion), but I’ve been assuming from the rest of your description that that did not happen. The only thing that failed was one of the jobs in the still-running suite. Is that right?
