Separating task stdout from stdout of script within? Buffering issue?

Hi all,

Nooby query on expected behaviour of a script’s stdout, and how this interacts with the parent cylc task’s own stdout, as I’m running into something which looks like the two streams are getting slightly mixed up.

Have had a quick look through cylc docs / tutorial / discourse, but haven’t spotted anything - do please point me in the right direction if I’ve missed something!

I’ve got a workaround, but wondering if there’s anything else clever I could be doing. Or indeed if having something “clever” is something which could be an enhancement / documentation improvement.

Details below - happy to provide any more needed!
Thanks,
Edmund

Goal
I’m basically trying to look at how the contents of a model directory get changed as various stages of the model get run. This is for debugging - to isolate the stage where something gets created in the wrong place, and hence work out why.

Setup
CYLC_VERSION=7.8.3
ROSE_VERSION=2019.01.2

I’m tackling my goal by running a (parametrised) task locally (submitting via background) - this runs a tree command on the model directory - a (read-only) NFS export of the HPC where the model is run.
Specifically, my suite.rc contains a family and task along following lines:

[[LOCAL_JOB]]
    # Settings for job to be submitted on local machine
        [[[remote]]]
            # set current host (where "rose suite-run" cmd was invoked)
            host = {{ ROSE_ORIG_HOST }}
        [[[job]]]
            batch system = background

    [[model<stage>_list]]
        inherit = LOCAL_JOB

        script = """
            set -e
            echo "Listing after ${CYLC_TASK_PARAM_stage} stage:"
            tree -Difanqpugs --dirsfirst ${LIST_DIR}
            """

        [[[environment]]]
            LIST_DIR = "{{ HPC_HOME_NFS }}/{{ MODEL_DIR }}"

Result
As I’ve not directed the script’s stdout anywhere separate, so this is going to the job.out.
In the job.out, this results in something like the following, where the cylc task’s <utctimenow> INFO - started message has got mixed up (interleaved) with the tree output from the task’s script:

Suite    : mi-su173
Task Job : 20190501T0000Z/model_copy_list/01 (try 1)
User@Host: me@mypc

Listing after copy stage:
/hpc/path/to/model
[drwxr-xr-x me  users           4096 Jun  4  2019]  /hpc/path/to/model
<snip ~50 lines>
[-rw-r--r-- me  users          34497 Jun  2  2019]  /hpc/path/to/model/src/code.f90
[-rw-r--r-- me  users           1158 Jun  2  2019]  /hpc2020-03-07T14:10:45Z INFO - started
/path/to/model/src/Makefile
[-rw-r--r-- me  users           1906 Jun  2  2019]  /hpc/path/to/model/src/morecode.f90
<snip loads>
500 directories, 5000 files
2020-03-07T14:10:49Z INFO - succeeded

I’m guessing this mixing of the tree output with the task messages is because the two stdouts (task and script) aren’t explicitly {un}buffered or otherwise handled to avoid this sort of mixing.
And the size of the tree stdout here (/some other property of tree/background/other) may also contribute.

Workaround 0:
I was fine with this tree output being in the job.out: I’m just doing a visual inspection (and don’t expect to need to do this much in future - just as I debug this case), so could ignore (irrelevant) task message timestamp differences when I come to review diffs later.
I can definitely be lazy and do the same “just ignore it” approach with this stdout “collision” - inspect visually / hand-edit if really needed (there aren’t that many stages to diff).

Workaround 1:
I’m probably going to go for a slightly better approach - put my script = """<as above>""" contents into a list.sh script in the suite’s bin dir, and within that list.sh direct the tree output into its own file*.
This to better separate concerns: effectively treat the tree output as “data” worth separating from stdout (as it basically is for my purposes). A viewpoint with helpful side-effect that it avoids this sort of collision!
* and in process expand out all those opaque tree flags!

Questions
What I’m wondering is if there are alternative approaches I could have used?

E.g. could I redirect the script stdout within the [[model<stage>_list]] task in suite.rc? Something like:
script = list.sh 1> ${CYLC_TASK_WORK_DIR}\list.stdout 2>${CYLC_TASK_WORK_DIR}\list.stderr
Can’t see anything in docs/tuts indicating this sort of script stdout/err redirection within suite.rc is possible - I may be missing something though.
The nearest I can see is the note covering PBS stdout in https://cylc.github.io/doc/built-sphinx-single/index.html#task-stdout-and-stderr-logs, but don’t think this applies to my (background-submitted) job here, unless I’m (again) missing something?

Alternatively, is there some cylc-level setting I can set to handle the buffering of the script/task stdout, such that if script is writing to stdout, I can avoid these sorts of collisions between the script and task stdouts? E.g. to flush the task’s own stdout, so that the <utctimenow> INFO - started message always comes before* the script’s stdout in the job.out?
* or after - am not that fussed as long as it’s deterministic!

Or is there another standard way of handling this (other than my workarounds!) which I’ve missed?

Or is this something which isn’t currently provided / is but isn’t documented?

Yes, you can and that’s probably the cleanest solution. There’s in fact nothing stopping you from using most features of Bash in your job script. The exceptions are undoing any of set -euo pipefail and setting signal traps, which may interfere with Cylc job implementation[*].

You can also put these extra log files in ${CYLC_TASK_LOG_DIR}, so that they are viewable via rose-bush, and with extra log files settings you can even enable viewing them directly in gcylc GUI.

[*] Even this restriction might be lifted in Cylc 8 by executing user code in a separate process, so it can’t interfere with Cylc job script implementation (https://github.com/cylc/cylc-flow/pull/3440).

1 Like

Thanks Tomek - very handy to know, especially warning re the signal traps for cylc < cylc 8.

I’ve now gone for creating a separate log file within my (new) list.sh, and think in this narrow circumstance this probably best, as script now a bit more complex (I have to do some dir checking first myself - tree doesn’t seem to exit with non-zero status if target directory doesn’t exist; all messages incl error messages seem to go to stdout).

However this very handy to know - I’d put a -E error trap in my script - will remove it. And for other circumstances, will likely reach for this redirect-in-suite.rc option, as agree it’s ~clearer-up-front!
And good point too re log visibility - hadn’t thought of that - will direct to CYLC_TASK_LOG_DIR instead.

@TomekTrzeciak - just had a v quick look at your PR - looks good!
Didn’t dig deep - specifically, didn’t look through to your cylc-doc changes.
Do let me know if having a “redirect-within-suite.rc” example (maybe also pulling in the extra log files) is something you think would be useful which is currently missing in the changes you’ve made to the docs in PR above - if so am happy to try to make a contribution (as-and-when helpful) for review by you/others.

Thanks @TomekTrzeciak for answering this.

@edmundh - to explain what’s going on here, there is no separate stdout stream for task messaging and user scripting. Task job scripts are bash scripts that start by sending a “job started” message, then execute all user-defined scripting, then end by sending a “job succeeded” (or “job failed”) message. Cylc then arranges for the stdout from that script to be captured to job.out in the task log directory.

So the fact that job.out starts and ends with cylc message output, with the output from your user-defined task scripting in between is deliberate, and (at least arguably!) a good thing: the messages identify the log as having come from a Cylc task job, and provide start and end timestamps for the job.

But if you don’t want that, then @TomekTrzeciak has suggested the correct solution: deliberately direct the output from your scripting to the another file in the same Cylc-defined job log directory.

Regards,
Hilary

1 Like

Great, thanks @hilary.j.oliver - definitely agree re good thing!

To be clearer, my issue was basically that the start timestamp message was being injected/interleaved partway through my task script output - and I don’t want this.

Specifically, I expected/wanted something like following in the job.out:

<a utc timestamp> INFO - started
<my script output>
<a later utc timestamp> INFO - succeeded

But what I got was:

<my script output, partway through (not on a new line) including "<a utc timestamp> INFO - started">
<a later utc timestamp> INFO - succeeded

Explicitly redirecting sounds like the best solution for me to avoid unwanted interleaving of task and script output.

And in future, @TomekTreziak’s PR running user scripts in separate processes may hopefully also mean such interleaving doesn’t happen?

I assume the interleaving happens because the cylc message is run in the background:


This avoids the task being held up if there is a delay communicating back to the suite but it does mean the output can get mixed up.

1 Like

Ah, sorry @edmundh - I missed the bit about interleaved output, I should have read your message more carefully!

As @dpmatthews said, that’s because the message command runs in the background to avoid holding up the important stuff in case of network problems.

But there’s a solution for that too, if it is a problem for you! You can see above, in the line after the one highlighted by @dpmatthews, the process ID of the cylc message command is captured, so you can wait for it to complete before anything else executes:

[runtime]
   [[my-task]]
       script = """
           wait $CYLC_TASK_MESSAGE_STARTED_PID   # <-------- !
           <user scripting...>
                """

There is actually an example of this, with explanation, in the User Guide, but maybe not easy to find: https://cylc.github.io/doc/built-sphinx-single/index.html#checkpointing-with-a-task

Hilary

1 Like

Ah, I didn’t expect this behaviour is actually documented rather than just an internal implementation detail. This won’t work after my PR goes in, but there is a replacement shell function for the same purpose. I’ll discuss this further on the PR.

1 Like