Help with flow file that doesn't fully execute

Dan_Holdaway · January 10, 2022, 10:02pm

Hi,

I am putting together a suite file with Cylc 8 but find that it only executes the first few tasks and then hangs. The scheduling part of the flow.cylc looks like:

[scheduling]

    initial cycle point = {{suite_properties["intial cylc point"]}}
    final cycle point = {{suite_properties["final cylc point"]}}
    runahead limit = {{suite_properties["runahead limit"]}}

    [[graph]]
        R1 = """
             BuildJedi & Stage
             """
        T00,T06,T12,T18 = """
                          GetBackground & GetObservations & JediConfig & BuildJedi[^] & Stage[^]
                          => RunJediExecutable => MergeObsDiags => SaveObsDiags => CleanCycle
                          CleanCycle[-PT6H] => CleanCycle
                          """
        R1/$ = """
               CleanCycle => CleanExperiment
               """

The tasks BuildJedi and Stage are supposed to run once (and they do) and the rest during each cycle. GetBackground, GetObservations and JediConfig all run. RunJediExecutable appears in the tui but never runs, even once the others are complete. Is there an obvious mistake in the flow? Or perhaps there’s a better way of defining the dependency on the tasks that run ahead of the time loop?

oliver.sanders · January 11, 2022, 9:48am

Hello,

There’s nothing obviously wrong with the graphing.

The log/workflow/log file should provide an explanation for what’s going on. Perhaps one of the upstream tasks returned a non-zero exit code or there’s a task communication issue.

For a healthy task execution you should find log messages matching the following:

<time> INFO - [<task>] => waiting
<time> INFO - [<task>] => waiting(queued)
<time> INFO - [<task>] => preparing
<time> INFO - [<task>] => submitted
<time> INFO - [<task>] => running
<time> INFO - [<task>] => succeeded

For an unhealthy task execution you might find messages like the following:

<time> INFO - [<task>] => failed  # or submit-failed
<time> WARNING - Incomplete tasks:
      * <task> did not complete required outputs: ['succeeded']
<time> CRITICAL - Workflow stalled

If there is an issue with task-communication (in which case tasks might run but be unable to communicate their status back to the scheduler) you might only see state transitions up to submitted but no further.

You might also see (polled) appearing in the log. This is ok if you are using communication method = poll, but could indicate issues otherwise.

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[platforms][<platform%20name>]communication%20method
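
To compare a stuck task against the healthy sequence above, its state transitions can be pulled out of the scheduler log. The following is only a sketch: `task_states` is a hypothetical helper, and it assumes log lines of the simplified form `<time> INFO - [<task>] => <state>` quoted above (the text inside the square brackets may carry more detail in real logs, and a task name that is a substring of another will match both).

```shell
#!/usr/bin/env bash
# Print the sequence of states logged for a task, in log order.
# Assumes each transition is logged as "... [<something with task name>] => <state>".
task_states() {
    log_file="$1"   # path to the scheduler log file
    task="$2"       # task name to look for
    # Keep lines that mention the task and contain a state transition,
    # then strip everything up to and including the "] => " marker.
    grep -F -- "$task" "$log_file" | grep -F -- '] => ' | sed 's/.*] => //'
}
```

For example, `task_states log/workflow/log RunJediExecutable` run from the workflow's run directory would list that task's states. A history that stops at submitted, while the job logs show the job finished, would point towards a communication problem rather than a graph problem.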

hilary.j.oliver · January 11, 2022, 10:01am

Hi @Dan_Holdaway

Your workflow seems to run fine for me, with assumed initial and final cycle points, and [scheduler]allow implicit tasks = True (so each task runs a local dummy job).

In this sort of situation it’s good to first replace all tasks with local dummy tasks (the easiest way is to temporarily delete your task definitions and use allow implicit tasks = True) to check that the graph does what you intend. It doesn’t really matter to Cylc where a task job runs, so long as the job status messages can get back to the scheduler.
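
A minimal sketch of such a dummy-task test is shown here. It writes a graph-only copy of the workflow to a scratch directory. The cycle points (20220101T00 to 20220102T00) are placeholders rather than values from the original workflow, the runahead limit is left at its default, and with no [runtime] section every task becomes an implicit local dummy task.

```shell
#!/usr/bin/env bash
# Write a graph-only flow.cylc into a scratch directory.
# The cycle points below are assumed placeholders, not the real experiment dates.
src_dir="$(mktemp -d)"
cat > "${src_dir}/flow.cylc" <<'EOF'
[scheduler]
    allow implicit tasks = True

[scheduling]
    initial cycle point = 20220101T00
    final cycle point = 20220102T00

    [[graph]]
        R1 = """
            BuildJedi & Stage
        """
        T00,T06,T12,T18 = """
            GetBackground & GetObservations & JediConfig & BuildJedi[^] & Stage[^]
            => RunJediExecutable => MergeObsDiags => SaveObsDiags => CleanCycle
            CleanCycle[-PT6H] => CleanCycle
        """
        R1/$ = """
            CleanCycle => CleanExperiment
        """
EOF
echo "dummy workflow source written to ${src_dir}"
```

The directory can then be validated, installed and played in the usual way. If the dummy version runs to completion, the graph is doing what was intended and the problem lies with the real jobs or their communication.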

If the stuck tasks submit their jobs to a different platform than the other tasks do, that suggests a network comms issue as indicated by @oliver.sanders

You can also look for the job.out and job.err files on the job host. If the jobs did run but could not send status messages back you should see clear evidence there.

hilary.j.oliver · January 11, 2022, 8:34pm

(That assumes your graph contains only task names, not family names, since family membership is determined by the [runtime] section.)

Dan_Holdaway · January 18, 2022, 4:00pm

Hi @hilary.j.oliver and @oliver.sanders. Thanks very much for the help. I think I have tracked the error down. Essentially what seems to have been happening is that in the file <path>/run1/.service/etc/job.sh I have two lines like:

cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" 'started' &
CYLC_TASK_MESSAGE_STARTED_PID=$!

It doesn’t find cylc at this point, I presume because the environment is not inherited from where the workflow is installed? So even though from the job logs it looked like everything was finishing properly I think the job is seen as failed or still running. The fix comes from putting the module load into my .bash_profile but this doesn’t seem very satisfactory. Is there something better I could do to load my cylc module at the appropriate time? I tried adding the following to the suite with no luck:

[runtime]
    [[root]]
        pre-script = "module load miniconda/3.9-cylc"

Many thanks,
Dan.

oliver.sanders · January 18, 2022, 4:25pm

Hello,

Jobs cannot always inherit their environment from the scheduler, e.g. if they are submitted to batch systems or remote platforms, so you will need to ensure the cylc executable is available in the “default” environment.

To do this we recommend using a light-weight “wrapper script” that performs any required activation and installing this wrapper script into a directory in your $PATH, ideally at the administrator level.

For example a simple wrapper script might look something like this:

#!/usr/bin/env bash
# load the environment
module load miniconda/3.9-cylc
# then run the command
exec "$@"

We provide a more advanced wrapper script in the Cylc source code which can handle multiple parallel Cylc installations at different versions. It can handle installations in virtualenv, Conda and Conda-Pack environments:

https://github.com/cylc/cylc-flow/blob/760b077b2cf470a32e85160e3c1a2c5f13e894ca/cylc/flow/etc/cylc

#!/usr/bin/env bash

# THIS FILE IS PART OF THE CYLC WORKFLOW ENGINE.
# Copyright (C) NIWA & British Crown (Met Office) & Contributors.
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

#------------------------------------------------------------------------------
# Wrapper script to support multiple installed Cylc & Rose versions. Handles

This file has been truncated.

Users can then change the version of Cylc they are using by setting the CYLC_VERSION environment variable.

$ cylc version
8.0.0
$ export CYLC_VERSION=7.8.8
$ cylc version
7.8.8

Cylc ensures the relevant environment variables are carried around with jobs and subcommands (including those on remote platforms) so that the correct Cylc executable is always invoked.

The documentation for this has only recently been written, but you can view it in the nightly build of the documentation.

https://cylc.github.io/cylc-doc/nightly/html/installation.html#managing-environments

Dan_Holdaway · January 18, 2022, 6:18pm

Thanks very much for the quick response and helpful suggestion for the workaround. Sounds good.

hilary.j.oliver · January 24, 2022, 10:04pm

If it’s not obvious from the description above, the wrapper script in the default path should be called cylc. In effect, it intercepts all cylc blah commands to ensure that the right version of Cylc is invoked (according to the environment: CYLC_VERSION in Cylc 7; and CYLC_VERSION or CYLC_ENV_NAME in Cylc 8).
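
To illustrate the interception, here is a rough sketch of a single-version wrapper installed under the name cylc. It is not the wrapper shipped with Cylc: `find_real_command` and `cylc_wrapper_main` are hypothetical names, and the module name is simply the one used earlier in this thread. A wrapper called cylc only receives the sub-command and its arguments, so it has to hand over to the real executable, and it must skip itself when searching $PATH to avoid calling itself recursively.

```shell
#!/usr/bin/env bash
# Print the first executable called $1 on $PATH that is not the file $2.
# The body runs in a subshell, so the IFS and globbing changes stay local.
find_real_command() (
    name="$1"
    self="$2"
    IFS=:
    set -f   # do not glob-expand PATH entries
    for dir in $PATH; do
        candidate="${dir:-.}/${name}"
        if [ -x "$candidate" ] && [ ! -d "$candidate" ] && [ ! "$candidate" -ef "$self" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
)

# Hypothetical entry point for a wrapper installed on $PATH as "cylc".
cylc_wrapper_main() {
    module load miniconda/3.9-cylc   # site-specific activation, as in the thread
    real_cylc="$(find_real_command cylc "$0")" || {
        echo "cylc wrapper: no Cylc installation found on PATH" >&2
        return 127
    }
    exec "$real_cylc" "$@"
}

# The installed wrapper would end with:  cylc_wrapper_main "$@"
```

This only covers one installation. Selecting between versions via CYLC_VERSION or CYLC_ENV_NAME is what the wrapper provided in the Cylc source code is for.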

Dan_Holdaway · January 25, 2022, 2:20am

Thanks for the clarification @hilary.j.oliver. Using the advice above we were able to get the workflow to complete properly. Thanks again for all the help.
