Help with flow file that doesn't fully execute

Hi,

I am putting together a suite file with Cylc 8 but find that it will only execute the first few tasks and then hangs. The scheduling part of the flow.cylc looks like:

[scheduling]

    initial cycle point = {{suite_properties["intial cylc point"]}}
    final cycle point = {{suite_properties["final cylc point"]}}
    runahead limit = {{suite_properties["runahead limit"]}}

    [[graph]]
        R1 = """
             BuildJedi & Stage
             """
        T00,T06,T12,T18 = """
                          GetBackground & GetObservations & JediConfig & BuildJedi[^] & Stage[^]
                          => RunJediExecutable => MergeObsDiags => SaveObsDiags => CleanCycle
                          CleanCycle[-PT6H] => CleanCycle
                          """
        R1/$ = """
               CleanCycle => CleanExperiment
               """

The tasks BuildJedi and Stage are supposed to run once (and they do) and the rest during each cycle. GetBackground, GetObservations and JediConfig all run. RunJediExecutable appears in the tui but never runs, even once the others are complete. Is there an obvious mistake in the flow? Or perhaps there’s a better way of defining the dependency on the tasks run ahead of the time loop?

Hello,

There’s nothing obviously wrong with the graphing.

The log/workflow/log file should provide an explanation for what’s going on. Perhaps one of the upstream tasks returned a non-zero exit code or there’s a task communication issue.

For a healthy task execution you should find log messages matching the following:

<time> INFO - [<task>] => waiting
<time> INFO - [<task>] => waiting(queued)
<time> INFO - [<task>] => preparing
<time> INFO - [<task>] => submitted
<time> INFO - [<task>] => running
<time> INFO - [<task>] => succeeded

For an unhealthy task execution you might find messages like the following:

<time> INFO - [<task>] => failed  # or submit-failed
<time> WARNING - Incomplete tasks:
      * <task> did not complete required outputs: ['succeeded']
<time> CRITICAL - Workflow stalled

If there is an issue with task-communication (in which case tasks might run but be unable to communicate their status back to the scheduler) you might only see state transitions up to submitted but no further.

You might also see (polled) appearing in the log. This is OK if you are using communication method = poll, but could indicate issues otherwise.

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[platforms][<platform%20name>]communication%20method
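
As a rough aid for reading the log, the checks described above could be scripted along these lines. This is only a sketch: the function names are hypothetical, and the search patterns are taken from the example messages quoted in this reply, so they may need adjusting to match the exact wording in your scheduler log.

```shell
# Sketch: pull the lines of interest out of a scheduler log.
# Function names are hypothetical; patterns come from the messages quoted above.

# Show every state transition logged for one task.
task_transitions() {
    local log_file="$1" task="$2"
    grep -F -- "$task" "$log_file" | grep -F -- ' => '
}

# Show failures, incomplete-output warnings and stall messages.
problem_lines() {
    local log_file="$1"
    grep -E -- '=> (failed|submit-failed)|Incomplete tasks|did not complete required outputs|Workflow stalled' "$log_file"
}
```

For example, task_transitions <path>/run1/log/workflow/log RunJediExecutable lists how far that task got. If the last line shows submitted and nothing after it, a task communication problem is a likely candidate.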

Hi @Dan_Holdaway

Your workflow seems to run fine for me, with assumed initial and final cycle points, and [scheduler]allow implicit tasks = True (so each task runs a local dummy job).

In this sort of situation it’s good to first replace all tasks with local dummy tasks to check that the graph does what you intend (the easiest way is to temporarily delete your task definitions and set allow implicit tasks = True). It doesn’t really matter to Cylc where a task job runs, so long as the job status messages can get back to the scheduler.

If the stuck tasks submit their jobs to a different platform than the other tasks do, that suggests a network comms issue as indicated by @oliver.sanders

You can also look for the job.out and job.err files on the job host. If the jobs did run but could not send status messages back you should see clear evidence there.

(That assumes your graph contains only task names, not family names, since family membership is determined by the [runtime] section.)
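
A minimal sketch of that dummy-task test follows. The directory name and the cycle points are assumptions chosen for illustration; the graph is copied from the original post, and there is deliberately no [runtime] section.

```shell
# Write a stripped-down copy of the workflow with no [runtime] section,
# so every task runs as a local dummy job.
# The directory name and cycle points below are illustrative assumptions.
src_dir="${TMPDIR:-/tmp}/jedi-graph-test"
mkdir -p "$src_dir"
cat > "$src_dir/flow.cylc" <<'EOF'
[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20220101T00
    final cycle point = 20220101T12
    [[graph]]
        R1 = """
             BuildJedi & Stage
             """
        T00,T06,T12,T18 = """
                          GetBackground & GetObservations & JediConfig & BuildJedi[^] & Stage[^]
                          => RunJediExecutable => MergeObsDiags => SaveObsDiags => CleanCycle
                          CleanCycle[-PT6H] => CleanCycle
                          """
        R1/$ = """
               CleanCycle => CleanExperiment
               """
EOF
# Then, in a shell where cylc is available, from inside "$src_dir":
#   cylc validate .
#   cylc install
#   cylc play jedi-graph-test   # assumes the default workflow name (the directory name)
```

If this version runs to completion, the graph itself is sound and the problem lies with the real task definitions or with job communication.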

Hi @hilary.j.oliver and @oliver.sanders. Thanks very much for the help. I think I have tracked the error down. Essentially what seems to have been happening is that in the file <path>/run1/.service/etc/job.sh I have two lines like:

cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" 'started' &
CYLC_TASK_MESSAGE_STARTED_PID=$!

It doesn’t find cylc at this point, I presume because the environment is not inherited from where the workflow is installed? So even though from the job logs it looked like everything was finishing properly I think the job is seen as failed or still running. The fix comes from putting the module load into my .bash_profile but this doesn’t seem very satisfactory. Is there something better I could do to load my cylc module at the appropriate time? I tried adding the following to the suite with no luck:

[runtime]
    [[root]]
        pre-script = "module load miniconda/3.9-cylc"

Many thanks,
Dan.

Hello,

Jobs cannot always inherit the environment from the scheduler, e.g. if they are submitted to batch systems or remote platforms, so you will need to ensure the cylc executable is available in the “default” environment.

To do this we recommend using a light-weight “wrapper script” that performs any required activation and installing this wrapper script into a directory in your $PATH, ideally at the administrator level.

For example a simple wrapper script might look something like this:

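#!/usr/bin/env bash
# install this script as "cylc" in a directory on the default $PATH
# load the environment (this must put the real cylc ahead of this wrapper on $PATH)
module load miniconda/3.9-cylc
# then run the real cylc command with the original arguments
exec cylc "$@"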

We provide a more advanced wrapper script in the Cylc source code which can handle multiple parallel Cylc installations at different versions. It can handle installations in virtualenv, Conda and Conda-Pack environments.

Users can then change the version of Cylc they are using by setting the CYLC_VERSION environment variable.

$ cylc version
8.0.0
$ export CYLC_VERSION=7.8.8
$ cylc version
7.8.8

Cylc ensures the relevant environment variables are carried around with jobs and subcommands (including those on remote platforms) so that the correct Cylc executable is always invoked.

The documentation for this has only recently been written, however, you can view it in the nightly build of the documentation.

https://cylc.github.io/cylc-doc/nightly/html/installation.html#managing-environments
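
To illustrate the idea of selecting an installation via CYLC_VERSION, here is a much-simplified sketch. It is not the wrapper shipped with Cylc: the SKETCH_CYLC_ROOT variable, the cylc-<version> directory layout and the "default" fallback name are all assumptions made for this example.

```shell
# Sketch only: resolve which cylc executable a wrapper would run.
# SKETCH_CYLC_ROOT, the cylc-<version> layout and the "default" name
# are assumptions for illustration, not the real wrapper's behaviour.
resolve_cylc() {
    local root="${SKETCH_CYLC_ROOT:-/opt/cylc}"
    local version="${CYLC_VERSION:-default}"
    printf '%s/cylc-%s/bin/cylc\n' "$root" "$version"
}

# A wrapper installed as "cylc" on the default $PATH could then end with:
#   exec "$(resolve_cylc)" "$@"
```

Because the wrapper execs an absolute path, it cannot accidentally call itself, whatever order the directories appear in on $PATH.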

Thanks very much for the quick response and helpful suggestion for the workaround. Sounds good.

If it’s not obvious from the description above, the wrapper script in the default path should be called cylc. In effect, it intercepts all cylc blah commands to ensure that the right version of Cylc is invoked (according to the environment: CYLC_VERSION in Cylc 7; and CYLC_VERSION or CYLC_ENV_NAME in Cylc 8).

Thanks for the clarification @hilary.j.oliver. Using the advice above we were able to get the workflow to complete properly. Thanks again for all the help.
