Cannot tell if the workflow is running error in failed workflow

deitrr · October 28, 2024, 7:52pm

I’ve just run into an issue with stopping/restarting a workflow in which some tasks failed. I started the workflow on Friday last week and two tasks failed over the weekend due to separate issues: one because I called the wrong script within the task; the other was killed by the sysadmin because I requested too much memory from the job runner (all the tasks in this workflow are using PBS as the job runner).

When I viewed the tree in the GUI this morning, the first task was red (as expected), while the second (killed by sysadmin) was light green/submitted. The workflow was still playing according to the GUI. When I pressed the stop button, I got an error. After closing and reopening the GUI, it is now showing the workflow as stopped, and when I press play I get this error:

 WARNING - Could not parse JSON:

ERROR - Cannot determine whether workflow is running on 10.102.15.227.
    /home/rrd001/miniconda3/envs/cylc_env/bin/python /home/rrd001/miniconda3/envs/cylc_env/bin/cylc play imsi-cylc-cap1
CRITICAL - Cannot tell if the workflow is running
    Note, Cylc 8 cannot restart Cylc 7 workflows.

This is, of course, after fixing the mistakes that led to the two tasks failing and reinstalling the workflow. Unfortunately, I didn’t copy the first error, which occurred when I pressed stop while the GUI was still showing the workflow as playing, but I think it was the same JSON warning followed by “Cannot tell if the workflow is running”.

I am able to play other workflows and I can ssh into 10.102.15.227. Is there some kind of cache file that Cylc is reading that got mangled when the job was killed? Or where might the json file be that Cylc is trying to parse?

This is in Cylc v8.3.3.

Thanks in advance,
Russell

deitrr · October 28, 2024, 9:25pm

Just an update: I’ve run into this again on another workflow. That one failed in the traditional way, i.e., it wasn’t killed by the sysadmin, so that is probably a red herring. More likely, then, I’ve got some global issue with Cylc in my local settings/install, or there is some system/machine issue that has nothing to do with Cylc.

wxtim · October 29, 2024, 8:56am

Can you have a look at your .bashrc and .bash_profile on both the local and remote systems? This is most likely caused by an echo statement in one of these places.
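One quick way to test for stray profile output is to count the bytes a do-nothing command writes to stdout. This is a minimal sketch: stdout_bytes is a made-up helper name, and <server> is a placeholder for the workflow host.

```shell
# Print the number of bytes a command writes to stdout (stderr is ignored).
# A clean shell setup should give 0 for a command like `true`.
stdout_bytes() {
    "$@" 2>/dev/null | wc -c | tr -d ' '
}

# Local sanity checks:
#   stdout_bytes true      # nothing is written
#   stdout_bytes echo hi   # "hi" plus a newline
#
# Against the remote host (placeholder name, run by hand):
#   stdout_bytes ssh <server> true
#   stdout_bytes ssh <server> bash --login -c true   # also sources login files
```

Anything other than 0 from the ssh calls suggests that something in the shell start-up files is printing to stdout.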

oliver.sanders · October 29, 2024, 9:44am

For context…

If a workflow is not contactable (e.g. if it was killed by a sysadmin), Cylc will ssh to the server where the workflow was running to check whether it is still running.

This check returns a JSON message; however, in your case it would appear that the JSON message was corrupted. As Tim suggested, this can happen if something in your shell profile files is writing to stdout (e.g. echo statements).

Because of this, Cylc cannot tell whether the workflow is still running (but uncontactable, e.g. due to network issues) or not. For safety reasons, Cylc will not restart the workflow until it can confirm that the original process has died (otherwise you could end up with two copies of the workflow fighting over the same resources).

You can override this behaviour by deleting the workflow’s “contact” file, i.e. rm ~/cylc-run/<workflow-id>/.service/contact. However, you should not need to do this under normal circumstances.
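Before removing the contact file, it may be worth checking by hand that the original process really has gone. The sketch below assumes the contact file is a list of KEY=VALUE lines and that it includes CYLC_WORKFLOW_HOST and CYLC_WORKFLOW_PID fields; those field names are from memory, so check them against your own file. contact_field is a made-up helper name.

```shell
# Print the value of one KEY=VALUE field from a contact file.
#   $1: path to the contact file
#   $2: field name
contact_field() {
    sed -n "s/^$2=//p" "$1"
}

# Manual check (placeholders, run by hand):
#   contact=~/cylc-run/<workflow-id>/.service/contact
#   host=$(contact_field "$contact" CYLC_WORKFLOW_HOST)   # assumed field name
#   pid=$(contact_field "$contact" CYLC_WORKFLOW_PID)     # assumed field name
#   ssh "$host" ps -p "$pid"
# If ps lists no process, the scheduler is no longer running there.
```

Note that a PID can be reused by an unrelated process, so a match from ps is only a hint that something is still running.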

deitrr · October 29, 2024, 4:32pm

Thanks @wxtim and @oliver.sanders, this is very helpful!

Is there an easy way to see what is being returned in that corrupted JSON message? I don’t see anything obvious in my shell profile files, but we do have a somewhat unorthodox HPC system here at Environment Canada, with a complicated environment setup. I might be able to locate the offending command if I could see the output. I would like to know where the issue is coming from to prevent it in the future.

Anyway, I can confirm that deleting the contact file allows the workflow to restart.

wxtim · October 29, 2024, 4:48pm

This line should have been followed by whatever could not be parsed:

So my guess is that nothing was returned, which is definitely a JSON error. Why nothing was returned… I’m not so sure. Could you try:

ssh remote
cylc version --long
cylc psutil --help

oliver.sanders · October 29, 2024, 5:08pm

So my guess is that nothing was returned, which is definitely a JSON error.

Yes, I think you’re right: the command ran and exited 0 (success state), but no text was written to stdout:

Things to try:

1. Try to run the workflow in --debug mode, this should reveal the exact cylc psutil command being run (and its output I think).
2. Test the cylc psutil command, e.g. try running ssh <server> cylc psutil <<< '[["Process", 1]]', it should write some JSON to stdout. The exact output isn’t important, we just need to know that it is outputting JSON to stdout.
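To make the "is it JSON?" check mechanical, the command's stdout can be piped through a JSON parser. This is a rough sketch that assumes python3 is available where you run it; emits_json is a made-up helper name, not part of Cylc.

```shell
# Succeed only if the command's stdout is a single valid JSON document.
# Empty output, or stray text before the JSON, makes this fail.
emits_json() {
    "$@" 2>/dev/null | python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}

# Local examples:
#   emits_json echo '{"ok": true}' && echo "valid JSON"
#   emits_json true || echo "no JSON (empty stdout)"
#
# Against the remote host (placeholder name, bash here-string, run by hand):
#   emits_json ssh <server> cylc psutil <<< '[["Process", 1]]' && echo "valid JSON"
```

The function's stdin is passed through to the command, so the here-string still reaches cylc psutil.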

deitrr · October 29, 2024, 7:10pm

Thanks again @wxtim and @oliver.sanders!

For @wxtim’s suggestion, here is the output:

The above all looks sensible to me–not sure if that tells you anything.

For @oliver.sanders’s suggestion 2, I need to run Cylc in a virtual environment, so I need to add the activation to this command, like below:

That does produce JSON formatted output. But actually, could the virtual environment be the source of the problem? I am not clear on whether the Cylc environment needs to be activated on the remote machine (i.e., “compute nodes”, where PBS runs the job). Although, Cylc tasks seem to mostly be working without explicitly activating the environment. Could it be that the tasks within the workflow don’t require it (because they don’t execute any Cylc CLI commands) and so run just fine, but this error on restart occurs because the GUI is trying to execute cylc psutil?

I haven’t tried suggestion 1 yet. After deleting the contact file you mentioned before, I no longer have a “broken” workflow. I can try with a new install, if you think it will reveal anything helpful.

oliver.sanders · October 30, 2024, 9:19am

I am not clear on whether the Cylc environment needs to be activated on the remote machine (i.e., “compute nodes”, where PBS runs the job).

Cylc is a distributed system. It needs to be able to run bits of itself in different locations around the network, e.g. to start the workflow on a Cylc server or submit jobs to a batch system. As a result, although Cylc can be installed in a virtual environment (and we recommend this), the cylc command needs to be in the default $PATH in order for Cylc to function correctly. Otherwise, a command invoked on another host will not inherit the virtual environment and will fail.

The way we recommend doing this is using a “wrapper script” which we provide with the Cylc installation. This is a simple Bash script called “cylc” which you install somewhere in your default $PATH so that it is always there. This script locates the virtual environment where you have installed Cylc and invokes the Cylc command in that environment (which we can do without actually activating the environment itself).

Details on the installation page:

https://cylc.github.io/cylc-doc/stable/html/installation.html#managing-environments

We recommend copying this wrapper script as “rose” and “isodatetime” to provide entry points for these commands too.
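To illustrate the idea only, here is a toy stand-in for such a wrapper. It is not the script shipped with Cylc (use that one in practice); make_wrapper and both path arguments are hypothetical. The generated script runs the command of the same name from the environment's bin directory without activating the environment.

```shell
# Write a tiny wrapper script to $2 that runs the same-named command
# from the environment at $1.
make_wrapper() {
    env_dir=$1
    dest=$2
    cat > "$dest" <<EOF
#!/usr/bin/env bash
# Illustrative stand-in only; not the wrapper provided with Cylc.
exec "$env_dir/bin/\$(basename "\$0")" "\$@"
EOF
    chmod +x "$dest"
}

# Example with hypothetical paths (the destination must be on the default PATH):
#   make_wrapper "$HOME/miniconda3/envs/cylc_env" /usr/local/bin/cylc
```

Because the wrapper dispatches on its own file name, a copy of it saved as “rose” or “isodatetime” would run those commands from the same environment.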

deitrr · October 30, 2024, 6:12pm

Oh great! Thanks @oliver.sanders, apologies that I missed this important step in the installation. This is probably the issue, I would guess. I’ll fix this and follow up with you if the JSON error reappears for some reason.
