I’ve just run into an issue with stopping/restarting a workflow in which some tasks failed. I started the workflow on Friday last week and two tasks failed over the weekend due to separate issues: one because I called the wrong script within the task; the other was killed by the sysadmin because I requested too much memory from the job runner (all the tasks in this workflow are using PBS as the job runner).
When I viewed the tree in the GUI this morning, the first task was red (as expected), while the second (killed by sysadmin) was light green/submitted. The workflow was still playing according to the GUI. When I pressed the stop button, I got an error. After closing and reopening the GUI, it is now showing the workflow as stopped, and when I press play I get this error:
WARNING - Could not parse JSON:
ERROR - Cannot determine whether workflow is running on 10.102.15.227.
/home/rrd001/miniconda3/envs/cylc_env/bin/python /home/rrd001/miniconda3/envs/cylc_env/bin/cylc play imsi-cylc-cap1
CRITICAL - Cannot tell if the workflow is running
Note, Cylc 8 cannot restart Cylc 7 workflows.
This is, of course, after fixing the mistakes that led to the two tasks failing and reinstalling the workflow. Unfortunately, I didn’t copy the first error that occurred when I pressed stop while the GUI was still showing the workflow as playing, but I think it was the same JSON warning followed by “Cannot tell if the workflow is running”.
I am able to play other workflows and I can ssh into 10.102.15.227. Is there some kind of cache file that Cylc is reading that got mangled when the job was killed? Or where might the JSON file be that Cylc is trying to parse?
Just an update: I’ve run into this again on another workflow. That one failed in the traditional way, i.e., it wasn’t killed by the sysadmin, so that is probably a red herring. More likely, then, I’ve got some global issue with Cylc in my local settings/install, or there is some system/machine issue that has nothing to do with Cylc.
Can you have a look at your .bashrc and .bash_profile on both the local and remote systems? This is most likely caused by an echo statement in one of these places.
If a workflow is not contactable (e.g. if it was killed by a sysadmin), Cylc will ssh to the server where the workflow was running to check whether it is still running.
This check returns a JSON message. In your case, however, it would appear that the JSON message was corrupted. As Tim suggested, this can happen if something in your shell profile files is writing to stdout (e.g. echo statements).
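To illustrate why stray stdout matters, here is a minimal sketch of what happens when a JSON reply has extra text in front of it, or is empty. This is not Cylc's own code; `parse_reply` and the sample replies are made up for illustration.

```python
import json


def parse_reply(stdout_text):
    """Return (parsed, error): the parsed JSON and None, or None and the reason."""
    try:
        return json.loads(stdout_text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


# Hypothetical replies, invented for this example only.
clean = '[{"pid": 1}]'
polluted = "Welcome to the cluster!\n" + clean  # a profile echo lands first
empty = ""  # the command exited 0 but wrote nothing

for label, text in [("clean", clean), ("polluted", polluted), ("empty", empty)]:
    parsed, error = parse_reply(text)
    print(label, "->", "ok" if error is None else f"not JSON ({error})")
```

Only the clean reply parses. Both the polluted and the empty reply are rejected, which is consistent with a "Could not parse JSON" style of warning.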
Because of this, Cylc cannot tell whether the workflow is still running (but uncontactable, e.g. due to network issues) or not. For safety reasons, Cylc will not restart the workflow until it can confirm that the original process has died (otherwise you could end up with two copies of the workflow fighting over the same resources).
You can override this behaviour by deleting the workflow’s “contact” file, i.e. rm ~/cylc-run/<workflow-id>/.service/contact. However, you should not need to do this under normal circumstances.
Is there an easy way to see what is being returned in that corrupted JSON message? I don’t see anything obvious in my shell profile files, but we do have a somewhat unorthodox HPC system here at Environment Canada, with a complicated environment setup. I might be able to locate the offending command if I could see the output. I would like to know where the issue is coming from to prevent it in the future.
Anyway, I can confirm that deleting the contact file allows the workflow to restart.
So my guess is that nothing was returned, which is definitely a JSON error.
Yes, I think you’re right, the command ran, and exited 0 (success state), but no text was written to stdout:
Things to try:
Try to run the workflow in --debug mode. This should reveal the exact cylc psutil command being run (and its output, I think).
Test the cylc psutil command, e.g. try running ssh <server> cylc psutil <<< '[["Process", 1]]'. It should write some JSON to stdout. The exact output isn’t important; we just need to know that it is outputting JSON to stdout.
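If it helps to see exactly what comes back, the second check can be scripted. The sketch below runs a command, feeds it the same stdin as above, and reports whether stdout is valid JSON along with the raw text. `check_json_stdout` is a hypothetical helper, not part of Cylc, and `<server>` is still a placeholder for your own host.

```python
import json
import subprocess


def check_json_stdout(command, stdin_text):
    """Run command with stdin_text on stdin; report whether stdout is valid JSON."""
    proc = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=False,  # we want to inspect failures, not raise on them
    )
    try:
        json.loads(proc.stdout)
        is_json = True
    except json.JSONDecodeError:
        is_json = False
    return {
        "returncode": proc.returncode,
        "is_json": is_json,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Example use (replace <server> with the host from the error message):
# result = check_json_stdout(["ssh", "<server>", "cylc", "psutil"], '[["Process", 1]]')
# print(result["returncode"], result["is_json"], repr(result["stdout"]))
```

Printing the stdout with repr() makes an empty reply or leading profile output easy to spot.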
The above all looks sensible to me, though I’m not sure whether that tells you anything.
@oliver.sanders’s suggestion: For suggestion 2, I need to run Cylc in a virtual environment so I need to add the activation to this command, like below:
That does produce JSON formatted output. But actually, could the virtual environment be the source of the problem? I am not clear on whether the Cylc environment needs to be activated on the remote machine (i.e., “compute nodes”, where PBS runs the job). Although, Cylc tasks seem to mostly be working without explicitly activating the environment. Could it be that the tasks within the workflow don’t require it (because they don’t execute any Cylc CLI commands) and so run just fine, but this error on restart occurs because the GUI is trying to execute cylc psutil?
I haven’t tried suggestion 1 yet. After deleting the contact file you mentioned before, I no longer have a “broken” workflow. I can try with a new install if you think it will reveal anything helpful.
I am not clear on whether the Cylc environment needs to be activated on the remote machine (i.e., “compute nodes”, where PBS runs the job).
Cylc is a distributed system. It needs to be able to run bits of itself in different locations around the network, e.g. to start the workflow on a Cylc server or submit jobs to a batch system. Cylc can be installed in a virtual environment (and we recommend this). However, the cylc command needs to be in the default $PATH in order for Cylc to function correctly. Otherwise, a command invoked on another host will not inherit the virtual environment and will fail.
The way we recommend doing this is using a “wrapper script” which we provide with the Cylc installation. This is a simple Bash script called “cylc” which you install somewhere in your default $PATH so that it is always there. This script locates the virtual environment where you have installed Cylc and invokes the Cylc command in that environment (which we can do without actually activating the environment itself).
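A quick way to confirm whether cylc is visible without the environment activated is to look it up on the PATH of a fresh shell. The sketch below only uses Python's standard library; `find_cylc` is a hypothetical helper name, and it says nothing about how Cylc itself locates the command.

```python
import shutil


def find_cylc(path_value=None):
    """Return the full path of `cylc` found on path_value, or None if absent.

    With path_value=None, shutil.which searches the current process's PATH.
    """
    return shutil.which("cylc", path=path_value)


location = find_cylc()
print("cylc found at:", location if location else "not on PATH")
```

Run it from a shell where the conda environment has not been activated. If it reports "not on PATH", commands invoked over ssh are likely to fail in the same way.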
Oh great! Thanks @oliver.sanders, and apologies that I missed this important step in the installation. This is probably the issue, I would guess. I’ll fix this and follow up with you if the JSON error reappears for some reason.