Run1, run2, ..., runK, ..., runN

cylc8.
Recently, the node on which the workflow was running was rebooted.
I'm not sure whether the user ran ‘cylc vip .’ again or not.
cylc scan indicated the workflow was running.
cylc tui, however, after ‘Loading’, showed only the user's name.

The user finally stopped the workflow and started over, deleting cylc-run/workflow and using ‘cylc vip .’.

A discussion ensued:
the user is annoyed with all the run1, run2, … directories (many runK)
and wanted to know whether they are all necessary.

Given that system folks make (unadvertised) changes, workflows fail, and users restart them, sometimes creating clutter, I am asking the forum for advice to pass on to the user.

Hi @schaferk,

You only get a new numbered run directory if you install a new copy of the workflow, to start a new run from scratch.

After a reboot, or whatever, that kills a workflow, you should typically restart the same run in-place, which does not create a new run directory.
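As a rough illustration of that rule of thumb, the sketch below prints which command to run for a given workflow ID: `cylc play` if the run directory is already installed, `cylc vip .` otherwise. The function name and the `CYLC_RUN_ROOT` variable are made up for this example (Cylc itself does not read that variable), and the sketch assumes run directories live under `~/cylc-run`.

```shell
# Hypothetical helper, not part of Cylc: suggest how to (re)start a workflow.
# CYLC_RUN_ROOT is an invented override for this sketch; the default
# assumes the usual ~/cylc-run location.
suggest_command() {
    workflow_id="$1"    # e.g. WORKFLOW/run1
    run_root="${CYLC_RUN_ROOT:-$HOME/cylc-run}"
    if [ -d "$run_root/$workflow_id" ]; then
        # Already installed: restart the same run in place (no new runK).
        echo "cylc play $workflow_id"
    else
        # Nothing installed under that ID: validate, install and play
        # from the source directory, which creates a new numbered run.
        echo "cylc vip ."
    fi
}
```

It only prints the suggestion, so the user still decides whether a restart or a fresh run is what they want.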

The reasons for numbered run directories are:

  • Safety: it is dangerous to start a new run from scratch inside an existing run directory
    • (It will rerun all the same tasks, which have already altered the run directory).
    • (It will destroy your workflow state, if you actually intended a restart!)
  • Convenience: if starting a new run from scratch, no need to think of a naming strategy
    • (Although you can provide run names if you prefer - see cylc install --help).

So users will only end up with a “clutter” of numbered runs if they deliberately install many new copies of the workflow to do new runs from scratch. Reasons for this could include:

  • Actually needing to do many separate runs of your workflow.
    • (Then numbered runs are good - they automatically and cleanly separate each run).
  • Repeatedly running from scratch to get things working during early workflow development.
    • Numbered runs in this case may be unnecessary, so just clean them up.

Make sure your users are aware of cylc clean, for removing old run directories.
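For users who want to review what could be cleaned, here is a minimal sketch that lists the numbered runs of a workflow, apart from the one the `runN` symlink points to, as `cylc clean` commands. It prints the commands rather than running them. The function name and `CYLC_RUN_ROOT` are invented for this example, and it assumes the default `~/cylc-run` layout with a `runN` link to the latest run.

```shell
# Hypothetical helper, not part of Cylc: print a "cylc clean" command for
# every numbered run of a workflow except the latest one.
# CYLC_RUN_ROOT is an invented override for this sketch.
list_clean_candidates() {
    workflow="$1"    # e.g. WORKFLOW (without the runK part)
    run_root="${CYLC_RUN_ROOT:-$HOME/cylc-run}"
    latest=""
    if [ -L "$run_root/$workflow/runN" ]; then
        # Assumes runN is a symlink to the most recently installed run.
        latest="$(basename "$(readlink "$run_root/$workflow/runN")")"
    fi
    for dir in "$run_root/$workflow"/run[0-9]*; do
        [ -d "$dir" ] || continue            # glob matched nothing
        run="$(basename "$dir")"
        [ "$run" = "$latest" ] && continue   # keep the latest run
        echo "cylc clean $workflow/$run"
    done
}
```

Review the printed list before running anything from it, since an older run may still hold output that someone needs.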

See:

  • cylc install --help - install a new run from source
  • cylc reinstall --help - update an installed run from source
  • cylc vip --help - (validate and) install and play a new run
  • cylc play --help - play an installed run
  • cylc clean --help - remove installed runs

Thank you, H., for the thoughtful reply.

Recall that the node was rebooted and the user promptly issued cylc vip . to create run2.
We now have two workflows, WORKFLOW/run1 and WORKFLOW/run2, both running according to cylc scan.
cylc tui for run2 shows the cycle points et al., but for run1 it shows only ‘Loading’ and the user name.

However, when ‘stop’ is attempted (cylc stop WORKFLOW/run1, with variants --now, --kill and --debug), we get:

cylc.flow.exceptions.CylcError: Cannot determine whether workflow is running on X.Y.Z.W
/home/.conda/envs/cylc8/bin/python /home/.conda/envs/cylc8/bin/cylc vip .

Here is a DEBUG excerpt:

DEBUG - Reading file /home/.cylc/global.cylc
WARNING - $ ssh -oBatchMode=yes -oConnectTimeout=10 X.Y.Z.W env CYLC_VERSION=8.4.2 CYLC_CONF_PATH=/home/.cylc
bash --login -c 'exec "$0" "$@"' cylc psutil # returned 126

In run1/.service I find:
SCHEDULER_SSH_COMMAND=ssh -oBatchMode=yes -oConnectTimeout=10

How do I clean this up without just wiping and starting over?

OK. I can answer my own question.
The Cylc 8.4.2 documentation (Start, Restart, Reload) states that the contact file gets removed automatically at shutdown (assuming the scheduler shuts down cleanly). I don't know if the reboot qualifies, so I removed the contact file and all is well.

That’s right, a reboot or (e.g.) kill -9 <scheduler-pid> will leave a contact file behind that makes it look to cylc scan as if the workflow is still running.

However, client commands (including cylc scan -t rich) that need to connect to the scheduler will delete the orphaned contact file if they find that the workflow is not actually running.
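To make the idea of an orphaned contact file concrete, here is a simplified local check. It is not how Cylc does it (the DEBUG output above shows Cylc running `cylc psutil` on the scheduler host); it only tests whether the PID recorded in the contact file is alive on the current machine. The function name is invented, and the key name `CYLC_WORKFLOW_PID` is an assumption to verify against your own `.service/contact` file.

```shell
# Hypothetical helper, not part of Cylc: classify a contact file.
# Only meaningful when run on the host where the scheduler was started.
contact_status() {
    contact="$1"    # path to <run-dir>/.service/contact
    if [ ! -f "$contact" ]; then
        echo "no-contact-file"
        return 0
    fi
    # Assumed KEY=VALUE format with a CYLC_WORKFLOW_PID entry.
    pid="$(sed -n 's/^CYLC_WORKFLOW_PID=//p' "$contact" | head -n 1)"
    # kill -0 sends no signal; it only tests that the process exists and
    # that we may signal it. After a reboot the PID could have been
    # reused by an unrelated process, so "process-alive" is only a hint.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process-alive"
    else
        echo "stale"
    fi
}
```

A "stale" result matches the situation described in this thread, where removing the leftover contact file by hand resolved the problem.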

cylc.flow.exceptions.CylcError: Cannot determine whether workflow is running on X.Y.Z.W
/home/.conda/envs/cylc8/bin/python /home/.conda/envs/cylc8/bin/cylc vip

This indicates that the client command (cylc stop in this case) cannot connect to the host that the contact file says the workflow is running on. So it can’t stop the workflow if it is running, or determine that it isn’t running and hence safely delete the contact file.

You might want to figure out why that is happening.
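One possible starting point is to rerun the ssh command from the DEBUG output by hand and look at its exit status. The sketch below just maps a few well-known exit codes to their usual meaning: 126 and 127 are shell conventions, and 255 is what ssh returns for its own errors. The function name is made up for illustration.

```shell
# Hypothetical helper: describe the exit status of "ssh HOST COMMAND".
explain_rc() {
    case "$1" in
        0)   echo "command ran successfully" ;;
        126) echo "command was found on the remote host but could not be executed" ;;
        127) echo "command was not found on the remote host" ;;
        255) echo "ssh itself failed (connection, authentication or timeout)" ;;
        *)   echo "remote command exited with status $1" ;;
    esac
}

# Example use, with the host placeholder from the log above:
#   ssh -oBatchMode=yes -oConnectTimeout=10 X.Y.Z.W true; explain_rc $?
```

The 126 reported in this thread would point at something on the remote side that exists but cannot be executed, for example a permissions problem, though that is only a guess from the exit code.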