`cylc install` takes a long time and produces lots of output

I’ve been having a problem recently where cylc install takes a very long time. I get output like this:

$ cylc install       
INSTALLED sami3_assim-cylc/run249 from /p/home/jhaiduce/sami3_assim-cylc/cylc/sami3_assim-cylc
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Cannot determine whether workflow is running on mustang03.ib0.icexa.afrl.hpc.mil.
    /p/home/jhaiduce/sami3_assim-cylc/build/venv/bin/python3 /p/home/jhaiduce/sami3_assim-
    cylc/build/venv/bin/cylc play sami3_assim-cylc
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Cannot determine whether workflow is running on mustang07.ib0.icexa.afrl.hpc.mil.
    /p/home/jhaiduce/sami3_assim-cylc/build/venv/bin/python3 /p/home/jhaiduce/sami3_assim-
    cylc/build/venv/bin/cylc play sami3_assim-cylc
WARNING - Cannot determine whether workflow is running on mustang04.ib0.icexa.afrl.hpc.mil.
    /p/home/jhaiduce/sami3_assim-cylc/build/venv/bin/python3 /p/home/jhaiduce/sami3_assim-
    cylc/build/venv/bin/cylc play sami3_assim-cylc/runN
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Cannot determine whether workflow is running on mustang04.ib0.icexa.afrl.hpc.mil.
    /p/home/jhaiduce/sami3_assim-cylc/build/venv/bin/python3 /p/home/jhaiduce/sami3_assim-
    cylc/build/venv/bin/cylc play sami3_assim-cylc
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Cannot determine whether workflow is running on mustang08.ib0.icexa.afrl.hpc.mil.
    /p/home/jhaiduce/sami3_assim-cylc/build/venv/bin/python3 /p/home/jhaiduce/sami3_assim-
    cylc/build/venv/bin/cylc play sami3_assim-cylc
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.
WARNING - Found contact file with incomplete data:
    'CYLC_WORKFLOW_PID'.

It looks like cylc is going through a lot of old runs when installing. I don’t recall this happening with previous versions (I first saw this some time after upgrading from cylc 8.0.1 to 8.2.0). Is there anything I can do to speed up execution of cylc install?

This looks like a feature introduced in Cylc 8.1.0: "Scan workflow name during install" (pull request #5184 by hjoliver, cylc/cylc-flow on GitHub).

I guess we should probably avoid this scan if there are more than a certain number of existing installed run dirs.

P.S. do you really have 249 run dirs still installed for that workflow? :hushed:

Yes, I do. I guess that’s not typical? Do people usually archive their old run dirs somehow?

We would recommend deleting old runs that are no longer needed using cylc clean. E.g.

cylc clean sami3_assim-cylc/run1

You can use glob patterns (make sure to quote the argument) e.g.

cylc clean "sami3_assim-cylc/run1[0-9][0-9]"

should clean run100 through run199, I think.
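Since cylc clean deletes whatever the pattern selects, it can be worth previewing a glob before using it. Here is a minimal sketch, assuming the patterns behave like Python's fnmatch-style globs (an assumption on my part; cylc clean --help is the authority). The `preview_matches` helper and the `runs` listing are made up for illustration.

```python
from fnmatch import fnmatchcase


def preview_matches(run_names, pattern):
    """Return the run names that an fnmatch-style glob pattern selects."""
    return [name for name in run_names if fnmatchcase(name, pattern)]


# Hypothetical listing: run1 .. run249, mirroring the run numbers in this thread.
runs = [f"run{n}" for n in range(1, 250)]

# "[0-9]" matches exactly one digit, so this selects run100 .. run199.
hundreds = preview_matches(runs, "run1[0-9][0-9]")

# "?" matches exactly one character (it does not make anything optional),
# so this selects only the single-digit runs run1 .. run9.
single_digit = preview_matches(runs, "run?")

print(len(hundreds), hundreds[0], hundreds[-1])
print(single_digit)
```

In practice the list of names would come from the workflow's directory under cylc-run rather than from a generated range.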

It sounds like cylc clean actually deletes the runs and their output, so if I want to keep a copy for later analysis I should do that before running cylc clean, correct? Or if I want to keep them but not have them impact cylc performance is it better to use the mv command to move them out of the cylc run directory so cylc doesn’t see them (since that would be faster than doing a copy)?

Yes, if you want to keep them it is probably a good idea to move them out of cylc-run. We do have plans to add an archiving option to cylc clean in future, but we don’t have a timescale for this at present (see "File housekeeping utility", issue #1159 in cylc/cylc-flow on GitHub).
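Moving a finished run out of cylc-run could be scripted roughly as follows. This is only a sketch: `archive_run` and the archive layout are hypothetical, and it assumes a running scheduler keeps its contact file at `.service/contact` inside the run directory, so the presence of that file is treated as a reason not to move anything.

```python
import shutil
from pathlib import Path


def archive_run(run_dir, archive_root):
    """Move a stopped run directory out of cylc-run and return its new path."""
    run_dir = Path(run_dir)
    # Assumption: a contact file at <run_dir>/.service/contact means the
    # scheduler may still be running (or was killed), so refuse to move.
    if (run_dir / ".service" / "contact").exists():
        raise RuntimeError(f"{run_dir} has a contact file; is it still running?")
    # Keep the workflow name in the archive path: <archive>/<workflow>/<run>.
    dest = Path(archive_root) / run_dir.parent.name / run_dir.name
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(run_dir), str(dest))
    return dest
```

When source and destination are on the same filesystem, `shutil.move` is a rename and is close to instant. Across filesystems it falls back to copying and then deleting, so it will take as long as a copy.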

I’ll do that, thanks!

@jhaiduce -

cylc install scans for running workflows so that it can tell you if you already have instances of the same workflow running (sometimes people forget).

Typically, in real runs there will be tasks that archive important generated data to another location as the workflow runs. If the run is finished and the run directory no longer needed, clean it up with cylc clean.

If you want to keep a run directory for analysis, as you say, it can be convenient to leave it under cylc-run so that you can continue to use cylc log, and the GUI, to view scheduler and job logs. But be aware that Cylc has to consider any valid run directory as a stopped workflow that could in principle be restarted.

If you’re just repeatedly doing test runs during development, say, you might as well routinely tidy up with cylc clean between runs.

The workflow scanning process will be slowed down by orphaned or corrupted contact files. On normal shutdown, a contact file (which stores host, port, PID, etc. for a running scheduler) will be removed automatically.

If anything prevents that clean-up, the scan process will time out trying to contact the no-longer-running scheduler. If it can determine that the scheduler is no longer running (by PID on the host) it will delete the contact file, making the scan much faster next time… unless the contact file is corrupted in such a way that it is impossible to verify whether the scheduler is running, in which case for safety reasons the contact file will not be deleted. It looks like this is happening in your case. I don’t know how you would end up with incomplete PID info. But whatever the cause, you should find all contact files under cylc-run and delete them if the corresponding workflows are not running.
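To find the contact files in question, something like the read-only sketch below could help. It assumes each contact file lives at `.service/contact` under a run directory and holds plain `KEY=VALUE` lines. `CYLC_WORKFLOW_PID` is the key named in the warnings above; `CYLC_WORKFLOW_HOST` and the helper names are my assumptions. It only reports and never deletes.

```python
import os
from pathlib import Path

# CYLC_WORKFLOW_PID appears in the warnings; CYLC_WORKFLOW_HOST is assumed.
REQUIRED_KEYS = ("CYLC_WORKFLOW_HOST", "CYLC_WORKFLOW_PID")


def parse_contact(path):
    """Parse KEY=VALUE lines from a contact file into a dict."""
    data = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data


def list_contact_files(cylc_run):
    """Return sorted [(contact_path, missing_keys)] for all contact files."""
    found = []
    # os.walk does not follow symlinks by default, so a link such as runN
    # that points at another run directory is not counted twice.
    for dirpath, _dirnames, filenames in os.walk(cylc_run):
        if Path(dirpath).name == ".service" and "contact" in filenames:
            path = Path(dirpath) / "contact"
            data = parse_contact(path)
            missing = [key for key in REQUIRED_KEYS if not data.get(key)]
            found.append((path, missing))
    return sorted(found)
```

Calling `list_contact_files(Path.home() / "cylc-run")` and printing each pair gives a list to go through by hand. Entries with a non-empty `missing_keys` are the ones the scan cannot clean up for itself.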

Most of my runs end without a normal shutdown…often they get stuck when ssh commands fail, usually due to expired authentication tokens but sometimes for some other reason (login node on the machine not responding for instance). I regularly end up killing the scheduler process with kill because cylc stop doesn’t seem to shut down the flow in a timely manner even when I pass the --kill and --now --now options.

It happens to me pretty often, apparently. Maybe related to the aforementioned lack of normal shutdowns?

That shouldn’t happen. Have you tried stop-now under normal circumstances, or only after network issues etc.? We occasionally see schedulers become unresponsive after filesystem problems (e.g. disk quota exceeded). In that case the only solution is to kill it and restart (which is safe - the scheduler will start up where it left off in the workflow). But you should clean up the contact file manually after such a kill, if you’re not just going to restart immediately.
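A cautious way to do that manual clean-up after a kill might look like the sketch below. It is not how Cylc itself decides. It assumes `KEY=VALUE` contact files with `CYLC_WORKFLOW_HOST` and `CYLC_WORKFLOW_PID` keys, a POSIX system, and that you are logged in to the host the scheduler ran on. The function names are hypothetical. It removes a file only when the recorded PID is provably absent on this host, and it defaults to a dry run.

```python
import os
import socket
from pathlib import Path


def pid_is_running(pid):
    """Return True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(contact_path, local_host=None, dry_run=True):
    """Delete a contact file only if its scheduler is provably gone.

    Returns "removed", "would remove", or a string starting with "kept".
    """
    path = Path(contact_path)
    data = {}
    for line in path.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    # Assumption: the host name is recorded in the same form as
    # socket.getfqdn() reports; pass local_host explicitly if it is not.
    local_host = local_host or socket.getfqdn()
    if data.get("CYLC_WORKFLOW_HOST") != local_host:
        return "kept: scheduler host is not this host"
    pid = data.get("CYLC_WORKFLOW_PID", "")
    if not pid.isdigit():
        return "kept: no usable PID, check the host by hand"
    if pid_is_running(int(pid)):
        return "kept: PID is still running"
    if dry_run:
        return "would remove"
    path.unlink()
    return "removed"
```

A live PID may belong to an unrelated process that reused the number, so "kept" is the conservative answer there. Files with no PID at all, as in the warnings earlier in the thread, still need checking by hand on the named host.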

Most of the times I’ve used cylc stop were on a flow that was having trouble of some sort. I’m not totally sure I’ve tried it under normal circumstances before. It did work for me just now on a flow that was running normally.

Sounds like I need to go delete a bunch of contact files…

After deleting all the stale contact files, the cylc install is dramatically faster and produces a lot less output:

$ cylc install
INSTALLED sami3_assim-cylc/run250 from /p/home/jhaiduce/sami3_assim-cylc/cylc/sami3_assim-cylc
NOTE: 1 run of "sami3_assim-cylc" is already active:
  ◧ sami3_assim-cylc/run249 mustang10.ib0.icexa.afrl.hpc.mil:43050 80332
You can stop it with:
  cylc stop sami3_assim-cylc/run249
See "cylc stop --help" for options.

Thanks @MetRonnie and @hilary.j.oliver!
