(Actually that must be the case, since module
has to configure your current environment, not that of a subshell).
Hello Oliver,
Please excuse the frustrated tone of my messages here. As you can tell, I’ve not found it easy to adjust to Cylc 8. My comment on the documentation is that it appears to be written with Cylc developers in mind, as a reference guide. I, as a user, find it hard to understand the new concepts, like platforms. While this is something I have figured out now (through trial and error), here’s an example of how the documentation could be improved. I would have found it easier to see the questions a new user might have addressed directly, such as “What are the platform definitions for?”, “In which scenarios would I need to use them?” and “When is a platform group useful rather than a simple platform?”. You see - the examples are missing: a user has a setup like XYZ, hence they’d use this config to make it work on that setup; then another example for a different setup.
I could help improve those parts of the documentation that I found difficult to understand, and add simplified examples, once I have solved my case here.
Regarding your comment
I don’t think this is incorrect; my Slurm job scripts use relative paths too. Cylc should run this command from the $HOME directory too. Are you changing directory in any of your shell startup files?
I found this surprising. My experience with Cylc is that tasks are run with their work directory as the CWD, and I confirmed this by adding the lines:
[[root]]
    pre-script = """
        echo -n "PWD=" && pwd
        ...
and the log file indeed reports the working directory, e.g.
Workflow : downloader_suite_cl4/run12
Job : 20230818T0000Z/build_uscc_engine/01 (try 1)
User@Host: cylc@metoc-cl4
2023-09-05T19:34:50Z INFO - started
PWD=/home/cylc/cylc-run/downloader_suite_cl4/run12/work/20230818T0000Z/build_uscc_engine
And I thought that was a feature of Cylc - each job has its work directory as its current working directory.
So I don’t know how to solve the issue with the paths of the job.out, job.err files.
I confirmed I am not changing directory in any of my shell startup files by running this on the Cylc host:
$ ssh cylc@metoc-cl4 pwd
/home/cylc
metoc-cl4 is the job host (cluster headnode). So I am at a loss. It looks like it’s nothing to do with Cylc itself, but it is Cylc that creates the job script and uses relative paths in it. Could that be changed? Could I change it myself by fixing the code? Being able to see the job.out/job.err files seems pretty fundamental, especially as I’m still in the debugging phase trying to get to grips with Cylc 8.
Thanks
Fred
Documentation is always a work in progress, but I hope you don’t mind if we push back a little, just to make sure you have found the relevant existing documentation.
The Cylc 8 migration guide gives a short explanation of the new platforms concept:
https://cylc.github.io/cylc-doc/stable/html/7-to-8/major-changes/platforms.html#what-is-a-platform
A “platform” represents one or more hosts from which jobs can be submitted to or polled from a common job submission system.
If a platform has multiple hosts, Cylc will automatically select a host when needed and will fall back to another host if the first is not contactable.
In terms of job management, a platform is a set of hosts, any of which Cylc can use to interact with jobs. This is more robust and reliable than Cylc 7, which only knew about single “job hosts”.
Additionally, Cylc 8, unlike Cylc 7, automatically handles installation of workflow files to the run directory on the scheduler and job hosts. That requires the concept of a platform “install target”, which is the single host on a shared filesystem to use for file installation.
(Maybe the documentation should include those slightly more detailed explanations … I’ll take a look)
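For illustration, a platform definition in the global configuration (global.cylc) might look something like this - the platform, host and group names here are purely hypothetical:
[platforms]
    [[my_cluster]]
        # hosts Cylc can use to submit to / poll the cluster's job runner
        hosts = login1, login2
        job runner = slurm
        # hosts that share a filesystem share one install target
        install target = my_cluster
[platform groups]
    [[all_clusters]]
        # Cylc picks one of these platforms at job submission time
        platforms = my_cluster, my_other_cluster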
There is also the main platform configuration guide, Platform Configuration — Cylc 8.2.2 documentation,
with sub-sections:
- What Are Platforms?
- Why Were Platforms Introduced?
- What Are Install Targets?
- Example Platform Configurations
Did you see those?
That is certainly not the intention, but the documentation is admittedly written by the developers (we’re open to others contributing too though!). And it has to cater to all of the following types of reader:
- individuals who want to install and use Cylc on their own personal machines
- admin people who need to install and configure multiple Cylc versions for many users at a complex HPC site
- users who need to configure and run their own workflows at such a site
Platforms configuration is not really something that “normal users” should have to do, although they can if it hasn’t been done centrally for everyone at the site.
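For example, in a workflow a user then only needs to refer to a platform (or platform group) by name - assuming the hypothetical my_cluster platform above, something like:
[runtime]
    [[my_task]]
        platform = my_cluster
        script = make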
That would be really helpful - please do!
Yes, that’s correct - each task has its own work directory, which the job uses as its CWD. However, jobs are submitted from the $HOME directory, so the relative paths specified in the job file are relative to $HOME.
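To illustrate, the generated job script contains Slurm directives along these lines (a sketch based on your log excerpt above - the exact layout may differ); because the paths are relative, Slurm resolves them against the submission directory ($HOME), not the task’s work directory:
# illustrative sketch of the directives Cylc writes into the Slurm job script
#SBATCH --output=cylc-run/downloader_suite_cl4/run12/log/job/20230818T0000Z/build_uscc_engine/01/job.out
#SBATCH --error=cylc-run/downloader_suite_cl4/run12/log/job/20230818T0000Z/build_uscc_engine/01/job.err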
Sadly, it doesn’t work like that in my setup as far as the directory Slurm uses is concerned, so I can’t run any Slurm jobs at the moment - they bomb out, presumably because Slurm can’t find the path where it’s supposed to put the out/err files.
In what sense does it “bomb out” exactly? Job submission failure, with Cylc; or job execution failure because it can’t write to the log files; or just that the log files don’t show up as expected? Do you get any error messages, either from Cylc or Slurm?
If you ssh to the host in question, what directory do you land in? It should be your home directory.
Hi Hilary,
Thanks for your response. I really value your time trying to help.
I’ve continued working on porting the suites and the Slurm issues have gone away. I don’t know what I did or what changed, but I now get the job.err/out files where they should be.
My problem now is that I only have terminal access to the Cylc host, and thus can only use cylc tui. It’s a really worthwhile addition to the toolkit, but as the warning suggests it might break with large suites, which sadly it does. I am getting the error message:
Timeout waiting for server response. This could be due to network or server issues. Check the workflow log.
I don’t see anything relating to timeouts in the scheduler log, and the message appears within milliseconds of opening Tui, when nothing could technically have timed out yet.
So I was wondering - is Tui still under active development? Where should I submit a bug report?
Thanks for all your help so far,
Fred
Good news on the Slurm front
Can you check that cylc tui works in your environment for a small workflow?
If it does, we can definitively pin the problem on the issue behind the “may break with large workflows” warning. For technical reasons that might take a while yet to overcome, Tui is not hooked up to the incremental “delta” subscription for workflow state data from the UI Server that the web UI uses, so it has to request inefficient global workflow state updates instead.
However, you may be in luck. @oliver.sanders recently tried using Python multiprocessing to separate the data feed from the terminal display main loop, which looks like it’s going to work well even with the current global data updates. I shouldn’t over-promise, but we might get this into the 8.3.0 release.
Will the platform admins not allow HTTPS, so that you could run cylc gui and view it in your local browser?
If not, until Tui is released with the aforementioned update, I guess you’ll just have to use the CLI: tail the scheduler log (e.g. cylc log -m tail <workflow-id>) and use a primitive roll-your-own monitor, e.g.:
$ watch -n 5 "cylc dump -t <workflow-id>"
If port forwarding is enabled (it often isn’t) you can use it to access the GUI in your local browser, e.g.
$ ssh -t -L 8888:localhost:8888 cylc-host 'bash -l -c "cylc gui --no-browser --Application.log_level=WARN"'
This will report a URL which can be opened locally.
Thank you very much. It took me a while to really take in how this works, but I now have the GUI running on my local machine using the port forwarding you described.
Thanks a lot!!
Yes, cylc tui works for small test suites. But it breaks for the described reasons for my full-blown production suite. So I’ll wait for cylc 8.3 for the updated tui.
Sadly, cylc dump also breaks in some cases, with the same error as cylc tui - probably due to the limitations of the same internal engine. Again, I’ll wait for Cylc 8.3.
As per my reply to @dpmatthews, I have now got the port forwarding working, so I can see cylc gui in my local browser, which is excellent news. So I’ll figure out the next steps in my porting journey using the GUI, and create a new ticket for any issues I face.
Thank you all so much for your input, and all of your help. I’m nearly there with the cylc7 → cylc8 porting, maybe just a few more weeks and it should finally run in cylc8.
What error are you getting?
The only reason cylc dump should break is if it takes over five seconds for the scheduler to send back a response.
You can override this timeout using the --comms-timeout option, e.g.:
$ cylc dump --comms-timeout=10
However, it really shouldn’t take more than 5 seconds to receive a response; I would expect it to take less than 1 second. Is the system under especially heavy load?
This is the error message:
$ cylc dump suite-name -t
ClientTimeout: Timeout waiting for server response. This could be due to network or server issues. Check the workflow log.
But the error doesn’t appear instantly (as it does with tui); it takes about 5 or 6 seconds to appear.
A longer --comms-timeout does indeed solve the issue. There are so many outstanding tasks that 30 seconds didn’t suffice, so I went all the way to 600 seconds and got the complete (and very long) task list.
Thanks!
Thanks for the update. I haven’t seen cylc dump time out before - that must be a monster of a workflow!
I’ll look at increasing the default timeout for cylc dump, but won’t set it as high as 600 seconds.
The returned list of tasks is 10303 lines long. That’s not normal, and is mainly a result of the changed behaviour of Cylc 8 regarding runahead. I have this in most of my suites:
runahead limit = P10D
I tend to use either 7D or 10D.
But if I start a workflow on a date in the past it already spawns tons and tons of tasks that shouldn’t be there. In my current example the most recent cycle point is 20231125T0000Z, while the most recent cycle point with incomplete tasks is at 20231003T0000Z. This seems very wrong - it shouldn’t go beyond 20231013T0000Z in my case - only run ahead by 10 days.
Was there a change concerning runahead behaviour in Cylc 8 that I’m not aware of?
The default runahead limit is now 5 cycles, not 3 (but you’re not using the default).
In general (though it depends somewhat on workflow structure) the Cylc 8 scheduler has to maintain an awareness of far fewer tasks at once than Cylc 7 did - only the current active tasks (where “active” really means “ready to run as far as the task graph is concerned”; some of those can still be in the waiting state, waiting on clock-triggers, xtriggers, and queues).
Tasks with no parents in the graph (which may however have clock or external triggers) are technically “ready to run according to the task graph” out to the end of the graph (or forever, if there is no end to it) - these should spawn out to the runahead limit and start checking on their xtriggers.
What you see in the output of cylc dump -t (and in the GUI) is a window extending 1 graph edge out from the active tasks, which may go beyond the runahead limit - but by default, only by 1 edge.
In Cylc 8 how much of the graph you see at runtime, around active tasks, is just a visualization choice.
Are you able to post a (very!) cut-down version of your workflow that exhibits the reported behaviour, for us to take a look at in case something is going wrong?
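Something as small as this would do - a minimal sketch where the cycling dates, graph and task names are purely illustrative:
[scheduling]
    initial cycle point = 20230801T0000Z
    runahead limit = P10D
    [[graph]]
        T00 = "fetch => process"
[runtime]
    [[fetch]]
        script = true
    [[process]]
        script = true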