Questions on configuring Cylc for heterogeneous systems

We’ve been using Cylc in the same way for a few years now, but we’re starting to hit scaling issues, and we’re sure someone, somewhere has already solved our problem…

Currently we run Cylc 7.x monitoring directly on suite hosts (which are just our login nodes). This means we use very slow X11 forwarding for the GUI since not all of our users are comfortable at the command-line.

We also have our cylc-run and cylc-work directories on Lustre filesystems, which normally hasn’t caused any problems, but some recent hiccups are showing that SQLite really doesn’t like it when the DB suddenly disappears. This usually results in dead suites and the need to restart things by hand. The Lustre filesystems are accessible from all our login nodes, but not from our workstations.

I’m trying to understand from the documentation how to set Cylc up so we can run the monitoring tools (gcylc/gscan) on our local workstations, and I’m not clear on exactly what needs to go where. I saw a recent discussion that listed the minimum requirements for a suite host as password-less SSH, a global.rc (defining what, I’m not sure), and access to the common cylc-run directory. That last point seems impossible if you’re monitoring from a separate system, so I think I’m missing something.

Was this specific scenario what Rose is/was for? Will this be easier to accomplish with Cylc 8?

Any tips are appreciated!

Yeah, that’ll result in a laggy GUI. A relatively easy solution might be to get proper remote desktop software installed on the login nodes, which avoids X11 forwarding for individual GUI applications.

However, you should be able to run the GUIs on your local workstations with the right network channels open. Have you seen the Cylc 7 documentation on remote control? https://cylc.github.io/cylc-doc/stable/html/running-suites.html#remote-control
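Roughly speaking, the workstation needs password-less SSH to the suite hosts, network access to the suite ports, and the suite auth files (the documentation above covers how those get onto the client). As a rough sketch, assuming Cylc 7.8, the workstation-side global.rc settings for scanning would look something like this (the host names are placeholders; check the global.rc reference for the exact item names and port range):

    [suite servers]
        # Hosts that gscan / cylc scan should poll for running suites:
        scan hosts = login1, login2
        # Port range the suite schedulers listen on:
        scan ports = 43001 .. 43100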

Does the DB disappear for good, or temporarily as a result of some kind of temporary filesystem outage?

If the DB disappears, the scheduler will detect that and shut down. We can’t really avoid that, because the DB holds the persistent workflow state. If it comes back when Lustre straightens itself out, you should be able to do a simple cylc restart to recover. Otherwise, a “warm start” from the beginning of a cycle point will be required.
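For reference, recovery in both cases is a one-liner, assuming Cylc 7 and a hypothetical suite name and cycle point:

    # DB intact (or back after the outage): pick up exactly where you left off
    cylc restart my.suite

    # DB gone for good: warm start from the top of a chosen cycle point
    cylc run --warm my.suite 20210701T0000Z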

Will this be easier to accomplish with Cylc 8?

Yes! And cylc-8.0 will be released in July, if things go according to plan. The new UI runs in your browser, which is expected to be remote, and it doesn’t require any local installation of auth files like Cylc 7 does.

In case of DB problems, even if the DB gets nuked, you can (re)start a Cylc 8 workflow from any task or tasks in the graph without reference to previous state (i.e. a “warm start” from a cycle point, which likely reruns some tasks, is no longer necessary).
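For example, with Cylc 8 something along these lines should do it (the workflow, task, and cycle point names are hypothetical; check cylc play --help for the exact option spelling):

    # Start a fresh scheduler run, beginning the graph at a chosen task
    # rather than replaying previous state from the DB:
    cylc play my_workflow --start-task=20210701T0000Z/foo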

Cylc 8 will also scale much better in terms of the footprint of individual workflows, because the new scheduling algorithm only has to keep track of active tasks (more or less).

By the way, use of interactive login nodes as suite hosts can lead to scaling problems too, depending on the number and size of the suites, how many GUIs are in use, and other load on the nodes.

We recommend using a small pool of dedicated suite hosts, which can be other non-login nodes on the HPC. Cylc can automatically start schedulers on those nodes, with basic load balancing at start-up. To users on the login nodes, this is completely transparent.
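As a rough sketch, the relevant Cylc 7 settings live in global.rc under [suite servers]; the host names below are placeholders and the full set of options is in the global.rc reference:

    [suite servers]
        # Dedicated pool of hosts on which suite schedulers may start:
        run hosts = cylc1, cylc2, cylc3
        # Ports the schedulers are allowed to listen on:
        run ports = 43001 .. 43100
        [[run host select]]
            # Basic load balancing at suite start-up (pick the least-loaded host):
            rank = load:5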

The DBs come back eventually once the systems detect an issue. We were hoping to avoid manual intervention to restart, since we’re looking at managing 50+ suites. We’ve considered using something external to track which suites should be running, so we can test against it, but we wanted to see if automation could handle it first. We’re using a script right now that looks for leftover contact files and assumes those suites were running before they were killed, but sometimes the contact files are missing even if a suite was running. I’m not sure why, and it didn’t really warrant investigation since we figured we should be tracking our “normal” run some other way.
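Roughly, we were imagining a wrapper along these lines (just a sketch; expected-suites.txt is a hypothetical file listing one suite name per line):

    # Restart any suite from our expected list that cylc scan doesn't report as running.
    while read -r suite; do
        if ! cylc scan | grep -q "^${suite} "; then
            echo "restarting ${suite}"
            cylc restart "${suite}"
        fi
    done < expected-suites.txt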

We are going to start looking at using compute nodes as suite hosts but in the past they had trouble communicating so I may have to dig deeper on that one. It would be ideal though since we’re crushing our login nodes.

I’ll have to see about opening ports for direct connections. Our security posture makes that difficult, but it’s a problem we’ll have to face once we move to Cylc 8 anyway, since we won’t be able to run web services on one of our primary HPCs (geographically separate from our monitoring).

One thing that still seems to be a problem, though, is keeping cylc-run on network storage, but I’ll read that remote control section more carefully and see if it answers my questions.

Thanks!

Note that in addition to the web UI, Cylc 8 also has a UI for monitoring and control that runs in the terminal: cylc tui (the Terminal UI), for those who can’t get remote web access. It’s currently only suitable for small workflows, but it will be upgraded to handle large ones efficiently shortly after 8.0 is released.
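Usage is just (workflow name hypothetical):

    # Open the terminal UI on an installed workflow:
    cylc tui my_workflow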

Here’s a screen recording of TUI in action (you’ll have to watch past the short demo of workflow installation and cylc scan first):
[screen recording: cylc-8-tui-demo]


Cylc 8 has workflow installation capability (via cylc install - see the start of the screen demo above). That can be configured to automatically symlink workflow run directories (or individual sub-directories thereof, if you like) to other storage locations, i.e. so that output is still accessible via the standard ~/cylc-run path, but the files are stored elsewhere. This is mainly intended to handle the typical home-space disk quota problem, but it might help if you have access to more reliable disk partitions too.
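As a sketch, in the Cylc 8 global configuration (global.cylc) that looks roughly like this; the target paths are placeholders:

    [install]
        [[symlink dirs]]
            [[[localhost]]]
                # Symlink the whole run directory to other storage...
                run = /some/large/filesystem
                # ...or just individual sub-directories, e.g.:
                # work = /some/scratch/filesystem
                # share = /some/shared/filesystem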

I would suggest getting started with Cylc 8 sooner rather than later, to avoid wasting effort on remote access for the obsolete Cylc 7 GUIs etc. 8.0 will be released later this month!

Hilary
