I’ve moved this comment to its own thread to avoid cluttering the UI Feedback thread with unrelated discussion.
My original comment:
The UI design [for Cylc 8] looks fantastic for how we monitor our run and I’m very excited for it. That said, I don’t like seeing “SQLite” in the architecture diagram… I would suggest switching to an ORM (such as SQLAlchemy), or at least allowing configuration of a proper DB, so we can easily tap into the Cylc databases without relying on a lot of disk I/O to read a bunch of SQLite files. That would really help us integrate Cylc into our end-to-end monitoring.
Edit: and yes, I’ve seen bug #2800 on GitHub and I disagree with the resolution! SQLite incurs an I/O penalty when you’re querying 25+ suites on, say, GPFS, or at least our variant of GPFS. LustreFS wasn’t much better… Storing the cylc-run dir on a local disk was helpful, but we don’t like the risk of that box dying and taking our entire run state with it.
Reply from Hilary (Cylc users: what do you think about the current, & proposed new, Cylc (G)UI?):
Thanks, it’s encouraging to hear people excited for the new UI.
On the database side, perhaps bump this to another thread to avoid sidetracking the GUI discussion but quickly…
I’m curious as to why you are querying the DBs of multiple suites. Are you trying to gather stats for multiple workflows to create a “total system” Cylc GUI? If so, it would be great to capture these requirements for Cylc UI development. This sort of usage may intersect with some of our plans.
Cylc provides event “hooks” for suites (e.g. startup, shutdown) and tasks (e.g. started, failed). Some use these hooks to drive end-to-end monitoring; I would have thought this would be the best solution (push vs pull).
Alternatively, the Cylc UI Server in the new architecture may provide a good DB proxy for your usage. The query language is GraphQL rather than SQL; through the UI Server you could write requests which gather data from multiple suites.
There are a few things I think using a central database would help with:
- It could improve suite_state xtrigger performance. This is the primary reason for my raising this concern. When we’re running 20+ suites in Cylc 7.8.3, the suite server (the actual machine Cylc is running on) becomes nearly unresponsive unless we increase the suite_state check interval to an unhelpfully large number (5 minutes or more). Our run is very time sensitive and 5 minutes can be the difference between on time and late.
- It would allow external reporting tools easy access to Cylc metrics. This would take the workload off the Cylc community to try and write something from scratch. Just give us access to the data and let us use our own tools (Splunk, in our case). Otherwise we’re probably still going to have to write wrappers, no matter how nice your solution might be. Getting software approved for our use is not fast or easy, so using something standard like a PostgreSQL database lets us piggyback on things we already have.
- It would allow Cylc to be less centralized. We could run the UI and server components in a cloud web service, have the database on a cloud DB, and be able to run jobs literally anywhere with the run states being well protected and backed up. It would also keep us from having to maintain a separate run database based on event triggers defined in each and every suite for our run metrics.
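For context, here is a minimal sketch of the kind of pull-based reporting I mean, where an external tool opens every suite’s SQLite file in turn. The table name `task_states` and its `status` column are my assumptions about the Cylc 7 schema, and `count_task_states` is a hypothetical helper, not part of Cylc. Each call costs one file open and one query per suite, which is where the shared-filesystem I/O adds up.

```python
import sqlite3
from collections import Counter
from pathlib import Path


def count_task_states(db_paths):
    """Return {db path: Counter of task status} for each suite database.

    Assumes each database has a ``task_states`` table with a ``status``
    column (an assumption about the schema, not something Cylc guarantees).
    """
    totals = {}
    for path in db_paths:
        # Open read-only so the reporting tool can never modify run state.
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT status, COUNT(*) FROM task_states GROUP BY status"
            ).fetchall()
        finally:
            conn.close()
        totals[str(path)] = Counter(dict(rows))
    return totals
```

With a central database the same report would be a single query against one server instead of a loop over files on GPFS.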
Edit: I reread Hilary’s response, and to speak directly to his two suggestions:
Use events to push state to monitoring tools: We would just use whatever UI Cylc provides for monitoring. My suggestion is based more on analytics, which run on a fixed schedule and may not need to be that up-to-date.
Use GraphQL interface of the Cylc UI server: This might be a possible solution. We’d prefer to just tap into PostgreSQL but we could write a scraper in GraphQL to ingest data from the Cylc UI server…
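If we went the GraphQL route, the scraper might look roughly like the sketch below. The query fields (`workflows`, `id`, `status`) and the endpoint URL are placeholders I made up for illustration; I don’t know the real UI Server schema yet. The only part I’m sure of is the transport: a GraphQL request is normally a JSON body with a `query` key sent via HTTP POST.

```python
import json
import urllib.request

# Hypothetical query: these field names are placeholders, not the real schema.
QUERY = """
query {
  workflows {
    id
    status
  }
}
"""


def build_request(endpoint, query=QUERY):
    """Build a POST request carrying a GraphQL query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(endpoint, query=QUERY, timeout=10):
    """Send the query and return the decoded JSON response."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A scheduled job could call `fetch("http://localhost:8000/graphql")` (placeholder URL) and forward the result to Splunk, but that is still a wrapper we would have to write and maintain.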