Re-trying failed db transactions

I’ve been adding logging to the cylc DB transactions in rundb.py while trying to run down why our workflows keep crashing.

We’ve found that the failures are due to an I/O error returned by SQLite when acquiring a lock (SQLITE_IOERR_LOCK). The database is hosted on NFS, and when Active Directory hiccups we occasionally see locking failures.

2025-04-15T05:21:41Z ERROR - An error occurred when writing to the database
${user}/.service/db, this is probably a filesystem issue.
The error was: disk I/O error
SQLite error code: 3850
SQLite error name: SQLITE_IOERR_LOCK

Looking at cylc-flow/cylc/flow/rundb.py at commit 53a268d5d50c32a5222c3b3d9b38b3f4a03c477f, I’m wondering if Cylc really has to abort.

Could we put these transactions in an exponential-backoff retry loop for five attempts or so?
In general, I’d like to be resilient against a wider range of underlying infrastructure hiccups.

I notice that the exception is only raised when self.is_public == False, and the note in the original change suggests dying immediately on the private database for integrity reasons. I’m wondering if there’s some middle ground that doesn’t involve immediate death.
Is it reasonable to re-attempt all transactions?
If not, is it reasonable to specifically look for locking errors (others?) and retry on a subset?

If there’s a reasonable path here, my first instinct on implementation would be to put the entire try..finally block in a for loop, with checks in the exception handlers so that the abort actions are skipped while retries remain. I think the finally block should probably be left alone, as closing and re-opening the DB connection seems the sanest way to retry cleanly. Does that seem reasonable?
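
To make that concrete, here is a minimal sketch of the shape I have in mind (the function name, constants, and statement-batch format below are hypothetical, not the actual rundb.py structure):

    import sqlite3
    import time

    MAX_RETRIES = 5    # hypothetical cap, per the suggestion above
    BASE_DELAY = 0.1   # seconds; doubled after each failed attempt

    def execute_with_retries(db_path, statements):
        """Run a batch of (sql, params) statements in one transaction.

        On a transient error the connection is closed and re-opened,
        then the whole transaction is retried with exponential backoff.
        """
        for attempt in range(MAX_RETRIES):
            conn = sqlite3.connect(db_path, timeout=0.2)
            try:
                with conn:  # commits on success, rolls back on exception
                    for sql, params in statements:
                        conn.execute(sql, params)
                return  # success
            except sqlite3.OperationalError:
                if attempt == MAX_RETRIES - 1:
                    raise  # retries exhausted: let the caller abort
                time.sleep(BASE_DELAY * 2 ** attempt)
            finally:
                conn.close()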

Cylc has two databases, the “private” and the “public”; both are sqlite3 DBs:

  • The public DB is for downstream applications such as Cylc Review and some Rose functionality (this avoids those applications locking the private DB).
  • The private DB is a functional data store used to record Cylc’s progress through the workflow, containing events and object states.

We actually do use retries for the public database (code); however, we don’t do this for the private database, as write failures there could compromise Cylc’s integrity and production robustness.

Cylc is intended to be power-out-safe. That is to say, if you pull the cord on the machine a Cylc scheduler is running on (or kill -9 the scheduler process, or whatever), you should still be able to recover the workflow safely. This is a large part of why Cylc <8 would not move onto the next event loop iteration before flushing DB updates from the last. If we didn’t do this:

  • Job submissions may be missed (and subsequently over-run).
  • User interventions may be missed.
  • Workflow events may be missed (and subsequently diverge).

Additionally, Cylc 8 can potentially both read from and write to the private DB within a single main loop iteration. This reduces the amount of data we have to hold in memory, ensuring that cause and effect are upheld even across distant events (except where we haven’t read in all the right fields).

The filesystem is in effect a message broker; Cylc (along with sqlite3 and other things) requires it to be consistent for correct function.

Sidenote: why we use sqlite3 databases:
  • We and our workflows are already relying on filesystem consistency.
  • Sqlite3 DBs are decentralised (no central service to manage or to go wrong, no direct coupling between schedulers).
  • Sqlite3 is very simple, has low overheads and scales well to our requirements.
  • We have no requirement for multi-threaded DB writes.

As a result, we can’t safely ignore DB write errors and proceed. Moreover, we really shouldn’t treat filesystem bugs as business as usual and attempt to work around them: they shouldn’t be possible in the first place, and their consequences can be impossible to predict or safely contain. Note that we allow retries on the public DB to work around any potential locking issues, not to handle FS issues.

I’m wondering if there’s some middle ground that doesn’t involve immediate death.

We could potentially freeze the Cylc event loop and perform a series of retries in a blocking manner. The scheduler would not be responsive during this period: no task events, job submissions, user commands, GUI updates, etc. would be processed, and commands may time out, but Cylc would eventually recover from message timeouts via polling. If the retries were exhausted, then the scheduler would die. Not terribly keen on working around this (FS integrity is a working assumption), but it’s a possibility.

As an alternative to retrying failed transactions, would it be possible to increase the sqlite connection timeout value? I note that it’s currently set to 0.2 seconds in rundb.py#L209 and passed to sqlite3.connect in rundb.py#L455; sqlite3 would otherwise default this value to 5 seconds.
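
For reference, the value in question is the standard busy-timeout argument to Python’s sqlite3 module (the path below is just a placeholder):

    import sqlite3

    # "timeout" is how long sqlite3 waits for a database lock to clear
    # before giving up with a "database is locked" error. The library
    # default is 5.0 seconds; rundb.py currently passes a much shorter
    # value (0.2 s).
    conn = sqlite3.connect("path/to/workflow/db", timeout=5.0)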

I’m just starting to get my head around the codebase, so I don’t completely understand the potential impact of increasing this timeout value, though I can see in the original MR that there was some discussion about balancing the risk of db locks with maintaining db integrity (if I’m understanding correctly).

We’re considering trying out an increase to this value to see whether that reduces the frequency of the issue described above. We’re also in the process of trying to identify the root cause, but at the same time, if the fix involved making Cylc slightly more resilient to transient infrastructure issues (network latency, I/O slow-downs, etc.) then that may be a good outcome.

Hoping for some advice on:
(i) Do we risk database integrity by increasing this timeout value? Are you able to suggest a reasonable maximum value?
(ii) If we do increase this value, and it does resolve the issue we’re seeing above, would you be willing to consider having that change made to the Cylc source code? Again, we are aiming to find the root cause, but increased resilience is always nice (if it doesn’t come with too much of a trade-off)!

I’m a bit hazy on the relevant internals at this point, but we now update the DB in the main process (not in a background thread) to ensure that the on-disk and in-memory states are consistent (with some jitter, as DB writes are batched to some extent).

So I think a longer connection timeout would just result in the scheduler becoming noticeably unresponsive while waiting on the timeout, if the underlying infrastructure flakes out for a bit.

You can of course tweak that value in your own Cylc installation to test it (if you haven’t done so already?). I’m sure we’d be amenable to making the timeout configurable, if it helps. (However, let’s see if others on the dev team agree with what I’ve said here).

Yes, increasing the timeout to 5 seconds would cause the scheduler to be unresponsive for up to 5 seconds in the event of a timeout. Consequences:

  • Client commands will time out as the default client timeout is also 5 seconds (overridable but not configurable, higher values have consequences for stopped workflow detection).
  • Because Cylc frequently re-connects to the DB, these timeouts could repeat with high frequency.
  • Subprocesses will be held up (e.g. job submissions, event handlers, etc).
  • Task statuses may lag.

A shorter timeout, e.g. 1 second (IO really shouldn’t take longer than this), might be more manageable.


would you be willing to consider having that change made to the Cylc source code

It is common for infrastructure issues to be pushed onto Cylc; the codebase now contains various workarounds for obscure platform issues. Unfortunately, this makes our job harder, as the effort of developing features and configurations is much less than the effort of maintaining them. We now have a vast configuration/feature set whose parts often develop complex interactions, which generate bugs and bog down development.

So, harmless as a single configuration may seem, we might not be keen to accept workarounds for infrastructure issues, in order to keep our code focused on the job of being a good scheduler. However, we are of course very keen to work towards robustness where we can.

As it stands, the current implementation routinely opens and closes the DB, which I’m not particularly fond of (but there may be some reason for it having been set up this way).

It’s possible that we could persist the connection instead of opening/closing it. This would reduce the number of connection operations, greatly reducing the likelihood of a connection timeout.
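
Very roughly, something like this (the class and method names are illustrative only, not the existing DAO interface):

    import sqlite3

    class PersistentDAO:
        """Sketch of holding one long-lived connection per database."""

        def __init__(self, db_path, timeout=0.2):
            self.db_path = db_path
            self.timeout = timeout
            self._conn = None

        def connect(self):
            # Re-use the existing connection rather than re-connecting
            # for every batch of writes.
            if self._conn is None:
                self._conn = sqlite3.connect(
                    self.db_path, timeout=self.timeout)
            return self._conn

        def close(self):
            if self._conn is not None:
                self._conn.close()
                self._conn = None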

I don’t know if this will help with the issues you’re seeing, though, as the connection timeout is probably catching the I/O issue early, effectively protecting against a write timeout or hang.