Re-trying failed db transactions

I’ve been adding logging to the cylc DB transactions in rundb.py while trying to run down why our workflows keep crashing.

We’ve found that the failures are due to an I/O error returned by SQLite when acquiring a lock (SQLITE_IOERR_LOCK). The database is hosted on NFS, and when Active Directory hiccups we occasionally see locking failures.

    2025-04-15T05:21:41Z ERROR - An error occurred when writing to the database
    ${user}/.service/db, this is probably a filesystem issue.
    The error was: disk I/O error
    SQLite error code: 3850
    SQLite error name: SQLITE_IOERR_LOCK

Looking at cylc/flow/rundb.py in the cylc/cylc-flow repository on GitHub (at commit 53a268d5d50c32a5222c3b3d9b38b3f4a03c477f), I’m wondering if Cylc really has to abort.

Could we put these transactions in an exponential backoff retry loop for 5 attempts or so?
In general, I’d like to be resilient against a wider range of underlying infrastructure hiccups.

I notice that the exception is only raised when self.is_public is False, and the note in the original change suggests dying immediately on the private database for integrity reasons. I’m wondering if there’s some middle ground that doesn’t involve immediate death.
Is it reasonable to re-attempt all transactions?
If not, is it reasonable to specifically look for locking errors (others?) and retry on a subset?
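
For what it’s worth, picking out the locking case specifically looks straightforward on Python 3.11+, where sqlite3 exceptions carry the extended result code and name (which is where the 3850 / SQLITE_IOERR_LOCK in the log above comes from). A rough sketch, with the set of retryable codes being my assumption:

    import sqlite3

    # SQLITE_IOERR_LOCK is the extended result code SQLITE_IOERR (10) | (15 << 8) == 3850
    RETRYABLE_CODES = {3850}  # could be widened to other transient errors

    def is_retryable(exc: sqlite3.Error) -> bool:
        """True if the error looks like a transient locking failure."""
        code = getattr(exc, 'sqlite_errorcode', None)  # available from Python 3.11
        if code is not None:
            return code in RETRYABLE_CODES
        # older Pythons only expose the message string
        return 'disk I/O error' in str(exc)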

If there’s a reasonable path here, my first instinct on implementation would be to wrap the entire try..finally block in a for loop and add checks to both exception handlers so that, if retries remain, their respective abort actions are skipped. I think the finally block should probably be left closing the connection, since closing and re-opening the db connection seems the sanest way to retry cleanly. Seem reasonable?
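
Roughly what I have in mind, as a sketch only (this is not the actual rundb.py code; the statement list, number of tries and delays are placeholders):

    import random
    import sqlite3
    import time

    MAX_TRIES = 5
    BASE_DELAY = 0.5  # seconds, doubled on each retry

    def execute_transaction(db_file, stmts):
        """Run one transaction (a list of (sql, args) pairs) with retries."""
        for attempt in range(1, MAX_TRIES + 1):
            conn = sqlite3.connect(db_file)
            try:
                with conn:  # commits on success, rolls back on error
                    for sql, args in stmts:
                        conn.execute(sql, args)
                return  # success
            except sqlite3.Error:
                # could also filter here with something like is_retryable() above
                if attempt == MAX_TRIES:
                    raise  # retries exhausted: abort as the code does today
                # exponential backoff with a little jitter; note this blocks the
                # scheduler while it sleeps
                time.sleep(BASE_DELAY * 2 ** (attempt - 1) + random.uniform(0, 0.1))
            finally:
                # close and re-open the connection between attempts
                conn.close()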

Cylc has two databases, the “private” and the “public”; both are sqlite3 DBs:

  • The public DB is for downstream applications such as Cylc Review and some Rose functionality, avoiding the risk of them locking the private DB (see the read-only sketch after this list).
  • The private DB is the functional data store Cylc uses to record its progress through the workflow, containing events and object states.
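
For illustration, downstream readers open the public DB directly with sqlite3, ideally in read-only mode so they can never take a write lock. Something like the following; the path and workflow name are examples only, and log/db is just the usual location of the public DB under a Cylc 8 run directory:

    import sqlite3
    from pathlib import Path

    # Example location only: the public DB normally sits under the run
    # directory at log/db (the private one is at .service/db and should
    # not be read directly).
    db_path = Path.home() / 'cylc-run' / 'my_workflow' / 'run1' / 'log' / 'db'

    # Open via a URI in read-only mode so a reader cannot lock out the scheduler.
    conn = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
    for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"):
        print(name)
    conn.close()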

We actually do use retries for the public database (code); however, we don’t do this for the private database, as write failures could compromise Cylc’s integrity and production robustness.

Cylc is intended to be power-out-safe. That is to say, if you pull the cord on the machine a Cylc scheduler is running on (or kill -9 the scheduler process, or whatever), you should still be able to recover the workflow safely. This is a large part of why Cylc <8 would not move on to the next event loop iteration before flushing DB updates from the last (see the toy model after this list). If we didn’t do this:

  • Job submissions may be missed (and subsequently over-run).
  • User interventions may be missed.
  • Workflow events may be missed (and subsequently diverge).
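
A toy model of that rule (not Cylc’s actual code or schema), just to show the ordering that matters:

    import sqlite3

    conn = sqlite3.connect(':memory:')  # stand-in for the private DB
    conn.execute('CREATE TABLE events (event TEXT)')  # toy schema

    def main_loop_iteration(pending_events, db_queue):
        # 1. Turn this iteration's outcomes into queued DB updates.
        for event in pending_events:
            db_queue.append(('INSERT INTO events (event) VALUES (?)', (event,)))

        # 2. Flush them synchronously; if the commit fails we must not carry on,
        #    otherwise a crash now could lose job submissions, interventions or
        #    task events, and a restart would diverge from reality.
        with conn:
            for sql, args in db_queue:
                conn.execute(sql, args)
        db_queue.clear()

        # 3. Only now is it safe to act on the decisions (submit jobs, etc.).

    main_loop_iteration(['task foo submitted'], [])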

Additionally, Cylc 8 can potentially both write to and read from the private DB within a single main loop iteration. This reduces the amount of data we have to hold in memory while ensuring that cause and effect are upheld even across distant events (except where we haven’t read in all the right fields).

The filesystem is, in effect, a message broker; Cylc (along with sqlite3 and other things) requires it to be consistent for correct function.

Sidenote: why we use sqlite3 databases:
  • We and our workflows are already relying on filesystem consistency.
  • Sqlite3 DBs are decentralised (no central service to manage or to go wrong, no direct coupling between schedulers).
  • Sqlite3 is very simple, has low overheads and scales well to our requirements.
  • We have no requirement for multi-threaded DB writes.

As a result, we can’t safely ignore DB write errors and proceed. Moreover, we really shouldn’t treat filesystem bugs as business as usual and attempt to work around them: they shouldn’t be possible in the first place, and their consequences can be impossible to predict or safely contain. Note: we allow retries on the public DB to work around potential locking issues, not to handle FS issues.

“I’m wondering if there’s some middle ground that doesn’t involve immediate death.”

We could potentially freeze the Cylc event loop and perform a series of retries in a blocking manner. The scheduler would not be responsive during this period: no task events, job submissions, user commands, GUI updates, etc. would be processed, and commands may time out, but Cylc would eventually recover from message timeouts via polling. If the retries are exhausted, the scheduler would die. I’m not terribly keen on working around this (FS integrity is a working assumption), but it’s a possibility.