I’ve been adding logging to the cylc DB transactions in rundb.py while trying to run down why our workflows keep crashing.
We’ve found that the failures are due to an I/O error returned by SQLite while acquiring a file lock (SQLITE_IOERR_LOCK). The database lives on NFS, and when Active Directory hiccups we occasionally see these locking failures.
```
2025-04-15T05:21:41Z ERROR - An error occurred when writing to the database
${user}/.service/db, this is probably a filesystem issue.
The error was: disk I/O error
SQLite error code: 3850
SQLite error name: SQLITE_IOERR_LOCK
```
Looking at `cylc/flow/rundb.py` at commit 53a268d5d50c32a5222c3b3d9b38b3f4a03c477f in cylc/cylc-flow on GitHub, I’m wondering whether cylc really has to abort.
Could we put these transactions in a retry loop with exponential backoff, for five attempts or so?
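For concreteness, the general shape I have in mind is something like this (a hypothetical helper, not anything from the cylc codebase):

```python
import time

def with_backoff(func, max_tries=5, base_delay=0.1, retry_on=(Exception,)):
    """Call func(), retrying with exponential backoff on retry_on.

    Hypothetical helper: with these defaults the waits between the
    five attempts are 0.1, 0.2, 0.4 and 0.8 seconds.
    """
    for attempt in range(max_tries):
        try:
            return func()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of retries: propagate, i.e. abort as now
            time.sleep(base_delay * 2 ** attempt)
```

where `func` would be whatever performs the transaction attempt.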
In general, I’d like to be resilient against a wider range of underlying infrastructure hiccups.
I notice that the exception is only raised when `self.is_public == False`, and the note in the original change suggests dying immediately on the public database for integrity reasons. I’m wondering if there’s some middle ground that doesn’t involve immediate death.
Is it reasonable to re-attempt all transactions?
If not, is it reasonable to look specifically for locking errors (and perhaps others?) and retry on just that subset?
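On Python 3.11+, sqlite3 exceptions carry `sqlite_errorcode` and `sqlite_errorname` attributes (presumably where the 3850 / SQLITE_IOERR_LOCK in my log above comes from), so a whitelist check would be cheap. A sketch, with a guessed-at set of retryable names:

```python
import sqlite3

# Extended result codes that look transient on a flaky NFS mount; the
# exact set is debatable and these two are just my first guesses.
RETRYABLE = {"SQLITE_IOERR_LOCK", "SQLITE_BUSY"}

def is_retryable(exc: sqlite3.Error) -> bool:
    """True if exc looks like a transient locking/contention failure.

    sqlite3 exceptions gained `sqlite_errorname` (and `sqlite_errorcode`,
    e.g. 3850 for SQLITE_IOERR_LOCK) in Python 3.11; on older
    interpreters this returns False and we'd fail hard as today.
    """
    return getattr(exc, "sqlite_errorname", None) in RETRYABLE
```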
If there’s a reasonable path here, my first instinct on implementation would be to wrap the entire try…finally block in a for loop, with a check in each exception handler so that the abort actions are skipped while retries remain. I think the finally block should be left as closing the connection, since closing and re-opening the DB connection seems the sanest way to retry cleanly. Does that seem reasonable?
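Roughly like this, purely as a sketch: `dao`, `connect`, `close` and `queued_items` are stand-in names for whatever rundb.py actually uses, not the real API:

```python
import sqlite3
import time

MAX_TRIES = 5
BASE_DELAY = 0.1  # seconds; doubled after each failed attempt

def execute_queued_items_with_retry(dao):
    """Retry the whole transaction, reconnecting between attempts.

    `dao` is any object with connect()/close() and a `queued_items`
    list of (sql, args) pairs -- hypothetical names for this sketch.
    """
    for attempt in range(MAX_TRIES):
        try:
            conn = dao.connect()
            for sql, args in dao.queued_items:
                conn.execute(sql, args)
            conn.commit()
            return
        except sqlite3.Error as exc:
            transient = getattr(exc, "sqlite_errorname", None) == "SQLITE_IOERR_LOCK"
            if attempt == MAX_TRIES - 1 or not transient:
                raise  # out of retries, or not a locking error: abort as now
            time.sleep(BASE_DELAY * 2 ** attempt)
        finally:
            # close unconditionally: re-opening on the next pass should
            # discard any half-held NFS locks and give a clean retry
            dao.close()
```

Since a failed commit leaves the transaction unapplied, re-running the whole statement list on the next attempt shouldn't double-apply anything, which is what makes retrying at this level feel safe to me.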