Rose task-run - "database is locked"

One of my colleagues recently saw this error in their job.err output.

rose task-run --verbose -O '(bom)' -O '(bom-edc)' '--path=share/fcm_make_*' '--path=share/BINARIES/*' '--path=share/BINARIES/*/scripts' '--path=share/BINARIES/*/bin'
[FAIL] database is locked

I am guessing it is the database that rose task-run puts in CYLC_TASK_WORK_DIR. If so, the filesystem in question is Lustre, although I can’t tell you which version of Lustre without asking others; I can find out if that information is needed.

The task is a rose-bunch task, running a dozen or so instances of the script in parallel.

Rose version is 2.1.0.

The task reran successfully after being triggered manually once three reruns had failed, but I was wondering whether this is a known issue and whether there is anything we can do to reduce these intermittent failures.

“Database is locked” is the error raised when requests are made on the database from multiple processes at the same time and one of those requests blocks another.
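
For reference, “database is locked” is SQLite’s wording, and it’s easy to reproduce in isolation. The sketch below (nothing to do with Rose’s own code) opens two connections to the same file: the first holds a write lock open, and the second, told not to wait for it, fails straight away with the same error.

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# First connection: take the write lock and hold it open.
holder = sqlite3.connect(db, isolation_level=None)  # autocommit mode; manage transactions by hand
holder.execute("CREATE TABLE states (name TEXT, status TEXT)")
holder.execute("BEGIN IMMEDIATE")
holder.execute("INSERT INTO states VALUES ('sub-job-01', 'running')")

# Second connection: timeout=0 means "do not wait for the lock at all".
other = sqlite3.connect(db, timeout=0)
try:
    other.execute("INSERT INTO states VALUES ('sub-job-02', 'running')")
except sqlite3.OperationalError as exc:
    print(exc)  # -> database is locked
```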

The Rose Bunch database is used for tracking the states of the “sub-jobs” that it runs. Rose Bunch only accesses the database from the master process, so locking should not happen under normal circumstances.

Locking suggests that a second process was trying to access the database, perhaps another job submission from the same task? My best guess at debugging would be to use lsof to see what processes were poking at the database, possibly in an err-script, though catching the issue as it’s happening might be tricky.
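
If it would help with future debugging, something along these lines could be wired into an err-script: it simply asks lsof which processes have the file open at the moment the task fails. The database path is taken as an argument rather than hard-coded, since I don’t know the exact filename off-hand.

```python
#!/usr/bin/env python3
"""Sketch of an err-script helper: report which processes currently have
the rose_bunch state database open. The database path is passed in as an
argument (e.g. the file under $CYLC_TASK_WORK_DIR)."""
import subprocess
import sys

db_path = sys.argv[1]

# lsof exits non-zero when nothing has the file open, so don't use check=True.
result = subprocess.run(["lsof", db_path], capture_output=True, text=True)
sys.stdout.write(result.stdout or f"no process currently has {db_path} open\n")
```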

I’m not sure how that could happen. The db is in the task’s work directory, and only one task instance can run at a time. Could it happen if a disk is temporarily unresponsive and the issue is mis-reported, or does the locking error come from specific logic whereby a lock is found and not released within a time window?

It isn’t a common occurrence, so capturing it isn’t easy; I’m not sure it has happened again since this one set of incidents.

Cylc won’t do this in and of itself; however, there is no mechanism to prevent it from happening.

“Could it happen if a disk is temporarily unresponsive”

Not to the best of my knowledge; we usually see filesystem issues reported as “Disk IO Error”.
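
On the second half of the earlier question: yes, there is a time window involved. SQLite’s busy handler waits up to the connection’s timeout for the lock to be released and only raises “database is locked” if it is still held when that window expires. A rough illustration (again, nothing Rose-specific):

```python
import os
import sqlite3
import tempfile
import threading
import time

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# One connection grabs the write lock and releases it after a second.
holder = sqlite3.connect(db, isolation_level=None, check_same_thread=False)
holder.execute("CREATE TABLE t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")

def release_lock():
    time.sleep(1)
    holder.execute("COMMIT")

threading.Thread(target=release_lock).start()

# The other connection waits up to 5 seconds for the lock; because it is
# released within that window, the insert succeeds. Shrink the timeout
# below the hold time and you get "database is locked" instead.
waiter = sqlite3.connect(db, timeout=5)
waiter.execute("INSERT INTO t VALUES (1)")
waiter.commit()
print("insert succeeded once the lock was released")
```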


I’m not aware of any other reports of this. Do the workflow / job logs reveal anything of interest?