Rose task-run - "database is locked"

One of my colleagues recently saw this error in their job.err output.

rose task-run --verbose -O '(bom)' -O '(bom-edc)' '--path=share/fcm_make_*' '--path=share/BINARIES/*' '--path=share/BINARIES/*/scripts' '--path=share/BINARIES/*/bin'
[FAIL] database is locked

I am guessing it is the database that rose task-run puts in CYLC_TASK_WORK_DIR. If so, the filesystem in question is Lustre, although I can’t tell you which version of Lustre without asking others; I can find out if that information is needed.

The task is a rose-bunch task, running a dozen or so instances of the script in parallel.

Rose version is 2.1.0.

The task reran successfully after being triggered manually once three reruns had failed, but I was wondering whether this is a known issue and whether there is anything we can do to reduce these intermittent failures.

“Database is locked” is the error raised when requests are made on the database from multiple processes at the same time and one of those requests blocks another.
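
For reference, “database is locked” is SQLite’s wording, and it’s easy to reproduce in isolation. The sketch below (nothing to do with Rose’s own code) opens two connections to the same file: the first holds a write lock open, and the second, told not to wait for it, fails straight away with the same error.

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# First connection: take the write lock and hold it open.
holder = sqlite3.connect(db, isolation_level=None)  # autocommit mode; manage transactions by hand
holder.execute("CREATE TABLE states (name TEXT, status TEXT)")
holder.execute("BEGIN IMMEDIATE")
holder.execute("INSERT INTO states VALUES ('sub-job-01', 'running')")

# Second connection: timeout=0 means "do not wait for the lock at all".
other = sqlite3.connect(db, timeout=0)
try:
    other.execute("INSERT INTO states VALUES ('sub-job-02', 'running')")
except sqlite3.OperationalError as exc:
    print(exc)  # -> database is locked
```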

The Rose Bunch database is used for tracking the states of the “sub-jobs” that it runs. Rose Bunch only accesses the database from the master process, so locking should not happen under normal circumstances.

Locking suggests that a second process was trying to access the database, perhaps another job submission from the same task? My best guess at debugging would be to use lsof to see what processes were poking at the database, possibly in an err-script, though catching the issue as it’s happening might be tricky.
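
If it would help with future debugging, something along these lines could be wired into an err-script: it simply asks lsof which processes have the file open at the moment the task fails. The database path is taken as an argument rather than hard-coded, since I don’t know the exact filename off-hand.

```python
#!/usr/bin/env python3
"""Sketch of an err-script helper: report which processes currently have
the rose_bunch state database open. The database path is passed in as an
argument (e.g. the file under $CYLC_TASK_WORK_DIR)."""
import subprocess
import sys

db_path = sys.argv[1]

# lsof exits non-zero when nothing has the file open, so don't use check=True.
result = subprocess.run(["lsof", db_path], capture_output=True, text=True)
sys.stdout.write(result.stdout or f"no process currently has {db_path} open\n")
```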

I’m not sure how that could happen. The db is in the task’s work directory, and only one task instance can run at a time. Could it happen if a disk is temporarily unresponsive and the issue is mis-reported, or does the locking error come from specific logic whereby a lock is found and not released within a time window?

It isn’t a common occurrence, so capturing it isn’t easy; I’m not sure it has happened again since this one set of incidents.

Cylc won’t do this in and of itself; however, there is no mechanism to prevent it from happening.

“Could it happen if a disk is temporarily unresponsive”

Not to the best of my knowledge; we usually see filesystem issues reported as “Disk IO Error”.
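
On the second half of the earlier question: yes, there is a time window involved. SQLite’s busy handler waits up to the connection’s timeout for the lock to be released and only raises “database is locked” if it is still held when that window expires. A rough illustration (again, nothing Rose-specific):

```python
import os
import sqlite3
import tempfile
import threading
import time

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# One connection grabs the write lock and releases it after a second.
holder = sqlite3.connect(db, isolation_level=None, check_same_thread=False)
holder.execute("CREATE TABLE t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")

def release_lock():
    time.sleep(1)
    holder.execute("COMMIT")

threading.Thread(target=release_lock).start()

# The other connection waits up to 5 seconds for the lock; because it is
# released within that window, the insert succeeds. Shrink the timeout
# below the hold time and you get "database is locked" instead.
waiter = sqlite3.connect(db, timeout=5)
waiter.execute("INSERT INTO t VALUES (1)")
waiter.commit()
print("insert succeeded once the lock was released")
```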


I’m not aware of any other reports of this. Do the workflow / job logs reveal anything of interest?