hi everyone,
just wondering if anyone else has seen cylc failures of the sort…
sqlite3.OperationalError: unable to open database file
… causing intermittent failures of some tasks?
at this stage i don’t even know if this is a cylc error or something to do with the platform i’m running on, but i vaguely remember sqlite3 having something to do with cylc?
cheers,
jonny
… fyi also now getting this seemingly-related error…
sqlite3.OperationalError: database is locked
Hi @jonnyhtw -
Yes, the Cylc scheduler maintains an SQLite database as the persistent state of the workflow (e.g. to enable restart after downtime).
There shouldn’t be any problems with the DB unless some other process is locking it, or there are (e.g.) filesystem issues or a full disk quota.
And even in that case, it should not cause “intermittent failures of some tasks”, because Cylc task jobs do not access the run DB, only the scheduler itself does.
So can you give any more detail on the sequence of events? A good chunk of the scheduler log either side of that error might be helpful.
(You are seeing the error in the log file, I presume?)
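For reference, here is a minimal sketch of how the "database is locked" message arises when a second process (here, a second connection) touches an SQLite file while another holds a write lock. It uses only Python's standard `sqlite3` module; the function name `demonstrate_lock` and the table `t` are made up for illustration and have nothing to do with Cylc's own code or schema.

```python
import os
import sqlite3
import tempfile


def demonstrate_lock(db_path):
    """Return the error text a second connection gets while the first holds a lock."""
    # isolation_level=None means autocommit mode, so transactions are explicit.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    holder.execute("BEGIN EXCLUSIVE")  # take the lock and keep holding it

    # timeout=0 makes the second connection fail at once instead of waiting.
    other = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        other.execute("INSERT INTO t VALUES (1)")
        return None  # no contention was seen
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        other.close()
        holder.execute("ROLLBACK")  # release the lock
        holder.close()


if __name__ == "__main__":
    # Hypothetical scratch location, used only for this demonstration.
    scratch = tempfile.mkdtemp()
    print(demonstrate_lock(os.path.join(scratch, "demo.db")))
```

With a non-zero `timeout` the second connection would wait that long for the lock before giving up, which is why this kind of error tends to show up intermittently rather than every time.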
yes i’m seeing this in the job.err file.
here’s some scheduler log for submission 3 of the failing compile_lfric_atm task (which has run successfully many times before)…
2024-01-25T00:26:20Z INFO - [19780901T0000Z/compile_lfric_atm failed job:02 flows:1] => waiting
2024-01-25T00:26:20Z INFO - Command actioned: force_trigger_tasks(['19780901T0000Z/compile_lfric_atm'], flow=['all'], flow_wait=False, flow_descr=None)
2024-01-25T00:26:20Z INFO - [19780901T0000Z/compile_lfric_atm waiting job:03 flows:1] => preparing
2024-01-25T00:26:20Z WARNING - stall timer stopped
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm preparing job:03 flows:1] submitted to maui-xc-slurm:slurm[3728135]
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm preparing job:03 flows:1] => submitted
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm submitted job:03 flows:1] health: submission timeout=P1D, polling intervals=PT15M,...
2024-01-25T00:28:21Z INFO - [19780901T0000Z/compile_lfric_atm submitted job:03 flows:1] => running
2024-01-25T00:28:21Z INFO - [19780901T0000Z/compile_lfric_atm running job:03 flows:1] health: execution timeout=None, polling intervals=12*PT15M,PT1M,PT2M,PT7M,...
2024-01-25T00:30:18Z INFO - [19780901T0000Z/compile_lfric_atm running job:03 flows:1] => failed
2024-01-25T00:30:18Z WARNING - [19780901T0000Z/compile_lfric_atm failed job:03 flows:1] did not complete required outputs: ['succeeded']
2024-01-25T00:30:19Z ERROR - Incomplete tasks:
* 19780901T0000Z/compile_lfric_atm did not complete required outputs: ['succeeded']
2024-01-25T00:30:19Z WARNING - Partially satisfied prerequisites:
* 19780901T0000Z/postproc is waiting on ['19780901T0000Z/lfnctopp:succeeded']
* 19780901T0000Z/coupled is waiting on ['19780901T0000Z/compile_lfric_atm:succeeded']
* 19780901T0000Z/lfnctopp is waiting on ['19780901T0000Z/coupled:succeeded']
2024-01-25T00:30:19Z CRITICAL - Workflow stalled
also i’m nowhere near quota
Are you sure that’s a job.err file? It’s a scheduler log file!
Also, I can’t see any DB errors there.
And it just shows that you manually triggered a task, and the task failed, which (not surprisingly) causes the workflow to stall because other tasks are waiting on the success of that one.
(And you should look at the actual job.err file to see why the task failed)
Sorry, my bad!
I re-read above, and you said that’s the scheduler log.
“yes i’m seeing this in the job.err file.”
So I presume you mean you see the DB error there? If so, can you post that?
ah ok cool no worries.
here’s the last 20 lines of the job.err file, thanks
> cylc log u-cy032/run2//19780901T0000Z/compile_lfric_atm -f e|tail -20
Traceback (most recent call last):
File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/DependencyAnalyser", line 79, in <module>
fortranAnalyser.analyse(Path(args.source), macroDictionary)
File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/analyser.py", line 322, in analyse
self._database.removeFile(sourceFilename)
File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/database.py", line 265, in removeFile
self._database.query(query)
File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/database.py", line 119, in query
cursor.executescript(query)
sqlite3.OperationalError: unable to open database file
make[1]: *** [/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/analyse.mk:54: science/src/um/src/atmosphere/convection/det_rate_mod-6a.t] Error 1
make: *** [Makefile:148: build] Error 2
[FAIL] make -C ${SOURCE_DIRECTORY} -j ${MAKE_THREADS} ${TARGET} \
[FAIL] PROFILE=${PROFILE} \
[FAIL] RDEF_PRECISION=${RDEF_PRECISION} \
[FAIL] FIX_ENUMS=${FIX_ENUMS} \
[FAIL] VERBOSE=${VERBOSE} <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=2
2024-01-25T00:34:41Z CRITICAL - failed/ERR
OK, thanks, that was helpful.
The traceback shows that’s not Cylc code, and therefore it’s not your Cylc workflow DB that’s the problem.
It seems your task is running a program that uses an sqlite DB for its own nefarious reasons!
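If it helps with debugging, here is a small sketch of one well-known way to get "unable to open database file" from Python's standard `sqlite3` module: pointing it at a path whose parent directory does not exist. This is only an illustration of the message, not a diagnosis. In the traceback above the error is raised from `cursor.executescript` rather than at connect time, so the cause in your build may well be different. The helper name `try_open` and the paths are invented for the example.

```python
import os
import sqlite3
import tempfile


def try_open(db_path):
    """Try to open (or create) an SQLite file; return the error text, or None on success."""
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.close()
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)


if __name__ == "__main__":
    # Hypothetical scratch directory, used only for this demonstration.
    base = tempfile.mkdtemp()
    print(try_open(os.path.join(base, "ok.db")))                 # directory exists
    print(try_open(os.path.join(base, "missing", "deps.db")))    # parent directory missing
```

Running a check like this against the directory the build tool writes its database to, from the same compute node the job ran on, might show whether the filesystem was the problem at that moment.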
ah ok thanks for that, in that case i have no idea what’s going on then haha. this workflow ran past this task last week! cheers