Intermittent sqlite3 error?

hi everyone,

just wondering if anyone else has seen cylc failures of the sort…

sqlite3.OperationalError: unable to open database file

… causing intermittent failures of some tasks?

at this stage i don’t even know if this is a cylc error or something to do with the platform i’m running on but i vaguely remember sqlite3 having something to do with cylc? :cowboy_hat_face:

cheers,

jonny

… fyi also now getting this seemingly-related error…

sqlite3.OperationalError: database is locked

Hi @jonnyhtw -

Yes, the Cylc scheduler maintains an SQLite database to hold the persistent state of the workflow (e.g. to enable restart after downtime).

There shouldn’t be any problems with the DB unless some other process is locking it, or there are (e.g.) filesystem issues or a full disk quota.

And even in that case, it should not cause “intermittent failures of some tasks”, because Cylc task jobs do not access the run DB, only the scheduler itself does.
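
For reference, here is a minimal Python sketch of how the "some other process is locking it" case produces a "database is locked" error. It is not Cylc code: it uses a throwaway database in a temporary directory, and the function name `demo_locked` and table `t` are made up for illustration.

```python
import os
import sqlite3
import tempfile


def demo_locked():
    """Return the error message seen by a second writer while a lock is held."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # hypothetical throwaway DB

        # isolation_level=None means autocommit, so transactions are explicit.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("CREATE TABLE t (x INTEGER)")
        writer.execute("BEGIN EXCLUSIVE")  # take and hold the lock
        writer.execute("INSERT INTO t VALUES (1)")

        # A second connection stands in for "some other process".
        # timeout is how long it waits for the lock before giving up.
        other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
        message = None
        try:
            other.execute("INSERT INTO t VALUES (2)")
        except sqlite3.OperationalError as exc:
            message = str(exc)
        finally:
            writer.execute("ROLLBACK")  # release the lock
            writer.close()
            other.close()
        return message


print(demo_locked())
```

The `timeout` argument to `sqlite3.connect` sets how long a connection waits for a lock before raising, so a short timeout combined with a busy or slow filesystem would be one plausible way for the error to appear only intermittently.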

So can you give any more detail on the sequence of events? A good chunk of the scheduler log either side of that error might be helpful.

(You are seeing the error in the log file, I presume?)


yes i’m seeing this in the job.err file.

here’s some scheduler log for submission 3 of the failing compile_lfric_atm task (which has run successfully many times before)…

2024-01-25T00:26:20Z INFO - [19780901T0000Z/compile_lfric_atm failed job:02 flows:1] => waiting
2024-01-25T00:26:20Z INFO - Command actioned: force_trigger_tasks(['19780901T0000Z/compile_lfric_atm'], flow=['all'], flow_wait=False, flow_descr=None)
2024-01-25T00:26:20Z INFO - [19780901T0000Z/compile_lfric_atm waiting job:03 flows:1] => preparing
2024-01-25T00:26:20Z WARNING - stall timer stopped
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm preparing job:03 flows:1] submitted to maui-xc-slurm:slurm[3728135]
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm preparing job:03 flows:1] => submitted
2024-01-25T00:26:23Z INFO - [19780901T0000Z/compile_lfric_atm submitted job:03 flows:1] health: submission timeout=P1D, polling intervals=PT15M,...
2024-01-25T00:28:21Z INFO - [19780901T0000Z/compile_lfric_atm submitted job:03 flows:1] => running
2024-01-25T00:28:21Z INFO - [19780901T0000Z/compile_lfric_atm running job:03 flows:1] health: execution timeout=None, polling intervals=12*PT15M,PT1M,PT2M,PT7M,...
2024-01-25T00:30:18Z INFO - [19780901T0000Z/compile_lfric_atm running job:03 flows:1] => failed
2024-01-25T00:30:18Z WARNING - [19780901T0000Z/compile_lfric_atm failed job:03 flows:1] did not complete required outputs: ['succeeded']
2024-01-25T00:30:19Z ERROR - Incomplete tasks:
      * 19780901T0000Z/compile_lfric_atm did not complete required outputs: ['succeeded']
2024-01-25T00:30:19Z WARNING - Partially satisfied prerequisites:
      * 19780901T0000Z/postproc is waiting on ['19780901T0000Z/lfnctopp:succeeded']
      * 19780901T0000Z/coupled is waiting on ['19780901T0000Z/compile_lfric_atm:succeeded']
      * 19780901T0000Z/lfnctopp is waiting on ['19780901T0000Z/coupled:succeeded']
2024-01-25T00:30:19Z CRITICAL - Workflow stalled

also i’m nowhere near quota

Are you sure that’s a job.err file? It’s a scheduler log file!

Also, I can’t see any DB errors there.

And it just shows that you manually triggered a task, and the task failed. Which (not surprisingly) caused the workflow to stall, because other tasks are waiting on the success of that one.

(And you should look at the actual job.err file to see why the task failed)

Sorry, my bad!

I re-read above, and you said that’s the scheduler log.

But you also said: “yes i’m seeing this in the job.err file.”

So I presume you mean you see the DB error there? If so, can you post that?

ah ok cool no worries.

here’s the last 20 lines of the job.err file, thanks

> cylc log u-cy032/run2//19780901T0000Z/compile_lfric_atm -f e|tail -20
Traceback (most recent call last):
  File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/DependencyAnalyser", line 79, in <module>
    fortranAnalyser.analyse(Path(args.source), macroDictionary)
  File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/analyser.py", line 322, in analyse
    self._database.removeFile(sourceFilename)
  File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/database.py", line 265, in removeFile
    self._database.query(query)
  File "/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/tools/dependerator/database.py", line 119, in query
    cursor.executescript(query)
sqlite3.OperationalError: unable to open database file
make[1]: *** [/scale_wlg_nobackup/filesets/nobackup/niwa00013/williamsjh/cylc-run/u-cy032/run2/share/fcm_make_lfric/extract/lfric/infrastructure/build/analyse.mk:54: science/src/um/src/atmosphere/convection/det_rate_mod-6a.t] Error 1
make: *** [Makefile:148: build] Error 2
[FAIL] make -C ${SOURCE_DIRECTORY} -j ${MAKE_THREADS} ${TARGET} \
[FAIL] PROFILE=${PROFILE} \
[FAIL] RDEF_PRECISION=${RDEF_PRECISION} \
[FAIL] FIX_ENUMS=${FIX_ENUMS} \
[FAIL] VERBOSE=${VERBOSE} <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=2
2024-01-25T00:34:41Z CRITICAL - failed/ERR

OK, thanks, that was helpful.

The traceback shows that’s not Cylc code, and therefore it’s not your Cylc workflow DB that’s the problem.

It seems your task is running a program (the LFRic build’s DependencyAnalyser, going by the traceback) that uses an sqlite DB for its own nefarious reasons!
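
If it helps with debugging, here is a minimal Python sketch that reproduces the "unable to open database file" message outside of Cylc and the build system. It assumes nothing about how the dependency analyser opens its DB. The helper names `try_open` and `demo_cannot_open` and the path `no_such_dir/demo.db` are made up for illustration.

```python
import os
import sqlite3
import tempfile


def try_open(db_path):
    """Try to open an SQLite DB; return the error message, or None on success."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.OperationalError as exc:
        return str(exc)
    conn.close()
    return None


def demo_cannot_open():
    """Point sqlite3 at a file whose parent directory does not exist."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "no_such_dir", "demo.db")
        return try_open(missing)


print(demo_cannot_open())
```

Note that in your traceback the error is raised from `cursor.executescript(query)` rather than at connection time, so this only shows one simple way to get the same message (a path sqlite3 cannot create or reach), not necessarily the cause in your build.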


ah ok thanks for that, in that case i have no idea what’s going on then haha. this workflow ran past this task last week! cheers