Cylc Flow 8.1.0 tutorial database incompatible or corrupted

Hello,

I am very new to the Cylc community.

Here is my tutorial issue:

https://cylc.github.io/cylc-doc/latest/html/tutorial/runtime/introduction.html#admonition-7

$ cylc version
8.1.0

$ cylc get-resources tutorial/runtime-introduction
INFO - Extracting tutorial/runtime-tutorial to /home/teemsis/cylc-src/runtime-introduction

$ cd /home/teemsis/cylc-src/runtime-introduction
$ cylc validate --debug .
2023-01-20T16:05:46Z DEBUG - Reading file /home/teemsis/cylc-src/runtime-introduction/flow.cylc
2023-01-20T16:05:46Z DEBUG - Processing with Jinja2
2023-01-20T16:05:47Z DEBUG - Setting Jinja2 template variables:
    + CYLC_TEMPLATE_VARS={'CYLC_VERSION': '8.1.0', 'CYLC_TEMPLATE_VARS': {...}}
    + CYLC_VERSION=8.1.0
2023-01-20T16:05:47Z DEBUG - Expanding [runtime] namespace lists and parameters
2023-01-20T16:05:47Z DEBUG - Parsing the runtime namespace hierarchy
2023-01-20T16:05:47Z DEBUG - Parsing [special tasks]
2023-01-20T16:05:47Z DEBUG - Parsing the dependency graph
Instantiating tasks to check trigger expressions
2023-01-20T16:05:47Z DEBUG - Loading site/user config files
  + 20000101T0000Z/get_observations_camborne ok
  + 20000101T0000Z/consolidate_observations ok
  + 20000101T0000Z/get_observations_heathrow ok
  + 20000101T0000Z/get_observations_shetland ok
  + 20000101T0000Z/get_observations_aldergrove ok
  + 20000101T0000Z/forecast ok
  + 20000101T0000Z/get_rainfall ok
  + 20000101T0000Z/process_exeter ok
Valid for cylc-8.1.0

$ cylc install
INSTALLED runtime-introduction/run1 from /home/teemsis/cylc-run/runtime-introduction

$ cylc play --debug runtime-introduction
2023-01-20T16:21:48Z DEBUG - Loading site/user config files

 ▪ ■  Cylc Workflow Engine 8.1.0
  ██   Copyright (C) 2008-2023 NIWA
  ▝▘    & British Crown (Met Office) & Contributors

  2023-01-20T16:21:48Z DEBUG - /home/teemsis/cylc-run/runtime-introduction/run4/log/scheduler: directory created
  2023-01-20T16:21:48Z DEBUG - /home/teemsis/cylc-run/runtime-introduction/run4/log/job: directory created
  2023-01-20T16:21:48Z DEBUG - /home/teemsis/cylc-run/runtime-introduction/run4/log/config: directory created
  2023-01-20T16:21:48Z DEBUG - /home/teemsis/cylc-run/runtime-introduction/run4/share: directory created
  2023-01-20T16:21:48Z DEBUG - /home/teemsis/cylc-run/runtime-introduction/run4/work: directory created
  2023-01-20T16:21:48Z INFO - Extracting job.sh to /home/teemsis/cylc-run/runtime-introduction/run4/.service/etc/job.sh

Then, the tutorial says:

The tasks will start to run - you should see them going through the *waiting*, *running* and *succeeded* states. The *preparing* and *submitted* states may be too quick to notice.

But my job seems to be in the stopped state (as seen with cylc tui), and when I try to play it again, here is my issue:

Traceback (most recent call last):
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/workflow_db_mgr.py", line 771, in _get_last_run_ver
    last_run_ver = self._get_last_run_version(pri_dao)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/workflow_db_mgr.py", line 697, in _get_last_run_version
    return pri_dao.connect().execute(
TypeError: 'NoneType' object is not subscriptable (key 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/mambaforge-pypy3/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scripts/cylc.py", line 653, in main
    execute_cmd(command, *cmd_args)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scripts/cylc.py", line 286, in execute_cmd
    entry_point.resolve()(*args)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/terminal.py", line 232, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scheduler_cli.py", line 624, in play
    return _play(parser, options, id_)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scheduler_cli.py", line 634, in _play
    return scheduler_cli(options, id_)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scheduler_cli.py", line 378, in scheduler_cli
    if not _version_check(
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/scheduler_cli.py", line 461, in _version_check
    last_run_version = wdbm.check_workflow_db_compatibility()
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/workflow_db_mgr.py", line 799, in check_workflow_db_compatibility
    last_run_ver = self._get_last_run_ver(pri_dao)
  File "/opt/mambaforge-pypy3/lib/pypy3.9/site-packages/cylc/flow/workflow_db_mgr.py", line 773, in _get_last_run_ver
    raise ServiceFileError(f"{INCOMPAT_MSG}, or is corrupted.")
cylc.flow.exceptions.ServiceFileError: Workflow database is incompatible with Cylc 8.1.0, or is corrupted.

Could you help me understand what the error “cylc.flow.exceptions.ServiceFileError: Workflow database is incompatible with Cylc 8.1.0, or is corrupted.” means, please?

Best,
Teemsis

Hi @Teemsis

I’ve just run the same commands as you, with a fresh install of cylc-8.1.0 from conda-forge, and the tutorial workflow ran correctly. It runs to completion pretty quickly (~30 seconds) and then the scheduler shuts down. If I try to play it again without installing a new instance, the scheduler shuts down again almost immediately because the workflow already completed.

The error you got suggests that you somehow tried to re-play it with a different version of Cylc. Is that possible? Do you have multiple versions installed on your system?

To help us debug this, please try the following:

  • install a new copy of the workflow (run cylc install again, to create run2 or later of the source workflow)
  • play it with the --no-detach option so that the scheduler logs messages to your terminal as it runs. Does the workflow run to completion, or does it report errors?

Note: if you don’t use cylc play --no-detach, you can still view the scheduler log afterwards using the cylc log command. Example commands are sketched below.
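
For example, the sequence might look like this (run numbers in the comments may differ on your system):

$ cylc install                                # creates run2 (or later)
$ cylc play --no-detach runtime-introduction  # scheduler log messages appear in your terminal
$ cylc log runtime-introduction               # view the scheduler log afterwards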

Hi @hilary.j.oliver !

Many thanks for your suggestions.

First, I am sure Cylc 8.1.0 is my only instance of Cylc.
This is my first time using it and it is managed by Mamba/conda-forge (mamba install -c conda-forge cylc-flow=8.1.0).

Then, the following commands work pretty well:

  1. “Reset” my Cylc files:
$ rm -rf ~/cylc-src/ ~/cylc-run/
  2. Get tutorial/runtime-introduction:
$ cylc get-resources tutorial/runtime-introduction
$ cd ~/cylc-src/runtime-introduction
$ cylc validate .
  3. Run jobs with --no-detach:
$ cylc install
$ cylc play --no-detach runtime-introduction  # run1
$ cylc install
$ cylc play --no-detach runtime-introduction  # run2
$ cylc log runtime-introduction/run1
$ cylc log runtime-introduction/run2

Do you have any idea why it works with cylc play --no-detach but not without this option?

I’m certainly using it wrong :slight_smile:

(There is no stopped state for jobs, only the workflow.)

Are you able to reproduce the original error if you play the workflow again without --no-detach? It looks to me like the database was corrupted, which may have been a one-off issue.

Hi,

I’ve just spotted from your traceback that you’re running Cylc with PyPy:

lib/pypy3.9/site-packages

Was PyPy already installed in your environment when you mamba-installed cylc-flow?

Unfortunately, we don’t support or test with PyPy at the moment, so we can’t guarantee that Cylc will work with it. I did try supporting PyPy a while back but encountered issues with Cylc’s dependencies. Note this doesn’t prevent you from using PyPy to run jobs within your Cylc workflow, but the Cylc scheduler itself is not known to work with PyPy.
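
If you want to double-check which interpreter Cylc is running under, one quick check (just standard shell commands, nothing Cylc-specific) is to inspect the cylc entry point on your PATH:

$ which cylc                 # where is the cylc executable?
$ head -1 "$(which cylc)"    # its shebang line names the interpreter (e.g. pypy3 vs python3)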

In some cases it is possible to see the “workflow database is incompatible” message if the scheduler crashes in an unexpected way that prevents database writes from completing. The evidence of this may still be around in one of the ~/cylc-run/<workflow-id>/log/scheduler/log files.

Sorry you’ve encountered this problem, not the best start with Cylc. It normally works, honest!

Yep, the workflow is stopped, you’re right :slight_smile:

Amazing: if I run with --no-detach the first time, then I am able to run in live mode :sweat_smile:

Maybe a hint? What do you think?

Ok, I did it again without pypy3:

# Download the latest version of Mambaforge-Linux-x86_64.sh using wget
$ wget https://github.com/conda-forge/miniforge/releases/latest/download/\
Mambaforge-Linux-x86_64.sh

# Run the installer script, using -b to skip the license prompt
# and -p to specify the installation directory
$ /bin/bash Mambaforge-Linux-x86_64.sh -b -p /opt/mambaforge/

# Use mamba to install cylc-flow
# Install the browser-GUI (optional)
# Install Rose support (optional)
$ /opt/mambaforge/bin/mamba install -c conda-forge -y \
    python=3.9 \
    cylc-flow \
    cylc-uiserver \
    cylc-rose metomi-rose

I reran from the beginning and found the log file you mentioned:

$ cat ~/cylc-run/runtime-introduction/run1/log/scheduler/log                                                  
2023-01-24T16:52:12Z INFO - Workflow: runtime-introduction/run1
2023-01-24T16:52:13Z INFO - Scheduler: url=tcp://xxxx:43020 pid=8
2023-01-24T16:52:13Z INFO - Workflow publisher: url=tcp://xxxx:43035
2023-01-24T16:52:13Z INFO - Run: (re)start number=1, log rollover=1
2023-01-24T16:52:13Z INFO - Cylc version: 8.1.0
2023-01-24T16:52:13Z INFO - Run mode: live
2023-01-24T16:52:13Z INFO - Initial point: 20000101T0000Z
2023-01-24T16:52:13Z INFO - Final point: 20000101T0600Z
2023-01-24T16:52:13Z INFO - Cold start from 20000101T0000Z
2023-01-24T16:52:13Z INFO - New flow: 1 (original flow from 20000101T0000Z) 2023-01-24 16:52:13
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_aldergrove waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_aldergrove waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_aldergrove waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_aldergrove waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_aldergrove waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_aldergrove waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_camborne waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_camborne waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_camborne waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_camborne waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_camborne waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_camborne waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_heathrow waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_heathrow waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_heathrow waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_heathrow waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_heathrow waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0600Z/get_observations_heathrow waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_shetland waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0000Z/get_observations_shetland waiting job:00 flows:1] => waiting(queued)
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_shetland waiting(runahead) job:00 flows:1] => waiting
2023-01-24T16:52:13Z INFO - [20000101T0300Z/get_observations_shetland waiting job:00 flows:1] => waiting(queued)

It seems to be blocked in the waiting state?

But, as I said to @MetRonnie, if my first job run1 is launched with --no-detach, I do not have any database issue when I play it again in live mode.

And if I launch another job run2 in live mode, I get the database issue. I need to launch it first with --no-detach.

Hi,

Are you saying that when you start the workflow with --no-detach it works, but when you don’t, you get a database issue?

To help us debug, it would be useful to know which Cylc commands you ran that resulted in this.

I think we might be talking at cross purposes. In Cylc terminology:

  • cylc play starts a workflow.
  • The workflow submits jobs.
  • Normally the workflow runs in “detached” mode which means the workflow keeps running when you close your terminal.
  • The --no-detach option makes the workflow run in your terminal.
  • The “live” mode is different from this: you can change the run mode to “simulation” to get Cylc to simulate jobs rather than actually running them, which helps with workflow design (see the sketch below).
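
For example (a sketch; I believe cylc play accepts a --mode option in Cylc 8, but do check cylc play --help on your version):

$ cylc play runtime-introduction                    # detached (default): keeps running after you close your terminal
$ cylc play --no-detach runtime-introduction        # attached: runs in your terminal until the workflow finishes
$ cylc play --mode=simulation runtime-introduction  # simulate jobs instead of actually running them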

Just to add to that:

The run1, run2 etc. are installations, not jobs:

  • Installations are created when you run cylc install
  • Jobs are submitted for each task in the graph when you run cylc play

And just to add a little bit more:

The run1, run2 etc. are separate installations of your workflow, created each time you run cylc install. This is how Cylc allows you to easily run multiple instances of a whole workflow, even at the same time.
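
For example (a sketch):

$ cylc install                         # creates runtime-introduction/run1
$ cylc install                         # creates runtime-introduction/run2
$ cylc play runtime-introduction/run1  # each installation plays independently,
$ cylc play runtime-introduction/run2  # even at the same time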

Thank you again everyone for your help.

Today I’ve found a workaround.

After the install step, I run cylc clean runtime-introduction/run1 --rm .service/db and my problem is solved.
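
In case it helps others, the full sequence looked roughly like this (a sketch; the run number may differ):

$ cylc install                                           # creates e.g. run1
$ cylc clean runtime-introduction/run1 --rm .service/db  # remove the stale database
$ cylc play runtime-introduction/run1                    # now starts cleanly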

That’s interesting. The .service/db file is the workflow database and is created when you cylc play a workflow installation for the first time. It should not be present when you install a workflow.

Can I check the commands you’re running are:

$ cylc get-resources tutorial
$ cylc install tutorial/cylc-forecasting-workflow
$ cylc play tutorial/cylc-forecasting-workflow

Thanks to your message, I now know what’s going on.
I run Cylc through a Podman container, so the sub-process that should run the tasks is killed just after the play command is done.

So, the db is created but left in a “dirty” state.

For sure, I must run cylc play --no-detach in the background to definitively solve my problem, like a classic daemon.
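
A minimal sketch of what I mean (my-cylc-image is a hypothetical placeholder): keep cylc play --no-detach as the container’s long-running foreground process, so the scheduler is not killed when the container’s initial command exits.

$ podman run --rm my-cylc-image \
      cylc play --no-detach runtime-introduction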