Critical disk I/O error

I was running four suites, which went through several cycles and then stopped with what looked like a submission error. When I try to play the suites again, I get an error that looks like:

2022-04-30T16:52:31Z INFO - Workflow: imsi-cylc-antwater-04/run1
2022-04-30T16:52:31Z ERROR - disk I/O error
    Traceback (most recent call last):
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler.py", line 653, in start
        await self.configure()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler.py", line 425, in configure
        self.is_restart = self.workflow_db_mgr.restart_check()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_db_mgr.py", line 639, in restart_check
        pri_dao.vacuum()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/rundb.py", line 987, in vacuum
        return self.connect().execute("VACUUM")
    sqlite3.OperationalError: disk I/O error
2022-04-30T16:52:31Z CRITICAL - Workflow shutting down - disk I/O error
2022-04-30T16:52:31Z INFO - DONE

About 10 minutes later I tried again, and 3 of the 4 runs restarted, but the fourth one keeps hitting this error. As far as I know there are no reports of disk issues, and other (non-Cylc) jobs are not experiencing problems. I checked the db file under log and it reports no integrity issues. Are there any thoughts or suggestions on how to proceed, or how to recover if this comes up again? Thanks!

After restarting some of the flows, I’ve noticed more issues. For example the gui/tui both report suites are stopped, but there are jobs actively running in the queues, with the job internals progressing as expected. If I cylc play the suite (no errors reported), it still reports as being stopped when I launch the tui immediately afterwards. Consistent with this, I guess, the workflow log contains messages like:

 File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/rundb.py", line 462, in _execute_stmt
        self.conn.executemany(stmt, stmt_args_list)
    sqlite3.OperationalError: disk I/O error
2022-04-30T18:01:40Z CRITICAL - Workflow shutting down - disk I/O error
2022-04-30T18:01:40Z WARNING - Orphaned task jobs:
    * 10030101T0000Z/clean_back_end (running)
    * 10040101T0000Z/transfer_data (running)
    * 10050101T0000Z/model_run (submitted)

I’ve not seen that error, sorry. Presumably there is a problem with the sqlite DB, but it’s hard to say what from here.

Can you try accessing the DB with the sqlite3 command line tool? Note that log/db is the “public” DB, which is provided for 3rd party tools (but is also written by the scheduler, of course). The “private” DB, which is the more important one for the scheduler itself, is .service/db in the workflow run directory.
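If you would rather script the check, here is a minimal Python sketch using the standard-library sqlite3 module. It opens a database read-only and runs SQLite's `PRAGMA integrity_check`. The function name `check_db` is an assumption for illustration and is not part of Cylc.

```python
import sqlite3
from pathlib import Path


def check_db(db_path):
    """Return the rows reported by SQLite's integrity check for db_path.

    A healthy database yields a single row containing "ok".
    """
    # Open read-only via a URI so the check cannot modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

You would call it once for each DB, e.g. `check_db(".service/db")` and `check_db("log/db")` from the run directory. Bear in mind that a clean result only says the file is internally consistent; it does not show that the scheduler can currently write to it.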

On your second point, yes, if the scheduler shuts down with a problem and reports orphaned tasks, those jobs will be left running. They will record their completion status to disk (job status file) for the scheduler to see if it gets started up again.

Thanks. As far as I can tell, the db itself is not corrupted. For example:

ncs001@hpcr4-vis1:~/cylc-run/imsi-cylc-pictrl-01/run1/.service$ sqlite3 db 'pragma integrity_check'
ok

I think possibly what happened is that the job logs filled up my disk quota in ~/. The original messages did not complain about quota issues, but perhaps that was the underlying issue. When I tried again today, I got clearer quota exceeded messages, and after cleaning up, things seem to be working again.

I think possibly what happened is that the job logs filled up my disk quota in ~/

I expect this was the cause; we see this occasionally with Cylc 7. A disk I/O error usually means quota exceeded or filesystem outage.
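To see what is taking up space under a run directory, a rough Python sketch is below. It sums file sizes for each top-level entry, so directories such as log, share and work show up separately. The function name `usage_by_subdir` is an assumption for illustration. It reports apparent file sizes, which may differ from what your site's quota accounting counts.

```python
import os


def usage_by_subdir(run_dir):
    """Map each top-level entry of run_dir to the total bytes of files below it.

    Top-level symlinks are skipped, so relocated directories are not counted.
    """
    totals = {}
    for entry in os.scandir(run_dir):
        if entry.is_dir(follow_symlinks=False):
            size = 0
            # os.walk does not descend into symlinked directories by default.
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        size += os.lstat(os.path.join(root, name)).st_size
                    except OSError:
                        # File vanished or is unreadable; skip it.
                        pass
            totals[entry.name] = size
        elif entry.is_file(follow_symlinks=False):
            totals[entry.name] = entry.stat(follow_symlinks=False).st_size
    return totals
```

For example, `usage_by_subdir(os.path.expanduser("~/cylc-run/<workflow>/run1"))` with your own workflow name filled in would show whether the log directory dominates.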

Cylc 8 has a new “symlink dirs” feature which allows you to relocate the different bits of the workflow’s “run directory” to other locations, e.g. you can move the job logs to a different filesystem where space is less of an issue. (Before Cylc 8 this functionality was provided by the Rose “root-dir” configuration.)

After restarting some of the flows, I’ve noticed more issues. For example the gui/tui both report suites are stopped, but there are jobs actively running in the queues

This is to be expected: “stopped” means that the Cylc “scheduler” is not running. Any jobs which were running/submitted at the time of the crash will continue running; when the workflow is restarted, Cylc will poll their status files to catch up with their progress.

If I cylc play the suite (no errors reported), it still reports as being stopped

This is because the workflow is successfully “starting up” but then instantly crashing again when it hits DB/IO issues.

@swartn - thanks for the update. Yes, disk quota exceeded probably explains it. The error message is not very helpful, but unfortunately I don’t think we have any control over that - it’ll be in the sqlite library.

Thanks for the responses. After cleaning up the disks, I got on my way again. Unfortunately, after running well for a few days, the suites all went down overnight for an unknown reason. I now have a different error on trying to restart. This time it does not look like a quota issue. The error is again fairly cryptic, but relates to some kind of JSON parsing. I’m not sure how to resolve this and, perhaps more importantly, how to avoid it:

(imsi-cylc) ncs001@hpcr4-vis1:~/cylc-run$ cylc play imsi-cylc-pictrl-01
Traceback (most recent call last):
  File "/home/ncs001/.conda/envs/imsi-cylc/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scripts/cylc.py", line 665, in main
    execute_cmd(command, *cmd_args)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scripts/cylc.py", line 285, in execute_cmd
    COMMANDS[cmd].resolve()(*args)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/terminal.py", line 226, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler_cli.py", line 400, in play
    return scheduler_cli(options, id_)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler_cli.py", line 292, in scheduler_cli
    detect_old_contact_file(workflow_id)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_files.py", line 498, in detect_old_contact_file
    process_is_running = _is_process_running(old_host, old_pid, old_cmd)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_files.py", line 446, in _is_process_running
    process = json.loads(out)[0]
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Hello,

That’s a cryptic one! Thanks for reporting.

Explanation of the traceback (feel free to skip)

In the event that a workflow is not contactable, Cylc has some logic to check whether a workflow is still running. This is in order to distinguish between a workflow that has crashed for some reason and a network issue.

To do this Cylc SSH’es to the host where the scheduler was running (specified in the contact file) and runs this command:

ssh <host> cylc psutil <<< '[["Process", 123]]'
# with 123 replaced with the PID of the scheduler
# (specified in the contact file)

This is a platform-portable process lookup that should either return valid JSON or produce a non-zero (error) exit code.

Your traceback suggests that the command returned a zero (success) exit code, but its output could not be parsed as JSON.

Likely cause

The most likely cause is that one of the Bash config files on the scheduler host is writing something to stdout, e.g. via an echo statement.

The best fix is to remove the command, redirect its output or fix the file so it only runs in interactive mode (some more details on the ticket below).
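The failure mode can be reproduced in a few lines of Python. This is an illustrative sketch, not Cylc's code: the banner text and the sample payload are made up. The only point is that any non-JSON text printed ahead of the real output makes `json.loads` fail in the way the traceback shows.

```python
import json

# What the process lookup is expected to print: a JSON list.
# The content here is a made-up sample, not real `cylc psutil` output.
clean_output = '[{"pid": 123}]'

# The same output when a shell startup file echoes a banner first
# (hypothetical banner text).
noisy_output = "Welcome to the cluster!\n" + clean_output

process = json.loads(clean_output)[0]  # parses fine

try:
    json.loads(noisy_output)
    error_message = None
except json.JSONDecodeError as exc:
    # The parser gives up at the very first character of the banner.
    error_message = str(exc)

print(error_message)
# prints: Expecting value: line 1 column 1 (char 0)
```

That message matches the last line of the traceback above, which is why stray output from a shell startup file is the first thing to rule out.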

Remediation

We have built protections into other Cylc commands to make them more robust to such shell configurations. I’ve opened a ticket to add the same protection to this command, and we’ll try to get this into a Cylc release shortly.