Critical disk I/O error

swartn · April 30, 2022, 5:06pm

I was running four suites, which went through several cycles and then stopped with what looked like a submission error. When I try to play the suites again, I get an error that looks like:

2022-04-30T16:52:31Z INFO - Workflow: imsi-cylc-antwater-04/run1
2022-04-30T16:52:31Z ERROR - disk I/O error
    Traceback (most recent call last):
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler.py", line 653, in start
        await self.configure()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler.py", line 425, in configure
        self.is_restart = self.workflow_db_mgr.restart_check()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_db_mgr.py", line 639, in restart_check
        pri_dao.vacuum()
      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/rundb.py", line 987, in vacuum
        return self.connect().execute("VACUUM")
    sqlite3.OperationalError: disk I/O error
2022-04-30T16:52:31Z CRITICAL - Workflow shutting down - disk I/O error
2022-04-30T16:52:31Z INFO - DONE

About 10 minutes later I tried again, and 3 out of 4 runs restarted, but the fourth one keeps hitting this error. As far as I know, there are no reports of disk issues, and no other (non-Cylc) jobs are experiencing problems. I checked the db file under log and it reports no integrity issues. Are there any thoughts or suggestions on how to proceed, or how to recover if this comes up again? Thanks!

swartn · April 30, 2022, 6:05pm

After restarting some of the flows, I’ve noticed more issues. For example the gui/tui both report suites are stopped, but there are jobs actively running in the queues, with the job internals progressing as expected. If I cylc play the suite (no errors reported), it still reports as being stopped when I launch the tui immediately afterwards. I guess consistent with this I see in the workflow log messages like:

      File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/rundb.py", line 462, in _execute_stmt
        self.conn.executemany(stmt, stmt_args_list)
    sqlite3.OperationalError: disk I/O error
2022-04-30T18:01:40Z CRITICAL - Workflow shutting down - disk I/O error
2022-04-30T18:01:40Z WARNING - Orphaned task jobs:
    * 10030101T0000Z/clean_back_end (running)
    * 10040101T0000Z/transfer_data (running)
    * 10050101T0000Z/model_run (submitted)

hilary.j.oliver · May 1, 2022, 10:42pm

I’ve not seen that error, sorry. Presumably there is a problem with the sqlite DB, but it’s hard to say what from here.

Can you try accessing the DB with the sqlite3 command line tool? Note that log/db is the “public” DB, which is provided for 3rd party tools (but is also written by the scheduler, of course). The “private” DB, which is the more important one for the scheduler itself is .service/db in the workflow run directory.
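For anyone who would rather run the same check from Python than from the sqlite3 command line tool, here is a minimal sketch. The function name `check_db` and the example path are illustrative assumptions, not part of Cylc. It opens the database read-only, so it only tests that the file is readable and internally consistent; it cannot show whether writes to that filesystem would succeed.

```python
import sqlite3
from pathlib import Path


def check_db(db_path):
    """Run SQLite's integrity check on a database opened read-only.

    Returns the rows reported by the check; a healthy DB gives ["ok"].
    """
    # mode=ro makes sure the check cannot modify (or create) the file.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical location: substitute your own workflow run directory.
    private_db = Path.home() / "cylc-run" / "my-workflow" / "run1" / ".service" / "db"
    if private_db.exists():
        print(check_db(private_db))
```

It is probably safest to run this only while the scheduler is stopped, so the check does not compete with the scheduler for the database.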

On your second point, yes, if the scheduler shuts down with a problem and reports orphaned tasks, those jobs will be left running. They will record their completion status to disk (job status file) for the scheduler to see if it gets started up again.

swartn · May 2, 2022, 3:20pm

Thanks. As far as I can tell, the db itself is not corrupted. For example:

ncs001@hpcr4-vis1:~/cylc-run/imsi-cylc-pictrl-01/run1/.service$ sqlite3 db 'pragma integrity_check'
ok

I think possibly what happened is that the job logs filled up my disk quota in ~/. The original messages did not complain about quota issues, but perhaps it was the underlying issue. When I tried again today, I got clearer quota exceeded messages, and after cleaning up things seem to be working again.

oliver.sanders · May 3, 2022, 8:59am

I think possibly what happened is that the job logs filled up my disk quota in ~/

I expect this was the cause; we see this occasionally with Cylc 7. A disk I/O error usually means quota exceeded or filesystem outage.
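A rough way to see whether job logs are eating the space is to total up a directory and compare it with what is left on the filesystem. The sketch below uses only the Python standard library; the function names and the example path are illustrative assumptions. Note that `shutil.disk_usage` reports filesystem-level space, not a per-user quota, so a site quota tool is still needed to confirm a quota problem. SQLite's VACUUM (the call that failed in the first traceback) rebuilds the database via a temporary copy, so it can need free space of up to roughly twice the database size.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of the regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                continue
    return total


def free_bytes(path):
    """Free space on the filesystem holding path (not a per-user quota)."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Hypothetical location: point this at your own cylc-run directory.
    run_dir = os.path.expanduser("~/cylc-run")
    if os.path.isdir(run_dir):
        print(f"used by {run_dir}: {dir_size_bytes(run_dir) / 1e9:.2f} GB")
        print(f"free on filesystem: {free_bytes(run_dir) / 1e9:.2f} GB")
```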

Cylc 8 has a new “symlink dirs” feature which allows you to relocate the different parts of the workflow’s “run directory” to other locations, e.g. you can move the job logs to a different filesystem where space is less of an issue. (Before Cylc 8, this functionality was provided by the Rose “root-dir” configuration.)

After restarting some of the flows, I’ve noticed more issues. For example the gui/tui both report suites are stopped, but there are jobs actively running in the queues

This is to be expected: “stopped” means that the Cylc “scheduler” is not running. Any jobs which were running or submitted at the time of the crash will continue running. When the workflow is restarted, Cylc will poll their status files to catch up with their progress.

If I cylc play the suite (no errors reported), it still reports as being stopped

This is because the workflow is successfully “starting up” but then instantly crashing again when it hits DB/IO issues.

hilary.j.oliver · May 4, 2022, 1:12am

@swartn - thanks for the update. Yes, disk quota exceeded probably explains it. The error message is not very helpful, but unfortunately I don’t think we have any control over that - it’ll be in the sqlite library.

swartn · May 4, 2022, 4:03pm

Thanks for the responses. After cleaning up the disks, I got on my way again. Unfortunately, after running well for a few days, the suites all went down overnight for an unknown reason. I now have a different error on trying to restart. This time it does not look like a quota issue. The error is again fairly cryptic, but relates to some kind of JSON parsing. I’m not sure how to resolve this or, perhaps more importantly, how to avoid it:

(imsi-cylc) ncs001@hpcr4-vis1:~/cylc-run$ cylc play imsi-cylc-pictrl-01
Traceback (most recent call last):
  File "/home/ncs001/.conda/envs/imsi-cylc/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scripts/cylc.py", line 665, in main
    execute_cmd(command, *cmd_args)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scripts/cylc.py", line 285, in execute_cmd
    COMMANDS[cmd].resolve()(*args)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/terminal.py", line 226, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler_cli.py", line 400, in play
    return scheduler_cli(options, id_)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/scheduler_cli.py", line 292, in scheduler_cli
    detect_old_contact_file(workflow_id)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_files.py", line 498, in detect_old_contact_file
    process_is_running = _is_process_running(old_host, old_pid, old_cmd)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/site-packages/cylc/flow/workflow_files.py", line 446, in _is_process_running
    process = json.loads(out)[0]
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/ncs001/miniconda3/envs/imsi-cylc/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

oliver.sanders · May 4, 2022, 4:44pm

Hello,

That’s a cryptic one! Thanks for reporting.

Explanation of the traceback (feel free to skip)

In the event that a workflow is not contactable, Cylc has some logic to check whether a workflow is still running. This is to distinguish between a workflow that has crashed for some reason and a network issue.

To do this Cylc SSH’es to the host where the scheduler was running (specified in the contact file) and runs this command:

ssh <host> cylc psutil <<< '[["Process", 123]]'
# with 123 replaced with the PID of the scheduler
# (specified in the contact file)

This is a platform-portable process lookup that should either return valid JSON or produce a non-zero (error) exit code.

Your traceback suggests that the command returned a zero (success) exit code; however, its result could not be parsed as JSON.
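The exact message in the traceback can be reproduced with nothing but Python's `json` module, which shows why the failure points at the very first character of the output. In this sketch the payload `[{"pid": 123}]` and the greeting line are made-up stand-ins; the real output of `cylc psutil` may look different.

```python
import json


def first_process(stdout):
    """Mimic the failing line in the traceback: json.loads(out)[0]."""
    return json.loads(stdout)[0]


# Hypothetical command output; the real `cylc psutil` payload may differ.
clean = '[{"pid": 123}]'
# The same output with one extra line printed in front of it.
polluted = "Hello user\n" + clean

print(first_process(clean))

try:
    first_process(polluted)
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```

The decoder gives up at character 0 because the first thing it sees is not the start of a JSON value, which is what you would expect if something was printed ahead of the real output.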

Likely cause

The most likely cause is that one of the Bash config files on the scheduler host is writing something to stdout, e.g. via an echo statement.

The best fix is to remove the command, redirect its output, or fix the file so that the command only runs in interactive mode (some more details on the ticket below).

Remediation

We have built protections into other Cylc commands to make them more robust to such shell configurations. I’ve opened a ticket to add the protection to this command, and we’ll try to get this into a Cylc release shortly.
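To illustrate the general idea of such a protection, here is a minimal sketch of a parser that tolerates extra lines printed before the JSON. This is not Cylc's implementation (the ticket names `cylc.flow.terminal.parse_dirty_json` as the real helper, whose behaviour may differ); the function name `parse_polluted_json` and the sample payload are assumptions made for the example.

```python
import json


def parse_polluted_json(stdout):
    """Return the first JSON value that starts at a line boundary.

    Leading lines that are not part of the JSON (e.g. a greeting echoed
    by a shell config file) are skipped one at a time.
    """
    lines = stdout.splitlines()
    for start in range(len(lines)):
        candidate = "\n".join(lines[start:])
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            # Not valid JSON from this line onwards; drop one more line.
            continue
    raise ValueError("no JSON found in command output")


# Hypothetical polluted output; the real `cylc psutil` payload may differ.
print(parse_polluted_json('Hello user\n[{"pid": 123}]'))
```

This simple version only copes with pollution that comes before the JSON. Text printed after it would still cause a failure, so cleaning up the shell configuration remains the more reliable fix.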

github.com/cylc/cylc-flow

contact: process check should handle dirty JSON

opened 04:41PM - 04 May 22 UTC

oliver-sanders

bug

The process check run to determine whether a scheduler process is still running does not tolerate "dirty" output. This "dirty" output describes when other things write to stdout during the command execution, polluting the command output. This is typically caused by shell configuration files e.g. Bashrc:

```bash
# ~/.bashrc
echo "Hello $USER"  # this will pollute command output
```

The solution is to remove or suppress the polluting command or alternatively to stick it in a branch which only runs in interactive mode so it can't mess with non-interactive commands:

```bash
# ~/.bashrc
if [[ $- == *i* ]]; then
    # this will only run in interactive mode
    echo "Hello $USER"
fi
```

We see this issue quite often so have built in protection against it here - `cylc.flow.terminal.parse_dirty_json`

Somebody (yours truly, sorry) forgot to add this into the process check when it moved over to the `cylc psutil` implementation (`cylc.flow.workflow_files._is_process_running`) :cry:.

**Pull requests welcome!**

jonnyhtw · March 24, 2024, 9:46pm

UPDATE —> quota does seem to be the issue but the tool we use to check for quota limits updates very slowly

hi there,

i am getting the same error after my workflows crashed over the weekend.

i am definitely not over quota so i’m not sure how to proceed.

is there anything else that could be the cause of the issue?

thanks
