Handling large db sizes with rose-bush

For suites that run for a long time, the db file can get quite large:

[jacinta@host log]$ ls -lah db*
-rwxr-xr-x. 1 user user  96M Oct 15 09:03 db
-rwxr-xr-x. 1 user user 268K Oct 15 07:17 db-journal

even though we archive the cycle points away.

[jacinta@host log]$ ls job/
20201014T0000Z  20201014T0600Z  20201014T1200Z  20201014T1800Z  20201015T0000Z  20201015T0600Z

After a point, rose-bush chokes on the large db size and fails to list either task jobs or cycles.

Is there any documentation on how to clean up or otherwise trim the contents of the db, both to make the file smaller and to make issues like this easier to deal with? If there isn’t documentation at that level, a schema and a list of recommendations would be equally appreciated.

Thanks

Hi,

96MB isn’t that big for a Cylc DB. When you say it fails to list jobs or cycles, is it just taking a really long time or is it crashing? The logs might contain some useful diagnostics here.

Looks like I jumped to the wrong conclusion about this particular suite; I was assuming it had the same issue that I see in another suite (which has a much bigger db). Deleting and re-syncing the cylc-run directory (our default cylc-run locations are not web accessible) for the previous suite solved the problem.

For the problem suite, size does seem to be the problem.

[jacinta@host problem_suite]$ ls -lah log/db
-rwxr-xr-x. 1 user user 1.8G Oct 15 23:46 log/db

This suite runs every 2 minutes, so as you can imagine the list of cycles and tasks has built up. Rose-bush knows how many entries there are in the db, but fails to display anything. As per https://ibb.co/KyS30pD there are 78,467 runs in the database.

Job run data (files and directories) is only kept for 1 hour (30 files), so I don’t think it’s balking at the file system. Deleting and re-syncing hasn’t helped.

Any solution I employ can be performed on the rsync-ed directory, so I am not concerned about losing required suite data.


Yes, 1.8GB is a little on the large side.

There’s no documentation for housekeeping the DB that I’m aware of so I’ll take a stab at it here…

The database has three continuously growing tables; these will be the ones taking up the space (see the row-count sketch after the list):

  • task_events
  • task_jobs
  • task_states
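
A quick way to confirm that these are where the rows are piling up is to count them from the sqlite prompt (a sketch; the table names are as per the schema further down):

-- row counts for the three continuously growing tables
SELECT 'task_events' AS tbl, COUNT(*) AS n FROM task_events
UNION ALL
SELECT 'task_jobs', COUNT(*) FROM task_jobs
UNION ALL
SELECT 'task_states', COUNT(*) FROM task_states;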

For amusement value there is a command you can use to poke at the DB:

$ sqlite3_analyzer path/to/db

You can introspect the database structure from the sqlite prompt:

$ sqlite3 path/to/db
sqlite> .schema

From that introspection, here’s the structure of the big tables:

CREATE TABLE task_events(name TEXT, cycle TEXT, time TEXT, submit_num INTEGER, event TEXT, message TEXT);
CREATE TABLE task_jobs(cycle TEXT, name TEXT, submit_num INTEGER, is_manual_submit INTEGER, try_num INTEGER, time_submit TEXT, time_submit_exit TEXT, submit_status INTEGER, time_run TEXT, time_run_exit TEXT, run_signal TEXT, run_status INTEGER, platform_name TEXT, batch_sys_name TEXT, batch_sys_job_id TEXT, PRIMARY KEY(cycle, name, submit_num));
CREATE TABLE task_states(name TEXT, cycle TEXT, flow_label TEXT, time_created TEXT, time_updated TEXT, submit_num INTEGER, status TEXT, PRIMARY KEY(name, cycle, flow_label));

All timestamps are currently stored as ISO8601 strings (which is a contributing factor to that 1.8GB). They are lexicographically sortable, so a simple < is all that’s needed to filter out older entries, e.g.:

SELECT *
FROM task_states
WHERE time_updated < '2020-10-16T08:33:12Z';
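
If you’d rather compute the cutoff than paste in a literal, SQLite’s strftime can build a matching string for you (a sketch assuming the Z-suffixed format used above; the 7-day retention is only an example):

-- everything updated more than 7 days ago, cutoff rendered in the same ISO8601 form
SELECT *
FROM task_states
WHERE time_updated < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days');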

So a simple housekeep script could look something like this:

DELETE
FROM task_events
WHERE time < '<time>';

DELETE
FROM task_jobs
WHERE time_submit < '<time>';

DELETE
FROM task_states
WHERE time_updated < '<time>';
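
One caveat (general SQLite behaviour, nothing Cylc-specific): DELETE on its own won’t normally shrink the file on disk, as the freed pages are just kept for reuse. To actually reclaim the space, follow the deletes with a VACUUM, run against your rsync-ed copy while nothing else has the db open:

-- rebuild the database file so the pages freed by the deletes are actually released
VACUUM;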

In addition to using the SQLite command line, you could also use https://github.com/cylc/cylc-admin/blob/master/docs/howto/inspect_suite_database.ipynb, which is designed to provide a bit of a framework for investigating the database.