"report-timings": odd error after restart

I have a suite running in an HPC environment. After running for a few hundred cycle points, the Cylc server daemon disappeared. I believe it was killed off by a system maintenance script that limits how long user processes can run on login nodes. I restarted the suite successfully and the suite run has continued without issue. However, since the restart, “report-timings” commands (which had worked perfectly before the server was killed) now return an error:

$ cylc report-timings MYSUITE
object of type 'NoneType' has no len()
$

I had worried that I would lose the information about timings from before the restart; but it appears that even timing information since the restart is not accessible.

Any ideas? Thanks much!

Hello,

Thanks for reporting this. Could you try running the command with the --debug option? That may help us home in on the error. Which Cylc version are you using?

Hi, and thanks for the quick response! Cylc version is cylc-flow-7.8.3 (I should have remembered to say that). Output from the debug switch:

$ cylc report-timings --debug MYSUITE
2021-05-21T09:51:25-05:00 DEBUG - Loading site/user global config files
2021-05-21T09:51:25-05:00 DEBUG - Reading file /p/home/cmetzler/.cylc/global.rc
Traceback (most recent call last):
  File "/p/home/cmetzler/cylc_work/cylc/cylc-flow-7.8.3/bin/cylc-report-timings", line 398, in <module>
    main()
  File "/p/home/cmetzler/cylc_work/cylc/cylc-flow-7.8.3/bin/cylc-report-timings", line 123, in main
    row_buf = format_rows(*run_db.select_task_times())
  File "/p/home/cmetzler/cylc_work/cylc/cylc-flow-7.8.3/bin/cylc-report-timings", line 148, in format_rows
    (len(h) for h in header)
  File "/p/home/cmetzler/cylc_work/cylc/cylc-flow-7.8.3/bin/cylc-report-timings", line 147, in <genexpr>
    (max(len(r[i]) for r in rows) for i in range(len(header))),
  File "/p/home/cmetzler/cylc_work/cylc/cylc-flow-7.8.3/bin/cylc-report-timings", line 147, in <genexpr>
    (max(len(r[i]) for r in rows) for i in range(len(header))),
TypeError: object of type 'NoneType' has no len()

Hm, I'm not sure how to get a horizontal scroll bar. The full length of each traceback line is there in the post, but it isn't showing up in my browser.

(Use triple back-quotes rather than <code> - I just edited your post to do that).

I think that error suggests there aren't any job entries in the database, which may mean that the database has become corrupted.

Normally Cylc captures errors, and shutdown requests and SIGINT signals cause it to go through a safe shutdown procedure. Whatever system killed the workflow may have been a little more forceful, perhaps by incredible misfortune cutting off a database transaction midway through completion? That isn't supposed to be possible, though (see "Atomic Commit In SQLite" in the SQLite documentation).

You could try checking the database manually to confirm. This command should list all of the job data that report-timings is trying to use:

$ sqlite3 ~/cylc-run/<workflow>/.service/db 'SELECT * FROM task_jobs'

I would have thought that if the database was corrupted, then the suite wouldn’t proceed normally – when in fact it ran through several hundred more cycle points and ended normally.

Anyway, I ran the sql command you suggested and got over 4000 lines of output, each one looking like this (with differing task names/times/etc.):

20140125T0000Z|run_model_advance_q|1|0|1|2021-05-21T17:02:42Z|2021-05-21T17:02:43Z|0|2021-05-21T17:03:29Z|2021-05-21T17:04:11Z||0|localhost|pbs|301686.pbs01

Maybe not, because the report-timings utility reads a DB table that is not read by the scheduler (although it is updated with new entries by the scheduler, so I’m not sure).

Here’s the command line equivalent of exactly what cylc report-timings requests from the DB:

sqlite3 ~/cylc-run/<suite-name>/.service/db \
     "SELECT name, cycle, user_at_host, batch_sys_name,
        time_submit, time_run, time_run_exit
           FROM task_jobs
              WHERE run_status = 0"

i.e., select the task name, cycle point, host, batch system, submit time, run start time, and run exit time for all task jobs whose run_status indicates that the job succeeded.

By experiment, the error you reported above will be generated if any of the selected data elements (probably times) are undefined, which ends up as the None value in the Python code. It’s hard to see how that could happen, because the WHERE clause selects for task jobs that exited successfully, which implies the three time values should be defined.
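
The failure can be reproduced outside Cylc. This is only a minimal sketch, not the actual report-timings source: the column_widths helper and the sample rows are hypothetical, modelled on the expression shown at line 147 of the traceback.

```python
def column_widths(header, rows):
    """Widest cell per column, using the same pattern as the traceback line."""
    return [max(len(r[i]) for r in rows) for i in range(len(header))]


# Hypothetical sample data; a NULL in the DB arrives in Python as None.
header = ("name", "cycle", "time_submit")
good_rows = [("foo", "20140125T0000Z", "2021-05-21T17:02:42Z")]
bad_rows = good_rows + [("bar", "20140125T0000Z", None)]

# Works when every cell is a string.
print(column_widths(header, good_rows))

# Fails as soon as one cell is None.
try:
    column_widths(header, bad_rows)
except TypeError as exc:
    print(exc)
```

A single None anywhere in the selected rows is enough to trigger the TypeError, which would explain why timings from after the restart are also unavailable.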

Could you run this command on your DB and search for undefined values in the result? I think they will be reported as missing columns, like this:

cat|dog|mouse
cat||mouse  # <--- dog missing

If you find any of those, you should be able to fix the DB by deleting those entries from the task_jobs table. And we should of course try to figure out how they could possibly have ended up in the DB in the first place.
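
With over 4000 rows, searching by eye is tedious, so the check could be scripted. The sketch below is an assumption-laden example rather than a Cylc tool: find_incomplete_jobs is a hypothetical helper, and the demo runs against a throwaway in-memory table that contains only the columns named in the query above, not the full task_jobs schema. It only reports rows and does not delete anything.

```python
import sqlite3

# Columns that the report-timings query above selects.
COLUMNS = ("name", "cycle", "user_at_host", "batch_sys_name",
           "time_submit", "time_run", "time_run_exit")


def find_incomplete_jobs(conn):
    """Return succeeded task_jobs rows with NULL in any selected column."""
    null_test = " OR ".join("%s IS NULL" % col for col in COLUMNS)
    query = ("SELECT %s FROM task_jobs WHERE run_status = 0 AND (%s)"
             % (", ".join(COLUMNS), null_test))
    return conn.execute(query).fetchall()


# Demo on a simplified, in-memory stand-in for the real table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_jobs (name TEXT, cycle TEXT, user_at_host TEXT, "
    "batch_sys_name TEXT, time_submit TEXT, time_run TEXT, "
    "time_run_exit TEXT, run_status INTEGER)")
conn.executemany(
    "INSERT INTO task_jobs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # Complete, succeeded job: not reported.
        ("foo", "20140125T0000Z", "localhost", "pbs",
         "2021-05-21T17:02:42Z", "2021-05-21T17:03:29Z",
         "2021-05-21T17:04:11Z", 0),
        # Succeeded job with a NULL run start time: reported.
        ("bar", "20140125T0000Z", "localhost", "pbs",
         "2021-05-21T17:02:42Z", None, "2021-05-21T17:04:11Z", 0),
        # Failed job with NULL times: excluded by run_status = 0.
        ("baz", "20140125T0000Z", "localhost", "pbs",
         "2021-05-21T17:02:42Z", None, None, 1),
    ])

for row in find_incomplete_jobs(conn):
    print(row)
```

To try this on a real suite, replace the in-memory connection with sqlite3.connect() on the suite's .service/db file, ideally a copy of it, so the original stays untouched while you investigate.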