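I am trying to retry a failed job with Cylc on a PBS system, but both the Cylc graphical display and the cat-state output still show the job as submitted even after the job has aborted. The model output ends with the abort, followed by the cat-state output and my suite.rc: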
34: Lagrangian levels are crossing 9.999999999999999E-012
34: Run will ABORT!
34: Suggest to increase NSPLTVRM
34: ERROR: te_map: Lagrangian levels are crossing
34:Image PC Routine Line Source
34:cesm.exe 000000000375022D Unknown Unknown Unknown
34:cesm.exe 0000000002EBB682 shr_abort_mod_mp_ 114 shr_abort_mod.F90
34:cesm.exe 0000000000F253D9 te_map_mod_mp_te_ 512 te_map.F90
34:cesm.exe 0000000000569332 dyn_comp_mp_dyn_r 2621 dyn_comp.F90
34:cesm.exe 0000000000EE7CA7 stepon_mp_stepon_ 315 stepon.F90
34:cesm.exe 00000000004EC8FF cam_comp_mp_cam_r 243 cam_comp.F90
34:cesm.exe 00000000004DE7ED atm_comp_mct_mp_a 454 atm_comp_mct.F90
34:cesm.exe 0000000000425A44 component_mod_mp_ 728 component_mod.F90
34:cesm.exe 000000000040CABD cime_comp_mod_mp_ 3436 cime_comp_mod.F90
34:cesm.exe 00000000004256EC MAIN__ 125 cime_driver.F90
34:cesm.exe 0000000000408E5E Unknown Unknown Unknown
34:libc.so.6 00002B0DB7DC56E5 __libc_start_main Unknown Unknown
34:cesm.exe 0000000000408D69 Unknown Unknown Unknown
34:MPT ERROR: Rank 34(g:34) is aborting with error code 1001.
34: Process ID: 71751, Host: r8i5n12, Program: /glade/scratch/jedwards/70Lwaccm6.00/bld/cesm.exe
34: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
34:
-1: aborting job
^Ccheyenne6: /glade/work/jedwards/cases_S2S/70Lwaccm6.00
:) cylc cat-state testsuite
run mode : live
time : 2019-08-28T19:27:19Z (1567020439)
initial cycle : None
final cycle : None
(dp0
.
Begin task states
st_archive.19990802T0000Z : status=waiting, spawned=False
get_data.19990802T0000Z : status=succeeded, spawned=True
get_data.19990809T0000Z : status=waiting, spawned=False
run.19990802T0000Z : status=submitted, spawned=True
run.19990809T0000Z : status=waiting, spawned=False
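For anyone who wants to check for this condition from a script, here is a minimal sketch that picks out tasks still marked as submitted. It assumes the task lines keep the `name.cyclepoint : status=..., spawned=...` layout shown above. The function name `tasks_with_status` and the `SAMPLE` text are illustrative only and are not part of Cylc.

```python
import re

# Line layout assumed from the cat-state output above:
#   name.cyclepoint : status=STATE, spawned=BOOL
TASK_LINE = re.compile(
    r"^(?P<name>\w+)\.(?P<point>\w+)\s*:\s*"
    r"status=(?P<status>\w+),\s*spawned=(?P<spawned>\w+)$"
)


def tasks_with_status(cat_state_text, status):
    """Return (task name, cycle point) pairs whose status matches."""
    found = []
    for line in cat_state_text.splitlines():
        match = TASK_LINE.match(line.strip())
        # Header lines such as "run mode : live" do not match and are skipped.
        if match and match.group("status") == status:
            found.append((match.group("name"), match.group("point")))
    return found


# Hypothetical sample: the task-state lines copied from the output above.
SAMPLE = """\
Begin task states
st_archive.19990802T0000Z : status=waiting, spawned=False
get_data.19990802T0000Z : status=succeeded, spawned=True
get_data.19990809T0000Z : status=waiting, spawned=False
run.19990802T0000Z : status=submitted, spawned=True
run.19990809T0000Z : status=waiting, spawned=False
"""

if __name__ == "__main__":
    print(tasks_with_status(SAMPLE, "submitted"))
```

With the sample above, the only task reported as submitted is run at 19990802T0000Z, which is the job that has already aborted.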
----- suite.rc -------
[meta]
    title = test CESM CYLC workflow for 70Lwaccm6.00
[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = 19990802T0000Z
    final cycle point = 19990830T0000Z
    [[dependencies]]
        [[[R1]]]
            graph = "get_data => run => st_archive "
        [[[R/P1W]]] # Weekly Cycling
            graph = """
                st_archive[-P1W] => get_data => run
                run => st_archive
            """
[runtime]
    [[get_data]]
        script = cd /glade/work/jedwards/sandboxes/CESM2-Realtime-Forecast/bin/; ./s2srun.py
    [[st_archive]]
        script = cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00; ./case.submit --job case.st_archive; ./xmlchange CONTINUE_RUN=TRUE
    [[run]]
        script = cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00; ./case.submit --job case.run
        [[[job]]]
            execution retry delays = PT0S, PT10S, PT20S, PT30S
            batch system = pbs
            batch submit command template = qsub -q regular -l walltime=04:00:00 -A P93300606 '%(job)s'
        [[[directives]]]
            -r = n
            -j = oe
            -V =
            -S = /bin/bash
            -l = select=15:ncpus=36:mpiprocs=36:ompthreads=1
        [[[events]]]
            handlers = cp ../user_nl_cam.${CYLC_TASK_TRY_NUMBER} user_nl_cam
            handler events = retry, failed