Cylc not capturing failed jobs

I am trying to retry a failed job with Cylc on a PBS system, but the Cylc GUI and the cat-state output both show the job as submitted even after the job has aborted.

34: Lagrangian levels are crossing  9.999999999999999E-012
34: Run will ABORT!
34: Suggest to increase NSPLTVRM
34: ERROR: te_map: Lagrangian levels are crossing
34:Image              PC                Routine            Line        Source             
34:cesm.exe           000000000375022D  Unknown               Unknown  Unknown
34:cesm.exe           0000000002EBB682  shr_abort_mod_mp_         114  shr_abort_mod.F90
34:cesm.exe           0000000000F253D9  te_map_mod_mp_te_         512  te_map.F90
34:cesm.exe           0000000000569332  dyn_comp_mp_dyn_r        2621  dyn_comp.F90
34:cesm.exe           0000000000EE7CA7  stepon_mp_stepon_         315  stepon.F90
34:cesm.exe           00000000004EC8FF  cam_comp_mp_cam_r         243  cam_comp.F90
34:cesm.exe           00000000004DE7ED  atm_comp_mct_mp_a         454  atm_comp_mct.F90
34:cesm.exe           0000000000425A44  component_mod_mp_         728  component_mod.F90
34:cesm.exe           000000000040CABD  cime_comp_mod_mp_        3436  cime_comp_mod.F90
34:cesm.exe           00000000004256EC  MAIN__                    125  cime_driver.F90
34:cesm.exe           0000000000408E5E  Unknown               Unknown  Unknown
34:libc.so.6          00002B0DB7DC56E5  __libc_start_main     Unknown  Unknown
34:cesm.exe           0000000000408D69  Unknown               Unknown  Unknown
34:MPT ERROR: Rank 34(g:34) is aborting with error code 1001.
34:	Process ID: 71751, Host: r8i5n12, Program: /glade/scratch/jedwards/70Lwaccm6.00/bld/cesm.exe
34:	MPT Version: HPE MPT 2.19  02/23/19 05:30:09
34:
-1: aborting job

^Ccheyenne6: /glade/work/jedwards/cases_S2S/70Lwaccm6.00

:) cylc cat-state testsuite 

run mode : live
time : 2019-08-28T19:27:19Z (1567020439)
initial cycle : None
final cycle : None
(dp0
.
Begin task states
st_archive.19990802T0000Z : status=waiting, spawned=False
get_data.19990802T0000Z : status=succeeded, spawned=True
get_data.19990809T0000Z : status=waiting, spawned=False
run.19990802T0000Z : status=submitted, spawned=True
run.19990809T0000Z : status=waiting, spawned=False
----- suite.rc -------
[meta]
  title = test CESM CYLC workflow for 70Lwaccm6.00
[cylc]
  UTC mode = True

[scheduling]
  initial cycle point = 19990802T0000Z
  final cycle point = 19990830T0000Z

  [[dependencies]]
    [[[R1]]]
      graph = "get_data => run => st_archive "
    [[[R/P1W]]] # Weekly Cycling
      graph = """
	     st_archive[-P1W] => get_data => run
             run => st_archive
	     """
[runtime]
  [[get_data]]
    script = cd /glade/work/jedwards/sandboxes/CESM2-Realtime-Forecast/bin/; ./s2srun.py
  [[st_archive]]
    script = cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00; ./case.submit --job case.st_archive; ./xmlchange CONTINUE_RUN=TRUE

  [[run]]
    script = cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00; ./case.submit --job case.run
    [[[job]]]
      execution retry delays = PT0S, PT10S, PT20S, PT30S
      batch system = pbs
      batch submit command template = qsub -q regular -l walltime=04:00:00 -A P93300606 '%(job)s'
    [[[directives]]]
      -r = n
      -j = oe
      -V =
      -S = /bin/bash
      -l = select=15:ncpus=36:mpiprocs=36:ompthreads=1
    [[[events]]]
      handlers = cp ../user_nl_cam.${CYLC_TASK_TRY_NUMBER} user_nl_cam
      handler events = retry, failed

What I really want to happen is to copy a new input file into the run directory and then restart the run. I'm not sure whether what I have in the suite.rc will do that correctly even if Cylc detects the failure. Any suggestions? Thanks

Hi Jim,

If your Cylc server program thinks a job is still “submitted” even though it is running or has completed (succeeded or failed), that implies that job status messages cannot be sent back from the job host to the suite host. If you look at the job.err file for the task, you should see evidence of a bunch of failed message sends.

If it's just a naming issue (the suite host server has to self-identify by the same name or IP address that is seen from the job host), see https://cylc.github.io/doc/built-sphinx/appendices/site-user-config-ref.html#suite-host-self-identification
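For reference, that setting lives in the global config (global.rc) on the suite host. A minimal sketch only; the method and host value below are placeholders for whatever name or address the job host can actually resolve:

[suite host self-identification]
  # "name" (the default), "address", or "hardwired"
  method = hardwired
  # placeholder: the suite host's name or IP as seen from the job host
  host = cheyenne6.example.com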

However, it is usually a network configuration issue: the suite host server (and Cylc ports) must be visible from the compute node that PBS runs your job on. You can test this with a job that tries to ping (or whatever) the suite host. If this is the problem, you either have to get your system admins to allow this job-to-suite communication, OR configure Cylc to use suite-to-job-host polling to update job status. Polling is not as good, because you only find out the true job status at each polling interval, but it may be good enough if you can't get reverse comms back over the network.
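If you do end up having to fall back to polling, the switch is made in the global config for the job host, and the intervals can be tuned per task in the suite.rc. A sketch, assuming Cylc 7; the host pattern and intervals here are just examples:

# global.rc on the suite host
[hosts]
  [[cheyenne*]]
    task communication method = poll

# optional per-task tuning in the suite.rc
[runtime]
  [[run]]
    [[[job]]]
      submission polling intervals = PT1M
      execution polling intervals = PT5M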

Hilary

I was able to fix that with suite host self-identification. The retry now occurs, but the handler that I want to use to move a new namelist into place doesn't work. How do I go about that step?

That’s good.

The retry now occurs, but the handler that I want to use to move a new namelist into place doesn't work. How do I go about that step?

Are you saying that you’ve attached a handler to the retry event, but it doesn’t run at all, or that it doesn’t do what it is supposed to do before the retrying task runs?

I’ll guess the latter, since event handler execution should not be broken (it is well tested).

Note that event handlers are reactive - they are executed by the suite server program, on the suite host, in response to an event - not on the job host before the event occurs. So you can’t use a retry handler to modify the configuration of a task before it retries. You can do that sort of thing by accessing the try number in job scripting, as described here: https://cylc.github.io/doc/built-sphinx-single/index.html#task-retry-on-failure
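In your case that could be as simple as doing the copy in the run task's own job scripting rather than in an event handler. A sketch only, reusing your case directory and the user_nl_cam.N naming from your handler; the relative path to the alternate namelists is a placeholder you would adjust to wherever those files live:

[runtime]
  [[run]]
    # pick the namelist that matches this attempt before submitting the model
    pre-script = """
      cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00
      # placeholder path: point this at your user_nl_cam.N files
      if [[ -f ../user_nl_cam.${CYLC_TASK_TRY_NUMBER} ]]; then
        cp ../user_nl_cam.${CYLC_TASK_TRY_NUMBER} user_nl_cam
      fi
      """
    script = cd /glade/work/jedwards/cases_S2S/70Lwaccm6.00; ./case.submit --job case.run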

Hilary

Yes, I have it working now.

Thanks,