Solved: Modifying qsub directives on retries

I have a suite that sometimes fails due to numerical instability in an ocean model. It retries the task, using ${CYLC_TASK_TRY_NUMBER} to work through an increasingly desperate list of modifications to get the model to run. One of these modifications is to reduce the timestep in the ocean model itself.

This all works fine, but my problem is in setting the PBS queue time limit for the task:
#PBS -l walltime=1200
The limit needs to be big to accommodate those retries with a small model timestep. Normally the task needs only 300 seconds, and so the suite is spending unnecessary extra time queueing.

Is there a way please to make the PBS directive change according to ${CYLC_TASK_TRY_NUMBER}? Perhaps a way to override the default qsub command?

Thanks
Richard

Hi,

I think you have two options:

  1. Broadcast the execution time limit.

    You can modify the suite’s configuration whilst it is running using cylc broadcast.

    Perhaps something like this (untested):

    [foo]
        err-script = """
            # on failure set the "execution time limit" for the next retry.
            # TIMELIMIT is a bash array of limits in seconds that you define.
            cylc broadcast "${CYLC_SUITE_NAME}" \
                -n "${CYLC_TASK_NAME}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[job]execution time limit=PT${TIMELIMIT[$CYLC_TASK_TRY_NUMBER]}S"
        """
    
  2. Recovery tasks

    I think the more conventional approach is to have a separate task for “desperate retries” e.g. my_model and my_model_short_step. This allows you to configure the recovery task separately which makes things a bit nicer.

    If using Rose Applications you can configure your recovery task to use the same Rose Application (see the ROSE_TASK_APP environment variable).

    The catch is that you need to “clean up” the graph to remove / reset failed tasks in order to allow the suite to continue. We typically use “suicide triggers” for this; here’s a tutorial on how to use them:

    http://metomi.github.io/rose/doc/html/tutorial/cylc/furthertopics/suicide-triggers.html#recovery-task

    Word of warning: use suicide triggers sparingly, negative logic becomes a headache very quickly.
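
    For what it's worth, the recovery-task approach might look roughly like this in a Cylc 7 suite.rc. This is an untested sketch: the downstream task name post is a placeholder, and the time limits are simply the figures from the question.

    ```
    [scheduling]
        [[dependencies]]
            graph = """
                # run the recovery task only if the normal task fails
                my_model:fail => my_model_short_step
                # carry on once either of them succeeds
                my_model | my_model_short_step => post
                # suicide triggers: remove whichever task is no longer needed
                my_model => !my_model_short_step
                my_model_short_step => !my_model
            """
    [runtime]
        [[my_model]]
            [[[job]]]
                execution time limit = PT300S
        [[my_model_short_step]]
            [[[job]]]
                execution time limit = PT1200S
            [[[environment]]]
                # reuse the my_model Rose application
                ROSE_TASK_APP = my_model
    ```

    The short-timestep settings themselves would still need to be passed to the recovery task somehow (for example through its environment); that part is not shown here.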

Thanks Oliver. This does indeed work:

err-script = """
# on failure set the "execution time limit" for the next retry.
TIMELIMIT=(300 700 700 1400)
cylc broadcast "${CYLC_SUITE_NAME}" \
    -n "${CYLC_TASK_NAME}" \
    -p "${CYLC_TASK_CYCLE_POINT}" \
    -s "[job]execution time limit=PT${TIMELIMIT[$CYLC_TASK_TRY_NUMBER]}S"
"""
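
One thing to watch with the array lookup: if the task is configured with more retries than TIMELIMIT has entries, the index runs off the end and the expansion is empty (or an error under `set -u`), so the broadcast value would come out as "PTS". A minimal sketch of one way to guard against that, runnable as plain bash outside Cylc; `time_limit_for_try` is a hypothetical helper, not a Cylc command:

```shell
#!/usr/bin/env bash
# Limits in seconds, indexed by try number (same values as above).
TIMELIMIT=(300 700 700 1400)

# Hypothetical helper: print the ISO 8601 limit for a given try number,
# reusing the last entry once the list runs out.
time_limit_for_try() {
    local try="$1"
    local last=$(( ${#TIMELIMIT[@]} - 1 ))
    # Clamp the index so that later tries reuse the final entry.
    if (( try > last )); then
        try="$last"
    fi
    echo "PT${TIMELIMIT[$try]}S"
}

# Outside a Cylc job CYLC_TASK_TRY_NUMBER is unset, so default to 1 here.
time_limit_for_try "${CYLC_TASK_TRY_NUMBER:-1}"
```

In the err-script the `-s` line could then use `"[job]execution time limit=$(time_limit_for_try "${CYLC_TASK_TRY_NUMBER}")"`, with the function defined just above the broadcast.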