I have a suite whose tasks process real-time data with Python scripts on a daily basis.
It starts a daily cycle at a given time every day and first runs check_file tasks to check whether today's data are ready, with ‘execution retry delays = 14400*PT10M’.
If the data (created by another suite) are not ready, the task runs again 10 minutes later.
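To make the retry mechanism concrete, here is a minimal sketch of the kind of check a check_file task could perform. It is an illustration only, not my actual script: the function names `data_ready` and `check_exit_code` and the "file exists and is non-empty" test are assumptions. The point is that a non-zero exit status makes the job fail, which is what triggers the next retry.

```python
import sys
from pathlib import Path


def data_ready(path):
    """Return True if the expected data file exists and is non-empty.

    Hypothetical readiness test; the real check may look at other things.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0


def check_exit_code(path):
    """Return 0 if the data are ready, 1 otherwise.

    A non-zero exit status marks the job as failed, so the configured
    retry delay applies and the task is resubmitted later.
    """
    if data_ready(path):
        return 0
    print(f"data not ready: {path}", file=sys.stderr)
    return 1
```

In a script, the entry point would end with something like `sys.exit(check_exit_code(path))`, with `path` pointing at the current cycle's data file.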
Normally the task retries more than 200 times (~1.4 days) until it succeeds (because the data are ready), and then the actual tasks start to process today’s data.
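As a sanity check on these numbers, here is a small Python sketch of the arithmetic (my own calculation, not something taken from the suite; the names `RETRY_INTERVAL`, `MAX_RETRIES` and `retry_window` are just for illustration). It suggests that the configured retries cover far more time than a task normally needs, so the retry limit itself should not be running out.

```python
from datetime import timedelta

RETRY_INTERVAL = timedelta(minutes=10)  # PT10M
MAX_RETRIES = 14400                     # from 14400*PT10M


def retry_window(n_retries, interval=RETRY_INTERVAL):
    """Total time covered by n_retries retries at a fixed interval."""
    return n_retries * interval


typical = retry_window(200)          # what a task normally needs
maximum = retry_window(MAX_RETRIES)  # what the configuration allows

print(round(typical.total_seconds() / 86400, 2))  # 1.39 (days)
print(maximum.days)                               # 100 (days)
```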
But after around 10 days, the suite gets killed quietly (or stops automatically).
(No final cycle point is specified in the suite, so it wasn’t running out of cycles.)
By the time it was killed, only the check_file tasks were running; the other tasks were in the waiting state.
Most of the time, I couldn’t find any error messages related to the suite being killed.
Sometimes I could find this message:
ection object at 0xc9eb90>: Failed to establish a new connection: [Errno 111] Connection refused’,))
Do you have any idea what the cause might be? Is there any way to debug this problem?