Cylc tasks reported as running but they're not

EDIT:

By killing the ‘running’ process and re-triggering it I have now gotten it to work! That said, I had previously tried this and it didn’t work.

I will be running several more of this type of workflow in the future so will report back if I see the same issue again. :slight_smile:

Hi there

I’m working on the JASMIN machine doing data analysis for CMIP6.

All our data analysis is done through Cylc suites.

I’ve noticed a persistent problem with some tasks: they report a running status for hours, sometimes days, before completing.

However, when you look at the job.out files after the job has succeeded, they show that the job only took 40 minutes or so.

Is this behaviour known at all? I’m a bit stumped and am not sure how to investigate further…

I’ve also reported this to the JASMIN helpdesk.

Thanks for any tips.

Jonny

Hi Jonny
If Cylc thinks a job is running when it isn’t then that implies that Cylc didn’t receive a message back from the job to indicate it had finished. This could be due to communication issues (the job tried to send a message but failed) or because the job died without getting the chance to send a message. Cylc should recognise the job has completed when it polls the job (assuming regular polling is configured in the site config or in your workflow config).
If you send me details of where to find your workflow on JASMIN I’ll try to take a look to see if I can identify the problem.

Hi Jonny,

To add a little to what @dpmatthews said: if Cylc thinks a job is still running long after you would expect it to have finished, that almost certainly means that the job was not able to report its success or failure status back to the scheduler when it finished. If that happens:

  • polling the task (via CLI or GUI) will launch a process on the job host to determine the true status
  • the job.err file should report job state message send failures

If job status messages are blocked, that suggests either a temporary outage of the network between job and suite host, or a permanent network configuration problem. In the latter case the problem will not fix itself: you will either have to ask the system admins to open the Cylc ports between the hosts, or configure Cylc to track job status by regular automatic polling.
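
To see by hand what a poll would find, one option is to read the job’s `job.status` file in its job log directory on the job host. The sketch below is a minimal Python example and not part of Cylc itself: `parse_job_status` and `summarise` are hypothetical helper names, and the key names `CYLC_JOB_INIT_TIME` and `CYLC_JOB_EXIT` are assumed from typical Cylc job status files, so check them against a real file from your own suite before relying on it.

```python
def parse_job_status(text):
    """Parse KEY=VALUE lines (as found in a job.status file) into a dict.

    Lines without an '=' are ignored.
    """
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            status[key.strip()] = value.strip()
    return status


def summarise(status):
    """Describe the job state recorded in the parsed status dict.

    Assumption: the job writes CYLC_JOB_INIT_TIME when it starts and
    CYLC_JOB_EXIT (e.g. SUCCEEDED) when it finishes.
    """
    if "CYLC_JOB_EXIT" in status:
        return "finished: " + status["CYLC_JOB_EXIT"]
    if "CYLC_JOB_INIT_TIME" in status:
        return "started, no exit recorded"
    return "not started"


# Example with made-up file contents; in practice, read the text from
# the job.status file that sits next to job.out and job.err.
example = "CYLC_JOB_INIT_TIME=2021-01-01T00:00:00Z\nCYLC_JOB_EXIT=SUCCEEDED\n"
print(summarise(parse_job_status(example)))
```

If this reports a finished job while the GUI still shows the task as running, the completion message most likely never reached the scheduler, and polling the task should bring the two back in line.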

Hilary

Hi Dave

Thanks a lot for your help here.

I’m afraid that I can’t point you to any helpful job logs since I restarted the data workflow and it subsequently worked.

In my experience of using Cylc on JASMIN so far, this type of behaviour is not uncommon.

I’m currently running an identical workflow on another dataset and will report back here if I get the same issue.

FYI there is a private GitHub repo here if you want to get deeper into this! This is where the CDDS system for JASMIN is developed.

Thanks again

Jonny

Hey Hilary

I reported this to the sys admins yesterday and got a reply saying that they couldn’t see anything wrong with my suites and/or account.

I’ve given up on this version and started again from scratch and so far it’s running fine.

I’ll flag this up here and to the sys admins again if the behaviour recurs.

See my last reply to @dpmatthews re the GitHub repo if you’re interested…

Thanks again!

Jonny