Cylc tasks reported as running but they're not

jonnyhtw · March 24, 2020, 10:37pm

EDIT:

By killing the ‘running’ process and re-triggering it I have now gotten it to work! That said, I had previously tried this and it didn’t work.

I will be running several more of this type of workflow in the future so will report back if I see the same issue again.

Hi there

I’m working on the JASMIN machine doing data analysis for CMIP6.

All our data analysis is done through Cylc suites.

I’ve noticed a persistent problem with some tasks: they report a running status for hours, sometimes days, before completing.

However, when you look at the job.out files after the job has succeeded, it says that the job only took 40 minutes or so.

Is this behaviour known at all? I’m a bit stumped and am not sure how to investigate further…

I’ve also reported this to the JASMIN helpdesk.

Thanks for any tips.

Jonny

dpmatthews · March 25, 2020, 8:08am

Hi Jonny
If Cylc thinks a job is running when it isn’t then that implies that Cylc didn’t receive a message back from the job to indicate it had finished. This could be due to communication issues (the job tried to send a message but failed) or because the job died without getting the chance to send a message. Cylc should recognise the job has completed when it polls the job (assuming regular polling is configured in the site config or in your workflow config).
If you send me details of where to find your workflow on JASMIN I’ll try to take a look to see if I can identify the problem.

hilary.j.oliver · March 29, 2020, 10:47pm

Hi Jonny,

To add a little to what @dpmatthews said: if Cylc thinks a job is still running long after you would expect it to have finished, that almost certainly means that the job was not able to report its success or failure status back to the scheduler when it finished. If that happens:

- polling the task (via CLI or GUI) will launch a process on the job host to determine the true status
- the job.err file should report job state message send failures
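
As a rough illustration of what "determine the true status" can look like on the job host, here is a minimal Python sketch that reads a job status file of KEY=VALUE lines and reports the exit state and elapsed time. The key names (CYLC_JOB_INIT_TIME, CYLC_JOB_EXIT, CYLC_JOB_EXIT_TIME) and the timestamp format are assumptions on my part, not something confirmed in this thread, so check them against a real job.status file in your job log directory before relying on it.

```python
from datetime import datetime

# Assumed timestamp format; adjust if your status file differs.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_job_status(text):
    """Parse KEY=VALUE lines from a job status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # ignore lines without an '=' sign
            fields[key] = value
    return fields


def summarise(fields):
    """Return (state, elapsed_seconds) where elapsed may be None.

    The key names used here are assumptions, not a documented API.
    """
    exit_state = fields.get("CYLC_JOB_EXIT")
    start = fields.get("CYLC_JOB_INIT_TIME")
    end = fields.get("CYLC_JOB_EXIT_TIME")
    if exit_state is None:
        # No exit record yet: the job has either started or not.
        return ("running" if start else "not started"), None
    elapsed = None
    if start and end:
        delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        elapsed = delta.total_seconds()
    return exit_state.lower(), elapsed


# Hypothetical file contents for a job that ran for 40 minutes.
sample = (
    "CYLC_JOB_INIT_TIME=2020-03-24T10:00:00Z\n"
    "CYLC_JOB_EXIT=SUCCEEDED\n"
    "CYLC_JOB_EXIT_TIME=2020-03-24T10:40:00Z\n"
)
state, elapsed = summarise(parse_job_status(sample))
```

If the file shows an exit record while the scheduler still shows the task as running, that points to the "message never arrived" case described above rather than a job that is genuinely still executing.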

If job status messages are blocked, that suggests either a temporary outage of the network between the job host and the suite host, or a permanent network configuration problem. In the latter case the problem will not fix itself: you will either have to ask the system admins to open the Cylc ports between the hosts, or configure Cylc to track job status by regular automatic polling.
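
One way to gather evidence for the sys admins is to test, from the job host, whether a TCP connection to the suite host can be opened at all. The sketch below uses only the Python standard library. The host name and port in the usage comment are hypothetical placeholders: substitute the host and port your own suite is actually using. A successful connection only shows the port is reachable at that moment, so it cannot rule out a temporary outage earlier on.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


# Example call (hypothetical host and port, run this on the job host):
# print(port_reachable("suite-host.example.org", 43001))
```

If this returns False from the job host but True from the suite host itself, that is consistent with blocked ports between the two, in which case polling is probably the more practical option.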

Hilary

jonnyhtw · March 31, 2020, 1:23am

Hi Dave

Thanks a lot for your help here.

I’m afraid that I can’t point you to any helpful job logs since I restarted the data workflow and it subsequently worked.

In my experience of using Cylc on JASMIN so far, this type of behaviour is not uncommon.

I’m currently running an identical workflow on another dataset and will report back here if I get the same issue.

FYI there is a private GitHub repo here if you want to get deeper into this! This is where the CDDS system for JASMIN is developed.

Thanks again

Jonny

jonnyhtw · March 31, 2020, 1:25am

Hey Hilary

I reported this to the sys admins yesterday and got a reply saying that they couldn’t see anything wrong with my suites and/or account.

I’ve given up on this version and started again from scratch and so far it’s running fine.

I’ll flag this up here and to the sys admins again if I get this behaviour again.

See my last reply to @dpmatthews re the GitHub repo if you’re interested…

Thanks again!

Jonny
