Similar to a problem I was having recently with background tasks, I’m seeing task scripts submitted to PBS that start on a compute node and run to successful completion, but whose status on the Cylc server never changes from “submitted”. The problem appears to be that the messages the tasks send to the server to report their state changes are not getting through. For example, the job.out file for a task that hung in this way includes:
2021-07-23T23:40:30Z INFO - started
2021-07-23T23:40:30Z WARNING - Message send failed, try 1 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:35Z WARNING - Message send failed, try 2 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:40Z WARNING - Message send failed, try 3 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:45Z WARNING - Message send failed, try 4 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:50Z WARNING - Message send failed, try 5 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:55Z WARNING - Message send failed, try 6 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:00Z WARNING - Message send failed, try 7 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
2021-07-23T23:41:00Z INFO - succeeded
2021-07-23T23:41:00Z WARNING - Message send failed, try 1 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:05Z WARNING - Message send failed, try 2 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:10Z WARNING - Message send failed, try 3 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:15Z WARNING - Message send failed, try 4 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:20Z WARNING - Message send failed, try 5 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:25Z WARNING - Message send failed, try 6 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:30Z WARNING - Message send failed, try 7 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
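In case it helps when putting a ticket together, here is a minimal Python sketch that condenses output like the above into a few numbers. The regular expression is inferred only from the lines in this excerpt, so other Cylc versions may log a different format, and the function name `summarize_send_failures` is my own invention, not part of Cylc.

```python
import re
from collections import Counter

# Pattern inferred from the job.out excerpt above (an assumption, not a Cylc spec).
FAILURE = re.compile(
    r"^(?P<time>\S+) WARNING - Message send failed, "
    r"try (?P<attempt>\d+) of (?P<total>\d+): (?P<reason>.*)$"
)


def summarize_send_failures(lines):
    """Count failed message sends per reason and note the first and last timestamps."""
    reasons = Counter()
    first = last = None
    exhausted = 0  # how many times the final retry ("try N of N") also failed
    for line in lines:
        match = FAILURE.match(line.strip())
        if not match:
            continue  # skip INFO lines and the "retry in ..." continuation lines
        reasons[match["reason"]] += 1
        first = first or match["time"]
        last = match["time"]
        if match["attempt"] == match["total"]:
            exhausted += 1
    return {
        "reasons": dict(reasons),
        "first": first,
        "last": last,
        "exhausted": exhausted,
    }
```

It can be called with an open file, for example `summarize_send_failures(open("job.out"))`, and the resulting counts and time window pasted into the ticket alongside the raw log.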
I know that I should be able to get around communication issues like this by switching to polling. But I’d rather not do that, because the suite is large and complicated, and timeliness is a major factor. I also vaguely remember some other drawback to polling, such as other types of messages (like BROADCAST commands to change the environment of a different task) having trouble getting through, but I can’t find that in the docs right now, so maybe I’m imagining it.
Anyway, this is sporadic, and I’m 100% willing to believe it is a platform issue and not a Cylc issue. But before I can file a trouble ticket with the help desk folks for the HPC in question, I need to diagnose this as well as I can so that I can give them as much info as possible. I can’t presume they know anything about Cylc. Is there any additional info on what kinds of network issues can cause these messages from Cylc? Any other advice?
Thanks!