PBS tasks unable to pass status change messages to server

Similar to a problem I was having recently with background tasks, I’m seeing task scripts submitted via PBS that start on a compute node and run to successful completion, but whose status on the Cylc server never changes from “submitted”. The problem appears to be that the state-change messages the tasks send back to the server are not getting through. For example, the job.out file for a task that got stuck in this way includes:

2021-07-23T23:40:30Z INFO - started
2021-07-23T23:40:30Z WARNING - Message send failed, try 1 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:35Z WARNING - Message send failed, try 2 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:40Z WARNING - Message send failed, try 3 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:45Z WARNING - Message send failed, try 4 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:50Z WARNING - Message send failed, try 5 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:40:55Z WARNING - Message send failed, try 6 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:00Z WARNING - Message send failed, try 7 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
2021-07-23T23:41:00Z INFO - succeeded
2021-07-23T23:41:00Z WARNING - Message send failed, try 1 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:05Z WARNING - Message send failed, try 2 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:10Z WARNING - Message send failed, try 3 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:15Z WARNING - Message send failed, try 4 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:20Z WARNING - Message send failed, try 5 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:25Z WARNING - Message send failed, try 6 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-23T23:41:30Z WARNING - Message send failed, try 7 of 7: Not authorized: http://onyx.erdc.hpc.mil:43062/put_messages: cylc-message: access type 'private'

I know that I should be able to get around communication issues like this by switching to polling. But I’d rather not do that, because the suite is large and complicated, and timeliness is a major factor. Also, something is tickling the back of my memory that polling has some other drawback, such as issues with other types of messages (e.g. BROADCAST commands to change the environment of a different task) getting through, but I can’t find that in the docs right now, so maybe I’m imagining it.

Anyway, as mentioned, this is sporadic, and I’m 100% willing to believe this is a platform issue and not a Cylc issue. But before I file a trouble ticket with the help desk folks for the HPC in question, I need to diagnose this as well as I can so that I can give them as much info as possible; I can’t presume they know anything about Cylc. Can you offer any additional info about what kinds of network issues could cause these errors from Cylc? Any other advice?

Thanks!

Hi @funkapus

First thing, once a Cylc task job is running on the job host it makes no difference how it got there - PBS or “background” etc. (although PBS might assign your job to different nodes, of course).

The error messages show that a successful connection to the scheduler was made over the network, but the type of action (registering a job status message) was rejected as not authorized. Cylc 7 authorizes only two kinds of access: “private” for full access, if you have the automatically-generated suite “passphrase” file in the right place on disk; and “public” for read-only access, if you don’t.

The passphrase file, like the SSL certificate, is automatically installed (via ssh) into ~/cylc-run/<suite-name>/.service/ on the job submission host the first time the suite submits a job there (if that host doesn’t share a filesystem with the scheduler host).
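
On the scheduler host that directory typically looks something like this (a rough illustration only; the exact set of files can vary a little between Cylc 7 versions, and only a subset of them is installed on remote hosts):

$ ls ~/cylc-run/<suite-name>/.service/
contact  db  passphrase  source  ssl.cert  ssl.pem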

So your job.err messages above suggest that the suite passphrase file is wrong or has been deleted. I can generate the exact same errors by deliberately messing with the passphrase file in a test suite:

[scheduling]
    [[dependencies]]
        graph = foo
[runtime]
    [[foo]]
        script = "echo xxxxx >> $CYLC_SUITE_RUN_DIR/.service/passphrase"
        [[[remote]]]
            host = wizard

If I run this, I get warnings in the scheduler log:

... WARNING - [client-connect] DENIED hrevilo@wizard:cylc-message aa02f40b-37b7-4c19-ba6e-98f3498bc7eb

And the same errors as you in job.err on the job host:

$ cylc cat-log -f e d7 foo.1

2021-07-26T11:46:29+12:00 WARNING - Message send failed, try 1 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:46:34+12:00 WARNING - Message send failed, try 2 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:46:39+12:00 WARNING - Message send failed, try 3 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:46:44+12:00 WARNING - Message send failed, try 4 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:46:50+12:00 WARNING - Message send failed, try 5 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:46:55+12:00 WARNING - Message send failed, try 6 of 7: Not authorized: https://niwa-1007823l:43019/put_messages: cylc-message: access type 'private'
   retry in 5.0 seconds, timeout is 30.0
2021-07-26T11:47:18+12:00 WARNING - Message send failed, try 7 of 7: Cannot connect: https://niwa-1007823l:43019/put_messages: suite "d7" already stopped

Now it is very strange that your problem is intermittent. The only potential causes I can think of, which both seem unlikely, are:

  • the “seriously niche” scenario given by @oliver.sanders in response to your earlier problem, which could result in the job connecting to a different suite (with a different passphrase) listening on the original port
  • something messing with your suite’s .service directory on the job host

Next time it happens, maybe you could take a look at the .service directory on the job host. Check that the suite port is the same as the one that the suite is actually listening on, and check that the passphrase is identical to the one on the scheduler host.
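
For example, something along these lines (just a sketch; substitute your own suite name and job host, and note that the checksum comparison is only meaningful if the two hosts really do have separate filesystems):

# On the scheduler host: the port advertised by the contact file
# versus the port the running suite actually reports.
$ grep CYLC_SUITE_PORT ~/cylc-run/<suite-name>/.service/contact
$ cylc scan | grep <suite-name>

# Compare the passphrase the job host sees with the scheduler host's copy.
$ ssh <job-host> md5sum cylc-run/<suite-name>/.service/passphrase
$ md5sum ~/cylc-run/<suite-name>/.service/passphrase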

Note that Cylc expects the job host to see the same filesystem as the job submission host. But if that were not the case you wouldn’t be able to run any jobs at all (i.e. it would not be an intermittent problem).

This sounds very similar to the earlier problem of the contact file changing whilst the workflow is running?

Something rather strange is happening here. Only Cylc should read/write to the .service directory and Cylc won’t change these files during the lifetime of the workflow run.

This may be a bit tricky to debug, so, elaborating on Hilary’s answer, here are some ideas:

  • Something is messing with the .service directory on the job host.
    • The stat command should help identify if this is the case.
    • Something like watch -n 1 stat .service/* would work as a monitoring tool.
    • lsof | grep 'cylc-run/<suite-name>/.service' could help identify the culprit.
  • There is a serious lag between the .service files being written and appearing on the filesystem.
    • Given the 30s timeout and 7 retries this seems unlikely.
    • cat .service/* might help determine if this is happening.
    • The pre-script config could be used to dump this info to a file at the start of each job (see the sketch after this list).
    • Perhaps the passphrase/contact files didn’t appear on the filesystem at the same time?
  • The suite is not able to shut down correctly and only partially cleans up after itself.
    • Are you stopping suites with ctrl+c or kill?
    • Are suites crashing unexpectedly?
    • Check the end of the suite log for any nasty looking messages.
  • Multiple runs are interacting in strange ways.
    • This is the “seriously niche” scenario Hilary alluded to.
    • Is the port range being changed from the default?
    • Does the contact file on the workflow host match the one on the job host?
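
For the pre-script idea, here is a minimal sketch of what the (shell) pre-script body might contain, simply dumping the state of the .service files into job.out at the start of every job; $CYLC_SUITE_RUN_DIR is already set in the job environment:

# Snapshot the suite service files as this job sees them (sketch only).
echo "=== .service snapshot at job start ==="
stat "$CYLC_SUITE_RUN_DIR/.service/"*        # timestamps of contact, passphrase, etc.
echo "=== contact file as seen by this job ==="
cat "$CYLC_SUITE_RUN_DIR/.service/contact"
echo "=== passphrase checksum as seen by this job ==="
md5sum "$CYLC_SUITE_RUN_DIR/.service/passphrase"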

Hi Hilary and Oliver – thanks so much (again) for your help!

Your parenthetical comment is why I mentioned PBS – the tasks in question are definitely running on other nodes than the one on which the Cylc server is running.

To clarify the intermittency a little more: you may remember that a month or two ago I had an architectural issue with a pair of tasks that I wanted to cycle several times within a single cycle point, and you recommended a subsuite. I’ve implemented that, and it works well, except for these message-reporting issues. On this particular HPC the issue is currently happening with the subsuite and not the main suite. The subsuite is run every 12 cycles of the main suite (which typically amounts to about an hour and a half of wall clock time), and either a given subsuite run works perfectly or every task in it shows this issue. In other words, the intermittency is not within a single suite run (if the problem appears at all, it affects all tasks/cycle points in that run); it is between different suite runs (some runs are affected, others aren’t). This is different from the problem I reported earlier (but that was on a different HPC, which has since been decommissioned, so I’m hoping it doesn’t come back).

So, I have a subsuite run that’s showing these messages right now; I’ve put the subsuite on hold so I can examine it. I don’t know how to check if the passphrase in .service/passphrase is wrong (the job node sees the same filesystem as the login node the server is running on, so they should have access to the same file); but it’s present, and the timestamp on the file is the same as on the contact file. I’ve also verified that the CYLC_SUITE_PORT in the contact file is the same as the port associated with the process ID for the subsuite’s cylc-run command.

Regarding Oliver’s questions: in the subsuite in question, the stat command shows exactly the same modify and change times for both the contact and passphrase files, so if something monkeyed with them, it monkeyed with both, and did so very shortly after creation, since the first task in the subsuite doesn’t take long to run and was already unable to communicate with the server. I’m not stopping or killing the suite or any tasks within it, and there are no weird messages in the log. I don’t know what to look at regarding the port range being changed; but the nodes the jobs run on write to the same filesystem that the run directory (and therefore the contact file) is in, so they should be seeing the same contact file the server does, and as noted above, the messages in job.out indicate that the job is trying to use the same port number the server is listening on.

I know that my job’s task scripts (which are written in Python) are not the actual job scripts, but instead are called by bash functions within the actual job scripts, which are mostly made up of Cylc functions; and I presume that the status change messages are sent via some bash functions that are part of all that infrastructure. Is there something I can do manually, from the command line, to mimic the process of sending a status change message to the server, and which will also give me more output to stdout/stderr?
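
(For what it’s worth, my understanding is that the generated job script ultimately reports status through the cylc message command, so a sketch of the kind of thing I have in mind would be the following, run from an interactive shell on the compute node. It assumes that exporting the CYLC_* variables from near the top of the generated job file is enough to reproduce the job’s messaging environment; the paths and options shown are guesses rather than something I’ve verified.)

# Find the job's environment setup in the generated job file, e.g. under
# ~/cylc-run/<suite-name>/log/job/<cycle-point>/<task-name>/01/job
$ grep 'export CYLC_' ~/cylc-run/<suite-name>/log/job/<cycle-point>/<task-name>/01/job
# ...export those variables in the current shell, then try sending a message:
$ cylc message 'test message from the job host'
# (adding the standard --debug or -v options may produce more detail)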

Thanks!

Oh, one other thing, Hilary – you mention these messages going to job.err. That’s not the case for me. I’m not seeing any job.err files at all. These messages are going to job.out (so presumably they’re going to stdout rather than stderr?). I don’t know if there’s any significance to that at all, but since you made a point of mentioning job.err, I wanted to make sure we were on the same page.
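
(For reference, I’ve been checking both files with cylc cat-log, along the lines of your command above; I’m assuming -f o selects job.out in the same way -f e selects job.err:)

$ cylc cat-log -f o <suite-name> <task>.<cycle-point>   # job.out
$ cylc cat-log -f e <suite-name> <task>.<cycle-point>   # job.err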

Quick thought, are the subsuites being registered with unique names?

If not it might be worth adding $CYLC_TASK_CYCLE_POINT into the name to ensure subsuites from different cycles can’t interact.

I would probably go for a structure like this:

~/cylc-run/
    base/
        main/
            suite.rc
        sub/
            1234/  # cycle point from the main-suite at which the sub-suite is running
                suite.rc
            2345/  # cycle point from the main-suite at which the sub-suite is running
                suite.rc
$ cylc run base/main
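
The registration could then be done on the fly from the main-suite task that launches the sub-suite, along these lines (a sketch only; the source path and registration names are placeholders):

# In the launching task's script (adjust paths to suit):
SUB_REG="base/sub/${CYLC_TASK_CYCLE_POINT}"
cylc register "${SUB_REG}" "${HOME}/suites/base/sub"
cylc run "${SUB_REG}"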

Roger that, but note that Cylc can also run background jobs on other hosts.

Ah ha, in that case I wonder if @oliver.sanders has nailed the problem. If successive sub-suites can run concurrently you have to make sure they have different names and hence different run directories. Generally you should ensure uniqueness by incorporating the main suite cycle point in the sub-suite name (that you register on the fly, before launching a subsuite). Hopefully I recommended that at that time - apologies if not!

Unfortunately, I don’t think that’s the solution. I am already putting the main suite cycle point in the subsuite name.

Also, while this is configurable and thus would not necessarily always be true, in the particular run case being discussed here the interval at which the subsuite is run (in main-suite cycle points) is the same as the main suite’s max active cycle points, so the runahead limit in the parent suite prevents a second copy of the subsuite from running at the same time.