Remote cylc task fails but the server fails to notice

As part of my workflow, I am submitting jobs (via cylc play) to a remote machine with a configuration like:

[[be-mach]]
    cylc path = /home/sci123/miniconda3/envs/cylc-8.0rc3/bin
    job runner = pbs
    install target = localhost
    global init-script = """
       export WORK_SPACE=/path/to/platform/specific/scratch
    """

from a host defined like:

[[fe-mach]]
    cylc path = /home/sci123/miniconda3/envs/cylc-8.0rc3/bin
    job runner = pbs
    install target = localhost
    global init-script = """
        export WORK_SPACE=/path/to/platform/specific/scratch
    """

However, when the job fails on the be-mach platform, it takes a long time before the failure is recognized on fe-mach, where I believe the server is running (I think it's a server? Sorry if I'm not using the terminology correctly).

Eventually the failure is picked up. I've been digging around the documentation and source code, and it looks like I might be hitting the default poll timing of 15 minutes because the default ZMQ/TCP communication is failing? When I run cylc play without --debug, I see this in the job.err file:

CylcError: the workflow is no longer running at fe-machlogin.webaddress:PORTNUM
It has moved to fe-machlogin:PORTNUM

i.e. the webaddress portion is removed?

Running it with --debug, I get:

2022-07-21T23:36:50Z DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': '\nmutation (\n  $wFlows: [WorkflowID]!,\n  $taskJob: String!,\n  $eventTime: String,\n  $messages: [[String]]\n) {\n  message (\n    workflows: $wFlows,\n    taskJob: $taskJob,\n    eventTime: $eventTime,\n    messages: $messages\n  ) {\n    result\n  }\n}\n', 'variables': {'wFlows': ['test-dev/run1'], 'taskJob': '60000101T0000Z/model_run/01', 'eventTime': '2022-07-21T23:36:50Z', 'messages': [['CRITICAL', 'failed/ERR']]}}, 'meta': {'prog': 'message', 'host': 'cmpnode-999', 'comms_method': 'zmq'}}
Traceback (most recent call last):
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 675, in main
    execute_cmd(command, *cmd_args)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 283, in execute_cmd
    entry_point.resolve()(*args)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/terminal.py", line 226, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/message.py", line 173, in main
    record_messages(workflow_id, task_job, messages)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 88, in record_messages
    send_messages(workflow, task_job, messages, event_time)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 129, in send_messages
    pclient('graphql', mutation_kwargs)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 253, in serial_request
    self.loop.run_until_complete(task)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 198, in async_request
    self.timeout_handler()
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 317, in _timeout_handler
    raise CylcError(

where the error text is the same, about the workflow not running any more but having moved to a slightly different address.

It sounds like I could change the communication method to polling, but it also sounds like that is inefficient? It looks to me like I need to do something with the hostnames in the platform configuration to get the ZMQ communication working, but I'm not certain.

This message can occur when a ZMQ connection (the default communication method) times out. I think this has been caused by network configuration issues.

It’s really difficult debugging network issues from a distance. I suspect that hosts on the network are not able to connect to each other using their FQDNs (Fully Qualified Domain Names), and/or that the FQDN of a host as seen from one place is not the same as seen from another. Here’s some information I hope will help in tracking down the issue…

FQDNs

Because Cylc is a distributed system, its components can run on, and move between, different hosts on the network. This means each host needs to have a unique hostname and, where applicable, hosts need to be able to see other hosts using this unique hostname.

For safety, by default Cylc uses the FQDN (Fully Qualified Domain Name) of a host when connecting to it, i.e. Cylc is using the value of hostname -f (the longer form with the webaddress bit) rather than the short name. Long story short, each host should be contactable via its FQDN from itself and from any other hosts which Cylc needs to connect to it from.

You may find the network configuration needs some small changes to allow hosts to connect to each other using the FQDN.

For situations where you’re unable to influence the network configuration of the job hosts, Cylc allows you to choose a different method of host identification. The options are:

  • name (default) - Uses hostname (FQDN).
  • address - Uses the IP address.
  • hardwired (last resort) - Allows you to manually hardcode a name/address for each host.

The address option might be worth a try.

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[scheduler][host%20self-identification]
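
For example, switching to the address method would look something like this in your global.cylc (a minimal sketch; the commented target value is just a placeholder, and the reference above has the full set of options):

[scheduler]
    [[host self-identification]]
        method = address
        # optionally point the address lookup at a host on your own network:
        # target = internal.example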

Communication Methods

It sounds like I could change the communication method to polling, but it also sounds like that is inefficient?

Cylc has three communication methods:

  • ZMQ (a protocol built on TCP) - The preferred approach. Requires open TCP sockets.
  • SSH - Fallback if opening TCP sockets between hosts is not permitted.
  • Polling - Last resort if TCP/SSH are blocked, uses pull rather than push communication.

In polling mode, Cylc uses ssh to connect to the job host and inspect submitted and running tasks at a configured time interval. This is “pull” rather than “push” communication, so updates will be received less frequently. These ssh connections also add a little to network load, so it’s advised not to configure Cylc to poll too frequently.

If TCP or SSH are permitted these methods are preferable.

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[platforms][<platform%20name>]communication%20method
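
If you do find you need polling for a particular platform, it can be turned on per platform in global.cylc, along these lines (a rough sketch with an illustrative platform name and interval; see the reference above for the exact settings):

[platforms]
    [[be-mach]]
        communication method = poll
        submission polling intervals = PT5M
        execution polling intervals = PT5M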

Platform Configs

It looks to me like I need to do something with the hostnames in the platform configuration to get the ZMQ communication working, but I'm not certain.

We’ve written up some example platform configs which might help:

https://cylc.github.io/cylc-doc/latest/html/reference/config/writing-platform-configs.html

If you do not specify hosts then Cylc uses the platform name, i.e. the following two examples are equivalent:

[platforms]
    [[foo]]
        hosts = foo

[platforms]
    [[foo]]

“to a remote machine with a configuration like … from a host defined like”

Note that [platforms] configures the job hosts. The “scheduler” host (the place where the workflow process runs) defaults to the host where cylc play is run, but can be configured with run hosts:

https://cylc.github.io/cylc-doc/latest/html/reference/config/global.html#global.cylc[scheduler][run%20hosts]
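
For example (host names are illustrative; see the reference above for the exact settings):

[scheduler]
    [[run hosts]]
        available = fe-mach01, fe-mach02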

Note that install target = localhost means that the job host shares the same $HOME filesystem as the scheduler host (where the workflow runs). If this is not the case, choose an arbitrary name, e.g. install target = hpc-filesystem (this tells Cylc that it needs to install the workflow files on this platform).
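
So a platform with its own filesystem might look something like this (all names here are purely illustrative):

[platforms]
    [[be-mach]]
        hosts = be-mach-login
        install target = hpc-filesystem
        job runner = pbs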

Hope this helps,
Oliver

Thank you very much for the comprehensive answer @oliver.sanders! I think this should give me some things to track down to try to figure out the issue. I think @swartn was seeing similar issues on our old system, so it's probably related.

RE: the install target, yeah we are still using the hack that is discussed here because all our machines share a home directory but not the scratch spaces, so we make links under cylc-run.

It looks like adding:

[scheduler]
    [[host self-identification]]
        method = address
        target = internal.domain.ca

to my global.cylc has potentially fixed this @oliver.sanders! Thanks for the guidance.

However, I was wondering if you could enlighten me on a few things:

  1. under the default settings, while on the job host, how does it attempt to determine the FQDN of the machine that ran cylc play? It seems like a chicken-and-egg type of problem where it can only get the FQDN if it already has some name for it? Or does the job host use some stored name, and then say

    "I want a more secure name, lets try to get it"
    

    and then do something like ssh origname "hostname -f"?

  2. Related to 1., for the CylcError I saw above, would it be the job host that first generated the fe-machlogin.webaddress name, or was it the server host? When I’m on the server host and run hostname -f, I get the name that doesn’t have the webaddress, but if I’m on the job host and run ssh originname "hostname -f" I get the same thing? So I’m not sure who’s making the one with the webaddress portion.

  3. When using the address/target combo, how does Cylc use the given domain to figure out the address?

Presumably the same goes for job success, not just failure? (The same systems are used to communicate any task job completion or message).

under the default settings, while on the job host, how does it attempt to determine the FQDN of the machine that ran cylc play?

The scheduler (started by cylc play) has to “self-identify” its location to its jobs, so that they know where to report back their status etc. It does this by putting a contact file on the job host, which contains the scheduler hostname or IP address, and port. When the Cylc job wrapper tries to report job status back to the scheduler it just uses the location in the contact file.
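
If you want to see exactly what the scheduler recorded, the contact file lives under the workflow's .service directory in ~/cylc-run; for the run in your log a quick check in Python would be something like this (adjust the workflow id to your own run):

from pathlib import Path

# Print the contact file written by the scheduler for this run.
contact = Path.home() / "cylc-run" / "test-dev" / "run1" / ".service" / "contact"
print(contact.read_text())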

The scheduler uses socket.getfqdn() in Python to get its own FQDN (which should match hostname -f in the shell). However, sometimes the FQDN returned on the scheduler host is not what’s needed to contact the scheduler host from other hosts on the network. That’s down to your network configuration, which is why we provide other self-identification methods, including hardwiring if needed.

(I think this answers your second question too?)
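
One quick check, if it helps: compare what Python reports on the scheduler host with what the job host can actually resolve (the hostname below is just a placeholder for whatever ends up in the contact file):

import socket

# On the scheduler host: the name Cylc will self-identify with by default.
print(socket.getfqdn())

# On the job host: check that this name actually resolves from there.
print(socket.gethostbyname("fe-machlogin.webaddress"))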

That’s done in cylc/flow/hostuserutil.py. According to the module docstring, we are using this method: Get local IP Address with Python | Linux-Support.com
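
The trick described there, roughly paraphrased (a sketch of the idea rather than Cylc's exact code), is to connect a UDP socket towards the target and ask the OS which local address it picked for the route; no traffic is actually sent:

import socket

def get_local_ip_address(target: str) -> str:
    # connect() on a UDP socket only selects a route; nothing is transmitted.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((target, 80))  # the port number is arbitrary here
        return sock.getsockname()[0]

# e.g. with the target you set in global.cylc:
print(get_local_ip_address("internal.domain.ca"))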

Great, thanks @hilary.j.oliver - that clears things up!

Regarding:

Presumably the same goes for job success, not just failure? (The same systems are used to communicate any task job completion or message).

Yes, that is the case - I’m just early in the migration effort, so successes are rare :wink: