As part of my workflow, I am submitting jobs (via cylc play) to a remote machine with a configuration like:
[[be-mach]]
cylc path = /home/sci123/miniconda3/envs/cylc-8.0rc3/bin
job runner = pbs
install target = localhost
global init-script = """
export WORK_SPACE=/path/to/platform/specific/scratch
"""
from a host defined like:
[[fe-mach]]
cylc path = /home/sci123/miniconda3/envs/cylc-8.0rc3/bin
job runner = pbs
install target = localhost
global init-script = """
export WORK_SPACE=/path/to/platform/specific/scratch
"""
However, when the job fails on the be-mach platform, it takes a long time before the failure is recognized on fe-mach, where I believe the server is running (I think it’s a server? Sorry if I’m not using the terminology correctly).
Eventually the failure is picked up. I’ve been digging around the documentation and source code, and it looks like I might be hitting the default polling interval of 15 minutes because the default zmq/TCP communication is failing? When I run cylc play without --debug, I see the following in the job.err file:
CylcError: the workflow is no longer running at fe-machlogin.webaddress:PORTNUM
It has moved to fe-machlogin:PORTNUM
i.e. the webaddress portion is removed?
Running it with --debug, I get:
2022-07-21T23:36:50Z DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': '\nmutation (\n $wFlows: [WorkflowID]!,\n $taskJob: String!,\n $eventTime: String,\n $messages: [[String]]\n) {\n message (\n workflows: $wFlows,\n taskJob: $taskJob,\n eventTime: $eventTime,\n messages: $messages\n ) {\n result\n }\n}\n', 'variables': {'wFlows': ['test-dev/run1'], 'taskJob': '60000101T0000Z/model_run/01', 'eventTime': '2022-07-21T23:36:50Z', 'messages': [['CRITICAL', 'failed/ERR']]}}, 'meta': {'prog': 'message', 'host': 'cmpnode-999', 'comms_method': 'zmq'}}
Traceback (most recent call last):
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 675, in main
    execute_cmd(command, *cmd_args)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 283, in execute_cmd
    entry_point.resolve()(*args)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/terminal.py", line 226, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/scripts/message.py", line 173, in main
    record_messages(workflow_id, task_job, messages)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 88, in record_messages
    send_messages(workflow, task_job, messages, event_time)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 129, in send_messages
    pclient('graphql', mutation_kwargs)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 253, in serial_request
    self.loop.run_until_complete(task)
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 198, in async_request
    self.timeout_handler()
  File "/home/sci123/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client.py", line 317, in _timeout_handler
    raise CylcError(
where the error text is the same as before: the workflow is no longer running at the original address and has moved to a slightly different address.
It sounds like I could switch the communication method to polling, but it also sounds like that is inefficient? It looks to me like I need to do something with the hostnames in the platform configuration to get the zmq communication working, but I’m not certain.
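
To narrow down the hostname question, one check I could run on a compute node is whether both forms of the host name from the error message resolve there. This is only a minimal sketch, assuming Python 3 is available on the job host; check_hosts is a hypothetical helper of my own (not part of Cylc), and it only tests name resolution, not whether the zmq port is reachable.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_hosts(
    hostnames: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each host name to its IP address, or None if it does not resolve.

    `resolve` is injectable so the helper can be exercised without DNS;
    by default it uses the standard library's socket.gethostbyname.
    """
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            results[name] = None
    return results
```

Calling check_hosts(["fe-machlogin.webaddress", "fe-machlogin"]) from a be-mach compute node (with the placeholder names replaced by the real ones) would show which form, if either, the job host can resolve. If only one of them resolves, that would at least be consistent with the "It has moved to" message above.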