Cannot transfer logs from computational nodes to cylc server

Hello, I have a setup whereby the cylc server submits slurm jobs to a remote host by ssh’ing into a login node. It looks like the copying of the slurm logs back to the cylc server is tacked on as part of the slurm job itself, but in my case there is a problem: the computational nodes know nothing about the cylc server (the error message is “sh: connect to host fidji05-sihpc.meteo.fr port 22: No route to host”). I have verified this myself with a simple script that only attempts the ssh. How can I work around this issue?
Thanx
Gaby

Hi @gturek :grin:

Are you using Cylc 8?

Remote job log retrieval is done via rsync on the scheduler run host (if you run your workflow with cylc play --no-detach --debug, the full rsync command will be printed). So I don’t think that can be the problem you’re seeing. If you can submit a job using ssh, then you should be able to retrieve the job logs too.
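For example, something along these lines (the workflow ID is just a placeholder; the exact rsync invocation will depend on your setup):

cylc play --no-detach --debug <workflow-id>
# the rsync command used for log retrieval appears in the debug output
# (and in the scheduler log under ~/cylc-run/<workflow-id>/log/scheduler/)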

The error message is “sh: connect to host fidji05-sihpc.meteo.fr port 22: No route to host”

Where exactly are you seeing that message? Is it in the job.err log on the remote host?

Does the job status update to running and then succeeded or failed in the Cylc scheduler? (If not, you have a problem with job status messaging from the job host to the scheduler host, which might be the case if you configured Cylc to use ssh for messaging, instead of TCP or job polling, for the job platform: see Tracking Job Status — Cylc 8.3.0 documentation.)

Hi Hilary. Yes, I am using the latest version (8.3.0). The message appears in the job.err log on the remote host.
On the cylc server side, the job-activity.log for all tasks has the message “[(('job-logs-retrieve', 'succeeded'), 1) err] File(s) not retrieved: job.out”. On the remote host, job.out shows the job completed, but there is also a job.err file with the following:

+ for func_name in ''\''env_script'\''' ''\''user_env'\''' ''\''pre_script'\''' ''\''script'\''' ''\''post_script'\'''
+ cylc__job__run_inst_func post_script
+ typeset func_name=post_script
+ shift 1
+ typeset -f cylc__job__inst__post_script
ClientError: ssh: connect to host fidjim05-sihpc.meteo.fr port 22: No route to host

I will try invoking the play command with the additional flags.

How have you configured job communication for this platform?

It would be useful to see the platform config, i.e. the output of cylc config -i '[platforms][<platform name>]'.

Here is the content of my global.cylc on the cylc server:

[platforms]
    [[belenos]]
        hosts = belenos
        communication method = ssh
        job runner = slurm
        retrieve job logs = True
        retrieve job logs command = rsync

‘belenos’ is the generic name for the HPC host. ssh to ‘belenos’ lands you on any one of 3 login nodes. Maybe I also need to explicitly name them all?

This is what is causing Cylc to attempt to SSH back to the scheduler host:

communication method = ssh

The communications method determines how jobs communicate updates back to the scheduler. Cylc offers two push-based mechanisms (zmq and ssh) and one pull-based mechanism (poll).

  • zmq (preferred) - requires open TCP ports.
  • ssh - If you can’t open ports, this uses SSH to jump to the scheduler host, then uses zmq to make it the rest of the way.
  • poll (last resort) - scheduled polling

If the job host cannot see the scheduler host, try zmq or poll.
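The method is selected per platform in global.cylc; a minimal sketch based on your platform definition (using poll as the example value):

[platforms]
    [[belenos]]
        hosts = belenos
        job runner = slurm
        # one of: zmq (the default if omitted), ssh, poll
        communication method = poll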

Hi there, yes, so I have removed the communication method entry, thus opting for the zmq default. But no cigar:

Sending DEBUG MODE xtrace to job.xtrace
DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': '\nmutation (\n  $wFlows: [WorkflowID]!,\n  $taskJob: String!,\n  $eventTime: String,\n  $messages: [[String]]\n) {\n  message (\n    workflows: $wFlows,\n    taskJob: $taskJob,\n    eventTime: $eventTime,\n    messages: $messages\n  ) {\n    result\n  }\n}\n', 'variables': {'wFlows': ['glo12_cylc/run1'], 'taskJob': '1/20240403_so_sst/01', 'eventTime': '2024-07-16T09:01:26Z', 'messages': [['INFO', 'started']]}}, 'meta': {'prog': 'message', 'host': 'belenos575.belenoshpc.meteo.fr', 'comms_method': 'zmq'}}
Traceback (most recent call last):
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 703, in main
    execute_cmd(command, *cmd_args)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 334, in execute_cmd
    entry_point.load()(*args)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/terminal.py", line 282, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/message.py", line 200, in main
    record_messages(workflow_id, job_id, messages)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/task_message.py", line 89, in record_messages
    send_messages(workflow, job_id, messages, event_time)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/task_message.py", line 130, in send_messages
    pclient('graphql', mutation_kwargs)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/network/client.py", line 120, in serial_request
    loop.run_until_complete(task)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/network/client.py", line 303, in async_request
    raise ClientTimeout(
cylc.flow.exceptions.ClientTimeout: Timeout waiting for server response. This could be due to network or server issues.
* You might want to increase the timeout using the --comms-timeout option;
* or check the workflow log.

Looks like I may have to go with polling?

If using ZMQ, the scheduler ports will need to be open to the job host.

You can configure the port range using [run hosts]ports.
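For example, in global.cylc on the scheduler host, something along these lines (the range shown is only illustrative; it needs to match whatever your network admins will open up):

[scheduler]
    [[run hosts]]
        # restrict the scheduler to a known port range that the firewall allows
        ports = 43001 .. 43100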

Unfortunately, the scheduler is not visible to the compute nodes, just the login nodes.

Can you ssh from the compute nodes to the login nodes? If so, you could try using the ssh comms method and configuring ssh to use one of the login nodes as a proxy, i.e. in your .ssh/config on the login node add:

Host <cylc-server-name>
  ProxyJump <login-node-name>

I’ll try a few things and report back here. Thanx and cheers! :slight_smile:

So I tried @dpmatthews’ suggestion and it also didn’t work, although I got a different error than “no route to host”:

(cylc) [turekg@belenoslogin2: 01] more job.err
Sending DEBUG MODE xtrace to job.xtrace
DEBUG - running command:
    $ ssh -oBatchMode=yes -oConnectTimeout=10 \
        fidjim05-sihpc.meteo.fr env CYLC_VERSION=8.3.0 \
        CLIENT_COMMS_METH=ssh bash --login -c 'exec "$0" "$@"' \
        cylc client --comms-timeout=300 glo12_cylc/run1 graphql
Traceback (most recent call last):
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/bin/cylc", line 10, in <module>
    sys.exit(main())
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 703, in main
    execute_cmd(command, *cmd_args)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/cylc.py", line 334, in execute_cmd
    entry_point.load()(*args)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/terminal.py", line 282, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/scripts/message.py", line 200, in main
    record_messages(workflow_id, job_id, messages)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/task_message.py", line 89, in record_messages
    send_messages(workflow, job_id, messages, event_time)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/task_message.py", line 130, in send_messages
    pclient('graphql', mutation_kwargs)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/network/client.py", line 120, in serial_request
    loop.run_until_complete(task)
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ext/mr/smer/turekg/miniforge3/envs/cylc/lib/python3.9/site-packages/cylc/flow/network/ssh_client.py", line 85, in async_request
    raise ClientError(err, f"return-code={proc.returncode}")
cylc.flow.exceptions.ClientError: /bin/bash: BASH_XTRACEFD: 19: invalid value for trace file descriptor
Connection timed out during banner exchange

Not sure what that means.

I think it just means the ssh isn’t working. Try running the following within a Slurm job:

ssh -oBatchMode=yes -oConnectTimeout=10 fidjim05-sihpc.meteo.fr hostname

You need to make sure that works.
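For example, a throwaway batch script along these lines (add whatever partition/account directives your site requires):

#!/bin/bash
#SBATCH --job-name=ssh-test
#SBATCH --time=00:05:00
#SBATCH --output=ssh-test.out

# if this prints the scheduler host's hostname, the compute node can reach it over ssh
ssh -oBatchMode=yes -oConnectTimeout=10 fidjim05-sihpc.meteo.fr hostname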

Indeed it does not work. Games with subnets. Basically, the login nodes have two IP addresses. The compute nodes can see the login nodes via one IP address, but cannot see the cylc server. The cylc server sees the login nodes under a different IP address, so those two can see each other.
I can’t see a way to overcome this with either ssh or zmq. Moving the cylc server to a login node is not practical. Unfortunately for me, the HPC owners are not at all accommodating, so we have to work around their setup as best we can. I will try polling next.

Job polling is supported exactly for this sort of situation, where you can’t get the network sorted for pushing messages back from the job platform. That should work - let us know how it goes.
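Switching the platform over should just be a matter of something like this in global.cylc (the intervals are only illustrative starting points):

[platforms]
    [[belenos]]
        hosts = belenos
        job runner = slurm
        retrieve job logs = True
        communication method = poll
        # how often to check whether a submitted job has started
        submission polling intervals = PT1M
        # how often to check whether a running job has succeeded or failed
        execution polling intervals = PT1M, PT5M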

Seems to work OK, but I am not using it “in anger” yet, and I’ll have to see how I manage this in a suite where tasks take anything from a few minutes to a number of hours.

I am not using it “in anger” yet

We’ve tested this approach on fairly busy systems, and I’m aware other sites are using polling. The main impact is on the filesystem itself. Setting the polling intervals sensibly should avoid issues.

tasks take anything from a few minutes to a number of hours

The execution polling intervals can be configured to vary the polling frequency over time, e.g.

# * poll every 30s up to five times
# * then poll every 2 minutes up to 2 times
# * then poll every 5 mins thereafter
execution polling intervals = 5*PT30S, 2*PT2M, PT5M

If needed, the execution polling intervals can also be configured on a per-task basis.
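e.g. in flow.cylc, something like this (the task names here are hypothetical):

[runtime]
    [[long_model_run]]
        platform = belenos
        # long task: back off to infrequent polls after the first few checks
        execution polling intervals = 5*PT30S, 2*PT2M, PT10M
    [[quick_postproc]]
        platform = belenos
        # short task: poll frequently so the scheduler notices completion promptly
        execution polling intervals = PT30S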