Job.err: CylcError: Cannot determine whether workflow is running on

Hi,

I am trying to install and test Cylc on the HPC system at KAUST. I am using Cylc version 8.1.2.

  • The Cylc scheduler runs on the login node itself (specifically cdl5), and the job runner is Slurm.

  • The compute nodes of the HPC can’t access the home directory, so I have set the following in my global.cylc:

[platforms]
    [[localhost]]
        job runner = slurm
        global init-script = """
            export CYLC_RUN_DIR=/lustre/scratch/athippp/cylc-run
        """
[install]
    [[symlink dirs]]
        [[[localhost]]]
            run = /lustre/scratch/athippp

I was trying to run the tutorial example “runtime-introduction”. The jobs are submitted and completed successfully through slurm but I get the following errors in the job.err:

2023-03-09T15:14:32Z DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 cdl5.hpc.kaust.edu.sa env CYLC_VERSION=8.1.2 CYLC_ENV_NAME=cylc bash --login -c 'exec "$0" "$@"' cylc psutil # returned 255
Host key verification failed.
cylc.flow.exceptions.CylcError: Cannot determine whether workflow is running on cdl5.hpc.kaust.edu.sa.

Probably because of this, the task state is only updated to succeeded (polled) after a long time.
Any help with this would be appreciated.

Thanks

Hi,

Cylc currently supports three methods for tracking job status:

  • zmq (push notifications via TCP)
  • ssh (push notifications via SSH)
  • poll (pull notifications)

The zmq method is the default, and this is what’s causing the error message you’ve noticed. Even when push notifications are enabled, Cylc still polls jobs occasionally to stay robust against exactly this sort of issue, which is why the job was eventually updated as succeeded.

  • The zmq method requires open TCP ports.
  • All push notifications require a functional DNS setup to allow the compute node to find the workflow server (i.e. the login node for your setup) by the hostname it identifies with (e.g. the result of hostname -f).

Unfortunately, neither of these is guaranteed on HPC platforms. It is very hard for us to diagnose connection issues from a distance, but some likely causes are:

  • Login node not visible to compute node due to DNS.
  • Inconsistent DNS entries across nodes.
  • SSH not set up for non-interactive use (e.g. prompts are shown). It’s been reported that in rare cases SSH can be configured to use authentication methods which work in interactive sessions but fail for non-interactive use. Also, if you have no $HOME dir on the compute nodes then SSH will probably fail anyway, although you can set $HOME in global init-script.
  • Required TCP port range is not open between the compute and login nodes.

Please talk to the system administrators to see what can be achieved.

The 255 in your error suggests that SSH is not available or is not set up for non-interactive use. Cylc is using SSH in zmq mode because it failed to connect to the workflow via zmq/TCP, so it tried to check over SSH whether the workflow was still running.

If neither zmq nor ssh is an option for you, then you can switch to the poll method and configure more frequent polling intervals.

To configure the communication method see this section in the docs:

https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[platforms][<platform%20name>]communication%20method
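
As a rough illustration of the polling option, a global.cylc fragment along these lines would switch the platform to polling and shorten the polling intervals. The setting names come from the platforms section of the global configuration reference. The PT1M values and the reuse of the localhost platform with Slurm are assumptions for this particular setup, so please check them against the docs for your Cylc version:

```cylc
[platforms]
    [[localhost]]
        job runner = slurm
        # Pull job status from the scheduler side instead of relying on
        # messages pushed from the compute nodes.
        communication method = poll
        # Illustrative values only (ISO 8601 durations); tune them to what
        # your batch system can tolerate.
        submission polling intervals = PT1M
        execution polling intervals = PT1M
```

Shorter intervals give faster task state updates at the cost of more frequent queries to the job runner, so it is worth agreeing a sensible value with the administrators.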

The TCP ports the Cylc scheduler listens on (on the login node in your case) are configured here:

https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[scheduler][run%20hosts]ports
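
For reference, a minimal sketch of that setting is shown here. The range given is, as far as I know, the documented default, so treat it as a placeholder for whatever range your administrators agree to open between the compute and login nodes:

```cylc
[scheduler]
    [[run hosts]]
        # Port range the scheduler may listen on, in "min .. max" form.
        # Placeholder range: replace with the range opened at your site.
        ports = 43001 .. 43101
```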

Oliver

Thanks Oliver. Yes, I will check with HPC administrators here.