REMOTE INIT FAILED error on pbs but manual submission works?

I am stuck trying to submit a basic suite via pbs on NCAR Cheyenne. If I submit via cylc play, I get the following message in job-activity.log:

[jobs-submit cmd] (init cheyenne)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
[jobs-submit cmd] (remote init)
[jobs-submit ret_code] 1

However, if I manually submit the generated job file via ‘qsub job’, it runs fine. The suite also executes correctly if I comment out the ‘platform’ line in my flow.cylc and run it on the login node, so the error seems confined to interaction with the batch system.

My flow.cylc is:

[meta]
    title = "Basic cylc suite for executing L96 model"
[scheduling]
    [[dependencies]]
        graph = "L96"
[runtime]
    [[root]]
        platform = cheyenne
        execution time limit = PT10M
    [[L96]]
        script = "python /glade/work/bcash/ssf/L96/L96_roxie.py"
        [[[directives]]]
            -q = regular
            -A = UMIN0005
            -l select=1:mpiprocs=1:ncpus=1
            -M = bcash@gmu.edu

And my global.cylc is:
[platforms]
    [[cheyenne]]
        job runner = pbs

Hello,

Unfortunately that job activity log output doesn’t contain any information about why the remote init failed. The error should have been written to the workflow log log/suite/log (soon to be log/workflow/log).

Taking a guess at the problem: are you able to SSH directly into “cheyenne” from the scheduler host (i.e. ssh cheyenne hostname), or do you have to go via a login node? If login nodes are required, they will need to be configured like so:

[platforms]
    [[cheyenne]]
        hosts = cheyenne-01, cheyenne-02  # host will be picked at random from this list
        job runner = pbs

Otherwise, if the problem is still not evident, running the workflow in debug mode might yield some insights (cylc play <flow> --debug). The command that Cylc runs to initiate the “remote init” should be logged there (search for remote-init).
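
Searching the log can also be scripted. This is a rough Python sketch under a couple of assumptions: the function names are hypothetical, and the "COMMAND FAILED" and "COMMAND STDERR" markers are assumed to be how the scheduler log reports a failed command, so adjust the keywords if your log looks different.

```python
from pathlib import Path

# Substrings worth looking for. "remote-init" is the search term suggested
# above; the two COMMAND markers are an assumption about the log format.
KEYWORDS = ("remote-init", "COMMAND FAILED", "COMMAND STDERR")


def find_remote_init_lines(log_text):
    """Return (line_number, line) pairs that mention any keyword."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(keyword in line for keyword in KEYWORDS):
            hits.append((number, line))
    return hits


def scan_log(path):
    """Read a workflow log file and return its matching lines."""
    text = Path(path).read_text(errors="replace")
    return find_remote_init_lines(text)
```

Pass scan_log the path to the workflow log inside your run directory (log/suite/log, as mentioned above) and print the returned pairs to see the exact ssh command and any error it produced.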

Thanks! I hadn’t spotted the information in log/suite. I reran with debug and it looks like it is what you suspected:

2021-06-03T13:08:08-06:00 DEBUG - ['ssh', '-oBatchMode=yes', '-oConnectTimeout=10', 'cheyenne', 'env', 'CYLC_VERSION=8.0b1', 'bash', '--login', '-c', ''exec "$0" "$@"'', 'cylc', 'remote-init', '--debug', 'cheyenne', '$HOME/cylc-run/debug_submit/run1']
2021-06-03T13:08:08-06:00 ERROR - cheyenne: initialisation did not complete:
COMMAND FAILED (255): ssh -oBatchMode=yes -oConnectTimeout=10 cheyenne env CYLC_VERSION=8.0b1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc remote-init --debug cheyenne '$HOME/cylc-run/debug_submit/run1'
COMMAND STDERR: ssh: Could not resolve hostname cheyenne: Name or service not known
2021-06-03T13:08:08-06:00 DEBUG - BEGIN TASK PROCESSING
2021-06-03T13:08:08-06:00 INFO - [L96.1] -submit-num=01, host=cheyenne
2021-06-03T13:08:08-06:00 ERROR - [jobs-submit cmd] (init cheyenne)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED

I’m a bit confused though. I am already on a login node when I run ‘cylc play’, and if I try ‘ssh cheyenne1 hostname’ it asks me for a password. Do I need to set up passwordless ssh to the login nodes for submission to go through?

Update: passwordless ssh won’t work because 2FA is still required. @Jim it seems like you have solved this for cheyenne and casper, what step am I missing here?

But presumably you can’t submit to PBS from the login node that you’re already on, or if you can that would result in the job running (or attempting to run) in the wrong place?

If you need to ssh to a remote platform to submit jobs, then the Cylc scheduler can do that for you, but yes you have to have passwordless ssh set up.

Let’s see what @Jim says, but if 2FA is compulsory you might be able to get away with manually starting a persistent ssh tunnel to the job submission host.

If you are just trying to submit jobs on the same host where the Cylc scheduler is running, then you need to configure your platform to use localhost, i.e.:

[platforms]
    [[cheyenne]]
        job runner = pbs
        install target = localhost
        hosts = localhost

If I understand the question correctly, you do in fact submit to PBS from the login node. If I am working on Cheyenne, I access the machine via a login node and unless I start up an interactive session on the compute nodes everything I do is on that node. That includes submitting to the batch system, which is just 'qsub '. I never directly access the compute nodes - the scheduler is in charge of that.

Running via 7.8.3, which is centrally installed at NCAR, appears to work without a hitch. @Jim is also still using cylc7. Has something changed between 7 and 8 that would change the interaction with the job scheduler? Or is this likely to be an issue with how it is installed?

@bencash - you need to set up your cheyenne platform to submit locally, as I’ve shown above. If you don’t specify hosts for a platform, the default is to submit to a host with the same name as the platform, which isn’t what you want in this case.
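
To make that default concrete, here is a small illustrative Python sketch of the rule as described above. It is not Cylc's actual code: the function resolve_hosts and the dictionary layout are invented for the example.

```python
def resolve_hosts(platform_name, settings):
    """Return the hosts a platform would try, per the rule described above.

    Illustrative only. `settings` stands in for the items under a
    [[platform]] section of global.cylc, with "hosts" as a
    comma-separated string when it is set.
    """
    hosts = settings.get("hosts")
    if not hosts:
        # No hosts configured: fall back to a host named after the platform.
        return [platform_name]
    return [host.strip() for host in hosts.split(",")]


# The original config only set the job runner, so the scheduler tried to
# reach a host literally called "cheyenne".
original = resolve_hosts("cheyenne", {"job runner": "pbs"})

# With hosts = localhost, submission stays on the scheduler host.
fixed = resolve_hosts("cheyenne", {"job runner": "pbs", "hosts": "localhost"})
```

Under this reading, the "Could not resolve hostname cheyenne" error follows directly from the missing hosts setting rather than from anything PBS-specific.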

Thanks! I’ve made that modification and it seems to have worked. I’ll keep poking at it this morning to confirm.