Okay, I posted about this error here and have been attempting to figure it out, but no dice so far.
Summary of the problem
I’m trying to set up a platform configuration to run codes on internal machines with the following details:
- there are 4 internal machines that all share a home directory
- two of them have access to raw PBS commands (the “back end” machines), while the other two only have access to a wrapper around them (the “front end” machines). All four machines have access to this wrapper (called jobctl), which was implemented to allow for remote submission.
As of now, I am trying to get this simple flow to run:
[scheduler]
    UTC mode = True
[scheduling]
    # Stop the workflow 6 hours after the initial cycle point.
    initial cycle point = 2000-01-01T00
    final cycle point = +PT6H
    [[graph]]
        PT1H = """
            find_diamonds => sell_diamonds
            sell_diamonds[-PT1H] => find_diamonds
        """
[runtime]
    [[find_diamonds]]
        platform = ppp3
        script = find_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
    [[sell_diamonds]]
        platform = ppp3
        script = sell_diamonds.sh
        [[[directives]]]
            -l select=1:ncpus=1:mem=15gb:res_image=eccc/eccc_all_ppp_ubuntu-18.04-amd64_latest
            -q = development
where the ppp3 platform is defined as
[platforms]
    [[ppp3]]
        cylc path = /home/rcs001/miniconda3/envs/cylc-8.0rc1/bin
        install target = ppp3
        hosts = eccc1-ppp3, eccc2-ppp3, eccc3-ppp3
        job runner = jobctl
and jobctl.py is:
from cylc.flow.job_runner_handlers.pbs import PBSHandler


class JOBCTLHandler(PBSHandler):
    """
    Given that jobctl just is a wrapper around PBS, we _should_
    only need to alter the commands used as the PBS directives
    should be supported
    """
    POLL_CMD = "jobst"
    KILL_CMD_TMPL = "jobdel '%(job_id)s'"
    SUBMIT_CMD_TMPL = "jobsub '%(jobs)s'"


JOB_RUNNER_HANDLER = JOBCTLHandler()
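In case the template syntax is unfamiliar: the `%(name)s` placeholders are ordinary Python %-style format keys. Below is a minimal sketch of how such a template might be expanded, assuming the caller fills it from a dict and splits the result into arguments. The `expand` helper and the sample job ID are hypothetical, and the snippet does not use Cylc itself.

```python
import shlex

# Same template as in the handler above.
KILL_CMD_TMPL = "jobdel '%(job_id)s'"


def expand(template, **values):
    """Fill a %-style template from keyword values and split it into argv tokens."""
    return shlex.split(template % values)


# Hypothetical job ID, for illustration only.
print(expand(KILL_CMD_TMPL, job_id="12345.ppp3"))
```

A placeholder whose name is missing from the supplied values raises `KeyError`, so the key names in the templates have to match whatever the caller actually passes in.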
When I try to install/play the flow, I get the following error:
STDOUT:
REMOTE INIT FAILED
Unexpected key directory exists: /home/rcs001/cylc-run/batch-test/run12/.service/client_public_keys Check global.cylc install target is configured correctly for this platform.
After wrestling with this for a while, thinking it was a problem with the way the ssh command was working, I realized that this might be caused by the fact that these machines (including the workflow host) all share a $HOME directory, and thus the cylc-run directory already exists. Would that explain why it's failing due to the client_public_keys directory already existing?
I tried looking at the source for cylc remote-init, and it looks to me like it's trying to effectively install files on the remote host? Looking through the platform config documentation, it looks like you can define a shared filesystem, but the example seems to assume that the shared filesystem is outside the $HOME directory, and thus the cylc-run directory still needs to be created?
Is there a way I can set up the configuration so I can use remote machines/submission, but make it know the $HOME directories are shared?