This is really weird. I’m running Cylc through a conda env containing both cylc-flow and the binaries/packages used in my workflow.
I had some issues at first because the subprocesses spawned by Cylc did not run with the correct conda environment, but running conda config --set auto_activate_base false seemed to solve the issue. Or so I thought.
I’m now trying to run an R script, with libraries already installed through conda.
flow.cylc:

    [[mzml_to_features]]
        script = which Rscript; Rscript $CYLC_WORKFLOW_RUN_DIR/bin/mzml_to_features

mzml_to_features:

    file.path(R.home("bin"), "R")
    library("proFIA")
When I’m in an interactive shell with the conda env activated, both which Rscript and file.path(R.home("bin"), "R") return /Users/elliotfontaine/mambaforge/envs/myenv/lib/R/bin/R
But when run inside the workflow:
which Rscript
# /usr/local/bin/Rscript
file.path(R.home("bin"), "R")
# [1] "/Library/Frameworks/R.framework/Resources/bin/R"
Because of that, library("proFIA") fails with a “package not found”-like error.
Any idea what is happening here? It clearly happens before entering the R interpreter, since even which doesn’t give the correct path. I only have this issue with R; Python packages available only in the conda env can be found inside subprocesses.
This is a common misconception, and we should document it better: Cylc jobs are not supposed to run within the Cylc conda environment (or whatever kind of environment you install Cylc into).
This keeps your job environments separate from your Cylc installation and allows you to use different environments for different jobs. Add whatever environment activation script(s) you might have to the task configuration.
E.g. to activate a conda environment, you could do something like this:
[runtime]
    [[mytask]]
        script = conda run -n <env-name> Rscript $CYLC_WORKFLOW_RUN_DIR/bin/mzml_to_features
If you have multiple tasks that you want to run in the same environment, you could define a Cylc “family” that activates the environment:
[runtime]
    # this is a family, they're defined just like regular tasks
    [[R_ENV]]
        env-script = """
            # NOTE: some conda packages will cause errors during activation.
            # This "set +/-eu" dance suppresses such errors.
            set +eu
            conda activate cylc-8.3.dev
            set -eu
        """

    # this is a task that pulls in the family configuration
    [[mytask_1]]
        inherit = R_ENV
        script = which Rscript

    # this is a task that pulls in the family configuration
    [[mytask_2]]
        inherit = R_ENV
        script = Rscript --version
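To see why the "set +/-eu" dance is needed, here is a small standalone sketch. The activate_hook function below is hypothetical, standing in for a conda package's activation hook; the point is that under set -u, expanding an unset variable aborts the script, so strict mode has to be relaxed around the activation step.

```shell
#!/usr/bin/env bash
set -eu  # strict mode, as in the env-script above

# Hypothetical stand-in for a package activation hook that expands a
# variable which may be unset (a real failure mode under "set -u").
activate_hook() {
    echo "extra prefix: $SOME_UNSET_VAR"  # would abort the job under "set -u"
    export HOOK_RAN=yes
}

set +eu          # relax strict mode around activation
activate_hook
set -eu          # restore strict mode for the rest of the job

echo "HOOK_RAN=$HOOK_RAN"
```

Without the set +eu, bash would exit at the unbound variable expansion and the job would fail before your script even starts.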
Note: Cylc is a distributed system; it works so long as the cylc executable is in $PATH.
You can easily test this by running a tiny workflow after activating some random conda environment:
[scheduling]
    [[graph]]
        R1 = "foo"

[runtime]
    [[foo]]
        # the "cheese" command is only in my "cheese" conda environment
        script = """
            which cheese || echo 'cheese not found'
        """
If I cylc play this after doing conda activate cheese, task foo’s job log will contain <cheese-env-path>/bin/cheese (i.e. foo IS running in the scheduler environment). But if I set clean job submission environment = True, it will print “cheese not found”.
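For reference, clean job submission environment is (to the best of my knowledge) a platform setting in the Cylc 8 global configuration (global.cylc), not flow.cylc. A minimal sketch; check the global configuration reference for your Cylc version:

```
# global.cylc (not flow.cylc)
[platforms]
    [[localhost]]
        clean job submission environment = True
        # if supported on your version, selected variables can be
        # passed through to jobs, e.g.:
        # job submission environment pass-through = HOME, TMPDIR
```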
(Note I do not explicitly activate a Cylc environment to run Cylc commands - I have a cylc wrapper in my default PATH as documented here: Installation — Cylc 8.2.3 documentation)
Typically, however, our workflows don’t only run local background jobs. Then, even with clean job submission environment = False:
- remote jobs obviously can’t see the local scheduler environment
- local jobs submitted to a batch scheduler may or may not see it; that’s down to the batch system
As I understand it, my Cylc install wasn’t optimal: I shouldn’t have to activate a conda env in the interactive shell to find cylc.
I’ve remade a conda environment named cylc containing only cylc-flow and cylc-uiserver.
I decided to follow your advice and use the wrapper script. I actually missed that part of the installation process while reading the documentation, my bad.
I’ve set up the wrapper like this:
##############################!!! EDIT ME !!!##################################
# Centrally installed Cylc releases:
CYLC_HOME_ROOT="${CYLC_HOME_ROOT:-/Users/elliotfontaine/mambaforge/envs/cylc}"
# Users can set CYLC_HOME_ROOT_ALT as well (see above), e.g.:
CYLC_HOME_ROOT_ALT=${HOME}/mambaforge/envs
# Global config locations for Cylc 8 & Rose 2 (defaults: /etc/cylc & /etc/rose)
# export CYLC_SITE_CONF_PATH="${CYLC_SITE_CONF_PATH:-/etc/cylc}"
# export ROSE_SITE_CONF_PATH="${ROSE_SITE_CONF_PATH:-/etc/rose}"
###############################################################################
I forgot to mention an important detail: for the moment, I’m planning for the workflow to be used locally. It isn’t that computationally intensive. I’m mostly using Cylc for its design philosophy (cyclical workflows).
Now, I’m a bit confused by that.
Do you specifically mean that I should not activate a conda env for Cylc itself, or did you mean even for tasks?
Would something like this be OK in the task definitions, like @oliver.sanders presented?
[runtime]
    # this is a family, they're defined just like regular tasks
    [[R_ENV]]
        env-script = """
            # NOTE: some conda packages will cause errors during activation.
            # This "set +/-eu" dance suppresses such errors.
            set +eu
            conda activate r_conda_env
            set -eu
        """

    # this is a task that pulls in the family configuration
    [[mytask_1]]
        inherit = R_ENV
        script = which Rscript

    # this is a task that pulls in the family configuration
    [[mytask_2]]
        inherit = R_ENV
        script = Rscript --version
Note my comments above on clean job submission environment = False apply to shell environments in the general sense, not conda environments specifically. I.e., with that setting, all exported variables in the scheduler’s environment will be available to local background jobs.
However, I’ve been told (I think) that conda activate does more than just set environment variables, in which case your jobs may still need to explicitly activate the environment they need.
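That’s right: besides exporting variables such as PATH and CONDA_PREFIX, conda activate also sources any scripts an environment ships in etc/conda/activate.d/, and those can run arbitrary setup code. A standalone sketch of that mechanism (the prefix directory and hook here are created on the fly, not a real conda env):

```shell
#!/usr/bin/env bash
set -eu

# Fake "environment prefix" standing in for a real conda env.
env_prefix=$(mktemp -d)
mkdir -p "$env_prefix/etc/conda/activate.d"

# A hook like those shipped by some conda packages: activation can
# run arbitrary setup code, not just export variables.
cat > "$env_prefix/etc/conda/activate.d/demo.sh" <<'EOF'
export DEMO_HOOK_RAN=yes
EOF

# This is (roughly) what "conda activate" does after adjusting PATH:
# source every activation hook shipped by the environment.
for hook in "$env_prefix"/etc/conda/activate.d/*.sh; do
    . "$hook"
done

echo "DEMO_HOOK_RAN=$DEMO_HOOK_RAN"
rm -rf "$env_prefix"
```

Because exported variables alone don’t reproduce these hooks, jobs that rely on them do need an explicit conda activate (or conda run) rather than just inheriting the scheduler’s environment.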