Task subprocess isn't run in correct conda env... Or is it ?! 😵‍💫

Hi everyone,

This is really weird. I’m running Cylc through a conda env containing both cylc-flow and the binaries/packages used in my workflow.

I had some issues at first because the subprocesses spawned by Cylc did not run in the correct conda environment, but setting conda's auto_activate_base option to false solved the issue. Or so I thought.
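For reference, the exact command I used is just:

conda config --set auto_activate_base false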

I’m now trying to run an R script whose libraries are already installed through conda.

flow.cylc

[[mzml_to_features]]
        script = which Rscript; Rscript $CYLC_WORKFLOW_RUN_DIR/bin/mzml_to_features

mzml_to_features

file.path(R.home("bin"), "R")
library("proFIA")

When I’m in an interactive shell with the conda env activated, both which Rscript and file.path(R.home("bin"), "R") return /Users/elliotfontaine/mambaforge/envs/myenv/lib/R/bin/R

But when run inside the workflow:

which Rscript
# /usr/local/bin/Rscript

file.path(R.home("bin"), "R")
# [1] "/Library/Frameworks/R.framework/Resources/bin/R"

Because of that, library("proFIA") fails with a “package not found”-style error.

Any idea what is happening here? It clearly happens before even entering the R interpreter, since which doesn’t give the correct path. I only have this issue with R: Python packages only available in the conda env can be found inside subprocesses.

I’m losing my mind here :dizzy_face:

This is a common misconception (we should document it better): Cylc jobs are not supposed to run within the Cylc Conda environment (or whatever kind of environment you install Cylc into).

This keeps your job environments separate from your Cylc installation and allows you to use different environments for different jobs. Add whatever environment activation script(s) you might have into the task configuration.

E.g. to activate a Conda environment, you could do something like this:

[runtime]
    [[mytask]]
        script = conda run -n <env-name> Rscript $CYLC_WORKFLOW_RUN_DIR/bin/mzml_to_features

If you have multiple tasks that you want to run in the same environment, you could define a Cylc “family” that activates the environment:

[runtime]
    # this is a family, they're defined just like regular tasks
    [[R_ENV]]
        env-script = """
            # NOTE: some conda packages will cause errors during activation.
            # This "set +/-eu" dance suppresses such errors.
            set +eu
            conda activate cylc-8.3.dev
            set -eu
        """

    # this is a task that pulls in the family configuration
    [[mytask_1]]
        inherit = R_ENV
        script = which Rscript

    # this is a task that pulls in the family configuration
    [[mytask_2]]
        inherit = R_ENV
        script = Rscript --version

Note that Cylc is a distributed system; it works so long as the cylc executable is in the $PATH (more info).
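For example, a quick way to check that on a job platform is a throwaway task along these lines (the task name is just illustrative):

[runtime]
    [[check_cylc]]
        # this should resolve to a cylc wrapper/installation on the job host
        script = which cylc; cylc version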


Hi @elliotfontaine

@oliver.sanders states how you should do it:

… but actually, by default, for local background jobs at least, tasks DO run in the scheduler environment thanks to this [platforms] setting:

# global.cylc
[platforms]
   [[localhost]]
      clean job submission environment = False  # this is the default !!
      ...

The reason for this is documented here: Global Configuration — Cylc 8.2.3 documentation

You can easily test this by running a tiny workflow after activating some random conda environment:

[scheduling]
    [[graph]]
        R1 = "foo"
[runtime]
    [[foo]]
        # the "cheese" command is only in my "cheese" conda environment
        script = """
            which cheese || echo 'cheese not found'
        """

If I cylc play this after doing conda activate cheese, task foo’s job log will contain <cheese-env-path>/bin/cheese (i.e. foo IS running in the scheduler environment). But if I set clean job submission environment = True, it will print “cheese not found”.
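In other words, to make local background jobs ignore the scheduler environment, you just flip that setting:

# global.cylc
[platforms]
   [[localhost]]
      clean job submission environment = True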

(Note I do not explicitly activate a Cylc environment to run Cylc commands - I have a cylc wrapper in my default PATH as documented here: Installation — Cylc 8.2.3 documentation)

Typically, however, our workflows don’t only run local background jobs. Then, even with clean job submission environment = False:

  • remote jobs obviously can’t see the local scheduler environment
  • local jobs submitted to a batch scheduler may or may not see it - that’s down to the batch system

Thanks everyone for your detailed answers.

As I understand it, my cylc install wasn’t optimal: I shouldn’t have to activate a conda env in the interactive shell to find cylc.

I’ve remade a conda environment named cylc containing only cylc-flow and cylc-uiserver.
I decided to follow your advice and use the wrapper script. I actually missed that part of the installation process while reading the documentation - my bad.
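For completeness, the cylc environment itself was created with something like this (assuming the conda-forge channel):

conda create -n cylc -c conda-forge cylc-flow cylc-uiserver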

I’ve set up the wrapper like this:

##############################!!! EDIT ME !!!##################################
# Centrally installed Cylc releases:
CYLC_HOME_ROOT="${CYLC_HOME_ROOT:-/Users/elliotfontaine/mambaforge/envs/cylc}"

# Users can set CYLC_HOME_ROOT_ALT as well (see above), e.g.:
CYLC_HOME_ROOT_ALT=${HOME}/mambaforge/envs

# Global config locations for Cylc 8 & Rose 2 (defaults: /etc/cylc & /etc/rose)
# export CYLC_SITE_CONF_PATH="${CYLC_SITE_CONF_PATH:-/etc/cylc}"
# export ROSE_SITE_CONF_PATH="${ROSE_SITE_CONF_PATH:-/etc/rose}"
###############################################################################
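As a quick sanity check from a fresh shell (just to confirm the wrapper is the cylc being picked up, not a conda env):

which cylc      # should point at the wrapper script
cylc version    # should report the version installed in the "cylc" environment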

I forgot to mention an important detail: for the moment, I’m planning for the workflow to be used locally. It isn’t that computationally intensive. I’m mostly using Cylc for its design philosophy (cyclical workflows).

Now, I’m a bit confused by that.

Do you specifically mean that I should not activate a conda env for cylc itself, or did you mean even for tasks?

Would something like this be OK in the task definitions, like what @oliver.sanders presented?

[runtime]
    # this is a family, they're defined just like regular tasks
    [[R_ENV]]
        env-script = """
            # NOTE: some conda packages will cause errors during activation.
            # This "set +/-eu" dance suppresses such errors.
            set +eu
            conda activate r_conda_env
            set -eu
        """

    # this is a task that pulls in the family configuration
    [[mytask_1]]
        inherit = R_ENV
        script = which Rscript

    # this is a task that pulls in the family configuration
    [[mytask_2]]
        inherit = R_ENV
        script = Rscript --version

But then, what is $CYLC_HOME_ROOT_ALT about?

Note I do not explicitly activate a Cylc environment to run Cylc commands

Hilary means that he doesn’t need to explicitly activate the Cylc environment because the cylc wrapper script is in the $PATH.

But then, what is $CYLC_HOME_ROOT_ALT about?

This is for Cylc developers to point the wrapper script at their working copy of the cylc-flow source code; you can leave it unset.

Yes, what Oliver just said also goes for task job environments - you should arrange for the cylc wrapper to be in the default $PATH, for jobs too.
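E.g. if the wrapper lives somewhere that isn’t already on the default PATH (say ~/bin, purely as an example), something like this in a shell startup file that your task jobs will also read (which file that is depends on your shell and system):

# make the cylc wrapper visible to task jobs as well as interactive shells
export PATH="$HOME/bin:$PATH"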

Hi again @elliotfontaine

Note my comments above on clean job submission environment = False apply to shell environments in the general sense, not conda environments specifically. I.e., with that setting, all exported variables in the scheduler’s environment will be available to local background jobs.

However, I’ve been told (I think) that conda activate does more than just set environment variables, in which case your jobs may still need to explicitly activate the environment they need.
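One way to see what a given job actually ends up with is a small diagnostic task, e.g. (just a sketch, reusing the checks from the original post and the R_ENV family from above):

[runtime]
    [[check_env]]
        inherit = R_ENV
        script = """
            which Rscript
            Rscript -e 'cat(R.home("bin"), "\n")'
        """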

Hi everyone,

Just coming back to say that I did go the

[runtime]
    [[TRFP_ENV]]
        env-script = """
            set +eu
            conda activate wf-trfp
            set -eu
        """
    [[raw_to_mzml]]
        inherit = TRFP_ENV
        script = """
            thermorawfileparser --arguments...
        """

route for my workflow tasks. It seems to work just fine for now, but obviously the different Conda environments need to be created before running it.
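E.g. something along these lines for each environment, where the YAML file path is just illustrative:

# create a task environment before running "cylc play"
conda env create -n wf-trfp -f envs/wf-trfp.yml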

I launch Cylc via the wrapper script.
