The role of the UI server in production from a user (and permissions) point of view (cylc8)

Questions about cylc-8 usage

The following lists a whole bunch of assumptions/hopes about how Cylc8 might run for us in production. Due to my not being able to have 2 images in one post, I will respond to this with how we use Cylc7 in production to provide context.

  1. All of the relevant machines on this diagram are generally on the same network and can talk to each other.
  2. Cylc8 is installed on all of the blue VMs, including the new light-blue Cylc8 virtual machines (plus, as necessary, on the HPCs).
  3. Only the workflow machines (in the special box) run suites and submit jobs to PBS and thus to the HPCs.
  4. Systems Administrators directly install suites onto the workflow machines via the command line.
    1. How do the UI servers get connected to installed (and potentially running) suites?
  5. Suites run with the permissions of certain production realm accounts. For example the atmosphere user might “own” 10 suites, and the oceans user might “own” 5 suites etc etc.
  6. JupyterHub can run UI servers locally and on configurable extra VMs (hence the second light blue circle). I am not concerned if the remote UI servers are not possible, but earlier documentation suggested they are/were.
  7. Suite Owners can log into JupyterHub as themselves and scan and find all available suites (just like logging into a machine and running gscan) even where they don’t have write-access to those suites (just like with gscan). They can filter by user/workflow-vm/other filterables, or search by suite name. Suite owners cannot perform edit runs, change the suite.rc.processed via edit functionality or do any other edits on production suites (because they cannot raise their privileges to the realm account).
  8. Suite Owners don’t have to log out and log back in again or clear their cookies to view other suites running as different production users because it’s just read-only (just like with gscan and the cylc GUI).
  9. Support Staff can connect to JupyterHub and scan and find all available suites, with the same filtering as above, and make limited changes (pause, restart, change environment variables in edit runs…) to all the suites they have permission to affect.
  10. Support Staff don’t have to log out and log back in again or clear their cookies to act as different production users.
    1. Ideally we’d be able to use their Active Directory roles as an indication of which production users they can act on behalf of. (Edit authorisation needs to be checked both at the GUI level and again at the cylc daemon level, so the user’s Active Directory roles will have to be passed to the daemon as well, while being as minimally spoofable as possible.)
    2. For example Bob might be able to affect all suites running as atmosphere, while Jane might be able to affect all atmosphere and ocean suites.
    3. It would be good if cylc-8 allowed for an equivalent of “user” and “acting-as user”, always checking the permissions of the “acting-as user” for authorisation, with a default that “user” and “acting-as user” are identical, but allowing authentication/authorisation plugins that differentiate between them.
  11. Systems Administrators can do everything Support Users can and more, with full GUI functionality as well using command line access when necessary.
  12. Suite interactions via the command line also check for user authorisation – with the same “user” and “acting-as user” style plugin or equivalent.

From a user (suite-owner, systems administrator, support staff) point of view, given the above, what are the UI servers? Who needs to know about them?


Current cylc-7 usage:

At the moment, for cylc-7, our production looks something like the following:

As above, all machines are on the same network and can talk to each other.

Cylc is installed on all of the blue VMs (and the HPC), but only the workflow machines (in the special box) run suites and submit jobs to PBS and thus to the HPCs. Systems Administrators install suites directly onto their choice of workflow VM and set them running (cross-triggering allows for some flexibility as to which workflow VM suites are installed onto and thus gives us some manual load-balancing capacity). Suites run with the permissions of certain production accounts. For example the atmosphere user might “own” 10 suites, and the oceans user might “own” 5 suites etc etc.

Suite Owners (people who provide the expert support for that particular suite) have read-only access to their suites, via logging into the Cylc-GUI machine and using gscan and the list of cylc workflow hosts, then open the UI from there. They have read-only access, as defined by cylc’s idea of read-only (which means they can’t see the suite’s URL for example). The purpose of having a separate GUI machine is that multiple users using the GUI can be quite resource intensive, especially if any of them decide to look at the graph. All Suite Owners can see all running suites.

Support Staff and Systems Administrators connect to the Cylc-Control machine and then elevate their access by changing to the production account that runs the suite (eg oceans_prod). They then run gscan to get the list of cylc workflow hosts, then open the UI from there. They can then start/stop/pause suites, do edit runs, etc. All users with access to the Cylc-Control machine can see and affect all running suites (although they may need to log out of one production user and in as another which is a bit annoying).

Systems Administrators might choose to work directly from the command line as well directly on the cylc workflow machines. This is essential if they need to run a script that connects to each suite and pauses it prior to a PBS outage, for example.

I fully understand that how we use cylc-7 has been informed by how cylc-7 behaves. For example we wouldn’t need the cylc-gui or cylc-control machines if the GUI was light-weight and people “playing” with the graph view didn’t risk slowing production suites (for example). I am sure that how we use cylc-8 will also be informed by how cylc-8 behaves, but I suspect we’ll need a similar set of separations.

Hi @jarich,

Thanks for re-posting this here - I’m sure others will be interested (the developers certainly are).

I understand you’re mainly interested in the Cylc 8 architecture, but a few comments on your described Cylc 7 setup before moving on to your follow-up message:

… gives us some manual load-balancing capacity …

Cylc 7 has some built-in load balancing capability: cylc run can automatically select (based on basic load metrics) which of a pool of hosts to start the suite on. (They all have to share the filesystem, though.)

The purpose of having a separate GUI machine is that multiple users using the GUI can be quite resource intensive, especially if any of them decide to look at the graph.

and

we wouldn’t need the cylc-gui or cylc-control machines if the GUI was light-weight and people “playing” with the graph view didn’t risk slowing production suites (for example).

In Cylc 8 we plan on not showing the entire graph at once, at least not by default. And the “workflow services” (suite server programs in Cylc 7) will be largely protected from GUIs by the UI Servers.

(although they may need to log out of one production user and in as another which is a bit annoying).

(They could share the suite passphrase.)

Hilary

(now to your follow-up post…)

Hi again @jarich … (hmm maybe I got original post and follow-up the wrong way around - Discourse seems upside-down in this respect)

  1. Cylc8 is installed on all of the blue VMs, including the new light-blue Cylc8 virtual machines (plus, as necessary, on the HPCs).

As an aside, Cylc 8 is not a monolithic codebase like Cylc 7. The new Workflow Service (server program) plus CLI is now called cylc-flow and is just one of several components (also: the hub and proxy, cylc-ui, cylc-uiserver). cylc-flow will only need to be installed on the workflow hosts.

Systems Administrators directly install suites onto the workflow machines via the command line.

This can stay the same. We might conceivably make this sort of functionality available from the web UI eventually, but almost certainly not in the initial releases.

How do the UI servers get connected to installed (and potentially running) suites?

The UI Servers, which run “as the user” (and one per user) are spawned by the Hub, although they can also be started manually by the user (CLI). The UI Servers will be able to start Workflow Services running, or find already-running ones via cylc scan-like ability. Contrary to your diagram, the UI Servers should probably run on the workflow hosts. A user’s UI Server will hold the status data of all of that user’s workflows, taking incremental updates from them as they evolve. UIs will request information from the UI Server (so multiple UIs will not load up the Workflow Services).

I am not concerned if the remote UI servers are not possible, but earlier documentation suggested they are/were.

Definitely possible. As per my earlier comment, the central Hub spawns UI Servers as the user, on the workflow hosts (because the UI Server has to be able to start Workflow Services up, not just communicate with them over the network). There are several potential remote spawning mechanisms that we’ve already tested a little (but successfully), including ssh and PBS.

Suite Owners can log into JupyterHub as themselves and scan and find all available suites (just like logging into a machine and running gscan) even where they don’t have write-access to those suites (just like with gscan). They can filter by user/workflow-vm/other filterables, or search by suite name.

The intention is for authenticated users to be able to see other users’ workflows that they are authorized to see, but we don’t know yet how we’ll provide that capability. JupyterHub doesn’t provide any UI to see other users’ Jupyter Notebooks, not because it couldn’t do so in principle but because - unlike for Cylc - the Notebooks themselves do not support multi-user access. We have tested that a hub-authenticated user can in fact see other users’ UI Servers, but at the moment you have to manually switch the target user name in the URL. So we still need to figure out how best to do it. JupyterHub supports something called “hub services” that we may be able to use for this.

Suite owners cannot perform edit runs, change the suite.rc.processed via edit functionality or do any other edits on production suites (because they cannot raise their privileges to the realm account).

Correct, no user should be able to do anything that they’re not authorized to do.

Suite Owners don’t have to log out and log back in again or clear their cookies to view other suites running as different production users because it’s just read-only (just like with gscan and the cylc GUI).

Yes, that’s what we’re aiming for.

Support Staff can connect to JupyterHub and scan and find all available suites, with the same filtering as above, and make limited changes (pause, restart, change environment variables in edit runs…) to all the suites they have permission to affect.

Yes, “support staff” will presumably be defined (as far as Cylc is concerned) by what they are authorized to do.

[AD roles and authorization etc.]

The Hub handles authentication for us, via plugins. We’ve only tested PAM so far, but I don’t expect to have a problem with other authentication systems.

We haven’t started work on authorization yet, so this is still speculation, but we hope to have a two-level authorization system. As an authenticated user, the hub should only let you see Workflow Services that you are authorized to see (perhaps via a site auth config file, that could refer to Active Directory roles or whatever). Then, if you are allowed through by the Hub, the suites (/workflow services) will also check that you are authorized for the particular request or command. Initially at least, authorization might be controlled by simple text config files that are read by a Hub authorization service, and the suites (and/or UI Server).
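To make the two-level idea concrete, here is a minimal sketch only. The config layout, role names, and function names are all hypothetical (nothing here is Cylc or JupyterHub API); it just shows a Hub-level visibility check followed by a per-command check at the workflow level.

```python
# Hypothetical sketch: none of these names or structures come from Cylc.

# Level 1 (site/Hub): which owners' workflows each role may see at all.
SITE_AUTH = {
    "atmosphere_support": {"atmosphere"},
    "ocean_support": {"atmosphere", "ocean"},
}

# Level 2 (workflow): which commands each user may run, per workflow owner.
WORKFLOW_AUTH = {
    "atmosphere": {"bob": {"pause", "restart"}, "jane": {"pause"}},
    "ocean": {"jane": {"pause", "restart"}},
}


def hub_allows(roles, owner):
    """Level 1: does any of the user's roles grant visibility of this owner?"""
    return any(owner in SITE_AUTH.get(role, set()) for role in roles)


def workflow_allows(user, owner, command):
    """Level 2: has the workflow owner granted this user this command?"""
    return command in WORKFLOW_AUTH.get(owner, {}).get(user, set())


def authorize(user, roles, owner, command):
    """A request must pass both levels to be accepted."""
    return hub_allows(roles, owner) and workflow_allows(user, owner, command)
```

In this sketch a user who passes the site-level check can still be refused a particular command by the workflow-level table, which is the property the two-level design is after.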

It’s not clear to me what additional capability your “acting-as” user suggestion would give us? I envisage suite owner A granting temporary authorization to other user B by simply altering their own auth config file accordingly. (And of course other user B would have to be able to get through the site/Hub auth first, so sites can restrict what is possible here).

Systems Administrators can do everything Support Users can and more, with full GUI functionality as well using command line access when necessary.

Yes, if authorized to do so as far as Cylc is concerned. E.g. in the Cylc site config file, specify that users in the “admin” group can control all suites.

My speculation on authorization via AD roles probably assumes that we can get relevant group/role membership info back from the authentication service for an authenticated user (I haven’t checked that yet).

Suite interactions via the command line also check for user authorisation – with the same “user” and “acting-as user” style plugin or equivalent.

The CLI will definitely need authorization too, but we may need (or want) to distinguish between suite job client, suite owner, other-user in terms of the mechanism. This is still to be worked out though.

From a user (suite-owner, systems administrator, support staff) point of view, given the above, what are the UI servers? Who needs to know about them?

The UI Servers literally serve the UI. There is one per user, normally (but not necessarily) spawned by the Hub. Each UI Server holds the status data of one or more workflows, and takes incremental updates from the workflows as they evolve. The UI gets status data from the UI Server, not from the workflows (although commands have to be passed through to them). We think UI Servers can be made pretty much transparent to users and admins. The Hub spawns (or re-spawns) them on demand, and they don’t need to exist if/when no one is looking at their workflows.
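As a rough illustration of "holds the status data and takes incremental updates", the sketch below keeps a per-workflow task-state map and applies deltas to it. The class name, the delta format, and the use of None to mean "task removed" are assumptions made for the example, not how cylc-uiserver is actually implemented.

```python
class StatusStore:
    """Hypothetical in-memory store of workflow status held by a UI Server."""

    def __init__(self):
        # workflow id -> {task name -> state}
        self._workflows = {}

    def apply_delta(self, workflow_id, delta):
        """Merge an incremental update; a state of None removes the task."""
        tasks = self._workflows.setdefault(workflow_id, {})
        for task, state in delta.items():
            if state is None:
                tasks.pop(task, None)
            else:
                tasks[task] = state

    def snapshot(self, workflow_id):
        """Return a copy for a UI, so UIs never query the workflow itself."""
        return dict(self._workflows.get(workflow_id, {}))


store = StatusStore()
store.apply_delta("ocean/run1", {"prep": "succeeded", "model": "running"})
store.apply_delta("ocean/run1", {"model": "succeeded", "prep": None})
print(store.snapshot("ocean/run1"))
```

Any number of UIs can call snapshot() without adding load on the workflow, which is the protection described above.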

Hilary

Just to note, there’s a lot of complex work-in-progress info here. Others on the Cylc team may want to chip in to correct me or add to what I’ve said in places.

My thoughts (again, all subject to challenge) …

That’s one way of doing it but we may prefer to have the UI servers running on separate hosts - I think it’s too early to say. The UI servers will need access to the cylc-run directory, preferably via a shared filesystem but we may be able to support ssh access as well?

The CLI will be able to interact with the workflow servers direct but I assume this will only work when you are logged in as the suite owner and if you have access to the cylc-run directory. We also want the CLI to be able to go via the hub, in which case it should be possible to interact with other users’ suites (with appropriate authentication and authorisation) but this may prove tricky?

You don’t “act” as another user. You always remain logged in as yourself. You can interact with another user’s suites if you have the appropriate authorisation. Any actions you perform on another user’s suites will always be logged with your user id.


That’s one way of doing it but we may prefer to have the UI servers running on separate hosts …

Yes I was a bit loose with the term “workflow hosts” - I just meant, one of the pool of “cylc hosts” that sees the same filesystem (where the suite directories are). However, you could choose to reserve specific hosts just for the UI Servers, and others just for the Workflow Services (/suite server programs).

In principle we could support UI Servers on other hosts that don’t see the suite directories, via ssh as you say, but I’m not aware of any good reason to do that at the moment.

Agreed. I don’t think we should support this initially. However, @matthewrmshin was keen that we design the system such that we could support this in the future.

This is certainly an improvement. :slight_smile:

In our current setup, we run the suites in production as non-user realm accounts (eg oceans_prod) and we don’t share the passphrases liberally, although we could. Consequently, actions on the suites have thus far required the user performing the action to be able to a) log onto the workflow machine that the daemon is running on, and b) sudo to that non-user realm account to perform the required actions. This has meant that identification of who did what requires looking beyond cylc’s data and into sudo logs etc.

But a robust authentication/authorisation system as planned would indeed alleviate that need entirely. If we can allow Bob to do actions A, B and C on this suite, then it doesn’t matter which user the suite is running as; Bob can do their actions. Likewise we might allow Jane to do actions C, D, and E on the same suite; and when we pass the user and host information back through our event handlers, we will be able to easily differentiate whether it was Bob or Jane who did C to that suite at that moment in time. :slight_smile:

Sorry in advance folks (1st post),
I have been scanning around trying to find a thread that fit the issue I am seeing and this was the best I could find: file permissions with the Cylc 8 JupyterHub. We are trying to set up a Cylc 8 instance with a Cylc GUI on a separate node connecting to Cylc-flows that run on our cluster login nodes. Those in turn post jobs to SLURM and run on compute like most folks do. The user’s cylc-run directory is located in the user’s HOME directory on a separate NetApp instance accessible from all. This is the issue we are seeing with trying to correctly configure permissions…

Does the “cylc hubapp” process get forked with different permissions than a shell running as the user?

The log shows
2023-08-02T08:29:13.600821-04:00 some.url.gov cylc-hub.sh[161208]:      File "/usr/lib64/python3.9/pathlib.py", line 1232, in stat
2023-08-02T08:29:13.600821-04:00 some.url.gov cylc-hub.sh[161208]:        return self._accessor.stat(self)
2023-08-02T08:22:25.697313-04:00 some.url.gov cylc-hub.sh[161208]:    PermissionError: [Errno 13] Permission denied: '/ccs/home/user/cylc-run/test-flow/run1'

The process logging the error is:
user 161208 17592 94 Jul31 ? 1-18:36:07 /autofs/nccs-svm1_afw_sw/afw-system/cylc/8.1.4/bin/python3.9 /autofs/nccs-svm1_afw_sw/afw-system/cylc/8.1.4/bin/cylc hubapp

The path that is showing “Permission denied” (the one named in the PermissionError above) is owned by the user running the cylc hubapp, and a shell as that user has no problem reading or modifying it.

What is different about the hubapp access as opposed to a user shell?

Hi,

The cylc hubapp is a simple process which runs as the user; its permissions should be the same as any other process running under the user account.

I’m not quite sure how to debug from that log output. It would be good to see the Jupyter Hub log output (this contains details of errors when starting servers) and, if the server was able to start, the Cylc UI Server log ~/.cylc/uiserver/log (this contains initiation and runtime errors at the Cylc end).

When you’re using Jupyter Hub, servers are spawned on behalf of users by the account that Jupyter Hub is running under. This account requires the permission to run the spawn command as the user in question. Note that user accounts must match between the server where Jupyter Hub is running and the server where servers are spawned, unless special spawner magic is being used.

There are various spawner implementations to choose from with differing requirements, see the Jupyter Hub docs for more details:

Appreciate the quick response. The guys are looking at Spawners. Could the fact that the cylc-run dir sits on the NetApp and is autofs mounted have any bearing on any of this? I’ll do my best to cut down on the stuff we need to redact, so bear with us. This is the error we see occurring about every 2 mins in the log. Any advice regarding this would be welcome. (JupyterServer 1.24) (Cylc v8.1.4)

2023-07-31T11:33:58.280737-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.280 CylcHubApp mixins:635] Starting jupyterhub-singleuser server version 4.0.1
2023-07-31T11:33:58.479305-04:00 some.url.gov cylc-hub.sh[17592]: [I 2023-07-31T11:33:58.479 JupyterHub log:191] 200 GET /hub/api (@127.0.0.1) 0.41ms
2023-07-31T11:33:58.479713-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.479 CylcHubApp serverapp:2686] Serving notebooks from local directory: /autofs/path/USER
2023-07-31T11:33:58.479713-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.479 CylcHubApp serverapp:2686] Jupyter Server 1.24.0 is running at:
2023-07-31T11:33:58.479847-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.479 CylcHubApp serverapp:2686] http://127.0.0.1:53279/user/USER/cylc
2023-07-31T11:33:58.479847-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.479 CylcHubApp serverapp:2686]  or http://127.0.0.1:53279/user/USER/cylc
2023-07-31T11:33:58.479847-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.479 CylcHubApp serverapp:2687] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
2023-07-31T11:33:58.487910-04:00 some.url.gov cylc-hub.sh[161208]: [I 2023-07-31 11:33:58.487 CylcHubApp mixins:529] Updating Hub with activity every 300 seconds
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]: [E 2023-07-31 11:33:58.488 CylcUIServer] [Errno 13] Permission denied: '/ccs/home/USER/cylc-run/mvtk-live/run1'
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:    Traceback (most recent call last):
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/uiserver/workflows_mgr.py", line 432, in run
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        await self.update()
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/uiserver/workflows_mgr.py", line 307, in update
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        async for wid, before, after, flow in self._workflow_state_changes():
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/uiserver/workflows_mgr.py", line 185, in _workflow_state_changes
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        async for flow in self._scan_pipe:
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/flow/async_util.py", line 97, in __aiter__
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        async for item in meth(running, completed):
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/flow/async_util.py", line 128, in _ordered
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        ind = task.result()
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/flow/async_util.py", line 152, in _generate
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        async for item in gen.func(*gen.args, **gen.kwargs):
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/flow/network/scan.py", line 275, in scan
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        _scan_subdirs(contents, depth)
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/autofs/path/cylc/8.1.4/lib64/python3.9/site-packages/cylc/flow/network/scan.py", line 233, in _scan_subdirs
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        if subdir.is_dir() and subdir.stem not in EXCLUDE_FILES:
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/usr/lib64/python3.9/pathlib.py", line 1439, in is_dir
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        return S_ISDIR(self.stat().st_mode)
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:      File "/usr/lib64/python3.9/pathlib.py", line 1232, in stat
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:        return self._accessor.stat(self)
2023-07-31T11:33:58.491633-04:00 some.url.gov cylc-hub.sh[161208]:    PermissionError: [Errno 13] Permission denied: '/ccs/home/USER/cylc-run/mvtk-live/run1'
2023-07-31T11:33:58.499864-04:00 some.url.gov cylc-hub.sh[17592]: [I 2023-07-31T11:33:58.499 JupyterHub log:191] 200 POST /hub/api/users/USER/activity (USER@127.0.0.1) 6.90ms

Hi,

That traceback helps. This error is coming from the “scan” routine which Cylc uses to detect workflows by looking in the user’s cylc-run directory. The routine recursively searches ~/cylc-run for things that look like workflows; it’s essentially a fancy way of doing find ~/cylc-run -name flow.cylc.
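For readers who want to see the shape of such a scan, here is a simplified stand-in. It is not Cylc's scan implementation: the function name, the depth limit, and the choice to skip unreadable entries rather than raise are all assumptions of this sketch.

```python
from pathlib import Path


def find_workflows(root, max_depth=4):
    """Return directories under root that contain a flow.cylc file.

    Rough equivalent of `find root -name flow.cylc`, reporting the parent
    directory of each match. Unreadable entries are skipped (a choice made
    for this sketch only).
    """
    found = []

    def walk(directory, depth):
        if depth > max_depth:
            return
        try:
            entries = sorted(directory.iterdir())
        except OSError:
            # Covers PermissionError and a missing directory.
            return
        if any(entry.name == "flow.cylc" for entry in entries):
            found.append(directory)
            return  # do not descend into a workflow run directory
        for entry in entries:
            try:
                is_dir = entry.is_dir()  # follows symlinks, calls stat()
            except PermissionError:
                continue
            if is_dir:
                walk(entry, depth + 1)

    walk(Path(root), 0)
    return found
```

The is_dir() call is the same pathlib call that appears in the traceback; here it is wrapped so that one inaccessible directory does not abort the whole scan.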

The routine is failing because it’s unable to access a workflow directory: /ccs/home/USER/cylc-run/mvtk-live/run1

This is interesting because it means that Cylc was able to access the parent directory /ccs/home/USER/cylc-run/mvtk-live, otherwise it wouldn’t have known to look for run1.

This suggests that mvtk-live/run1 has different permissions to mvtk-live.

The PermissionError corresponds to EACCES, EPERM, and ENOTCAPABLE.
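As a quick way to see that mapping, the snippet below builds an OSError from an errno value and reports which exception class Python chose. On Linux, EACCES is 13, which matches the "[Errno 13]" in the traceback. The describe helper and the path are made up for the illustration.

```python
import errno
import os


def describe(exc):
    """Return the symbolic errno name and exception class for an OSError."""
    return errno.errorcode.get(exc.errno, "?"), type(exc).__name__


# OSError selects the matching subclass from the errno it is given.
denied = OSError(errno.EACCES, os.strerror(errno.EACCES), "/some/path")
print(describe(denied))
```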

Thanks for the insight Oliver,
What is important to note here is that /ccs/home/USER/cylc-run/mvtk-live/run1 is a link to another filesystem that provides access to compute nodes, but /ccs/home/USER/cylc-run/mvtk-live and /ccs/home/USER/cylc-run/mvtk-live/run1 (at the other end of the link) are both owned by the same user and are 755.

server1:~ # ll -d /ccs/home/USER/cylc-run/test-flow
drwxr-xr-x 3 USER USER 4096 Jul 24 15:40 /ccs/home/USER/cylc-run/test-flow

server1:~ # ll -d /ccs/home/USER/cylc-run/test-flow/run1
lrwxrwxrwx 1 USER USER 68 Jul 24 15:40 /ccs/home/USER/cylc-run/test-flow/run1 -> /lustre/path/USER/cylc8/cylc-run/test-flow/run1

server1:~ # ll -d /ccs/home/USER/cylc-run/test-flow/run1/
drwxr-xr-x 6 USER USER 33280 Jul 24 15:40 /ccs/home/USER/cylc-run/test-flow/run1/

Does this help identify what we are missing?

Linking to another filesystem like that should be fine (we do something similar).

I’m not sure what’s going wrong here. The error is originating from Path.is_dir; we would expect this call to return True or False, not raise PermissionError. From your traceback, the OS-level origin of the error is a stat call which is made as part of the is_dir routine.

Unfortunately, this could be the result of a filesystem/configuration/authentication issue of some form.

This code snippet should be enough to replicate the traceback (without using the Hub / UIS):

# run this code as the user in the Cylc environment,
# from the host where the "cylc hubapp" was launched to replicate
from pathlib import Path

print(Path('/ccs/home/USER/cylc-run/mvtk-live/run1').is_dir())

If you’re able to replicate the error using this snippet, I have one hunch: try resolving the path before running is_dir. This shouldn’t help (it would be a Python bug if it does) but might be worth trying anyway:

print(Path('/ccs/home/USER/cylc-run/mvtk-live/run1').resolve().is_dir())

On the off-chance that does work, this patch may resolve the issue:

diff --git a/cylc/flow/network/scan.py b/cylc/flow/network/scan.py
index 54222a510..51e5ac18b 100644
--- a/cylc/flow/network/scan.py
+++ b/cylc/flow/network/scan.py
@@ -230,10 +230,10 @@ async def scan(
 
     def _scan_subdirs(listing: List[Path], depth: int) -> None:
         for subdir in listing:
-            if subdir.is_dir() and subdir.stem not in EXCLUDE_FILES:
+            if subdir.resolve().is_dir() and subdir.stem not in EXCLUDE_FILES:
                 running.append(
                     asyncio.create_task(
-                        _scandir(subdir, depth + 1)
+                        _scandir(subdir.resolve(), depth + 1)
                     )
                 )
 

Here are the results Oliver…

cylc01:~ # cat /tmp/test1
#!/usr/bin/python3.9

from pathlib import Path

print(Path('/ccs/home/jaworsks/cylc-run/test-flow/run1').is_dir())
print(Path('/ccs/home/jaworsks/cylc-run/test-flow/run1').resolve().is_dir())

cylc01:~ # /tmp/test1
True
True

server1:~ # su - USER -c !!
su - USER -c /tmp/test1
Traceback (most recent call last):
  File "/tmp/test1", line 5, in <module>
    print(Path('/ccs/home/USER/cylc-run/test-flow/run1').is_dir())
  File "/usr/lib64/python3.9/pathlib.py", line 1439, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "/usr/lib64/python3.9/pathlib.py", line 1232, in stat
    return self._accessor.stat(self)
PermissionError: [Errno 13] Permission denied: '/ccs/home/USER/cylc-run/test-flow/run1'

server1:~ # su - USER -c /tmp/test1
True
True

What exactly is this telling me Oliver?

What exactly is this telling me Oliver?

The point of the test is:

  1. To confirm that the error can be reproduced without Jupyter Hub / Server.
  2. To see whether the resolve fixes the issue.

Because Python bails on the first error, you’ll need to put the statement with the resolve above the one without in order to prove/disprove (2).
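An alternative to reordering the statements is to run each check inside its own try/except so that both results are always reported. This is only a sketch; the check and run_checks helpers are made up for the example and are not part of Cylc.

```python
from pathlib import Path


def check(label, func):
    """Run one check and report its result or the error it raised."""
    try:
        return f"{label}: {func()}"
    except OSError as exc:
        return f"{label}: {type(exc).__name__}: {exc}"


def run_checks(path):
    """Try is_dir() with and without resolve(), independently of each other."""
    p = Path(path)
    return [
        check("resolve().is_dir()", lambda: p.resolve().is_dir()),
        check("is_dir()", lambda: p.is_dir()),
    ]


if __name__ == "__main__":
    import sys

    # Pass the path to test as the first argument; defaults to the current dir.
    for line in run_checks(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(line)
```

Run as the affected user, with the workflow run directory as the argument, this prints one line per check even if the first one fails.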

From what I can see, you’ve hit the PermissionError on server1 but not on cylc01? If so, that means the same operation resulted in different outcomes on different servers, which suggests there is an issue with the way the filesystem is mounted.

Sorry,
I am doing this second hand…

Let me try again. This is what it should state…

The two runs are as USER1 and USER2. USER1 should not have any permissions to access the file, so the error is expected. USER2 is the user that owns the file and was seeing the errors in the workflows; this test shows it can access the file normally outside of cylc, but fails in the hubapp within cylc.

server1:~ # cat /tmp/test1
#!/usr/bin/python3.9

from pathlib import Path

print(Path('/ccs/home/USER2/cylc-run/test-flow/run1').is_dir())
print(Path('/ccs/home/USER2/cylc-run/test-flow/run1').resolve().is_dir())

server1:~ # /tmp/test1
True
True

server1:~ # su - USER1 -c !!
su - USER1 -c /tmp/test1
Traceback (most recent call last):
  File "/tmp/test1", line 5, in <module>
    print(Path('/ccs/home/USER2/cylc-run/test-flow/run1').is_dir())
  File "/usr/lib64/python3.9/pathlib.py", line 1439, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "/usr/lib64/python3.9/pathlib.py", line 1232, in stat
    return self._accessor.stat(self)
PermissionError: [Errno 13] Permission denied: '/ccs/home/USER2/cylc-run/test-flow/run1'

server1:~ # su - USER2 -c /tmp/test1
True
True

Ok, so if I’m understanding correctly, USER2 saw the error when accessing the cylc hubapp, which was launched by Jupyter Hub on their behalf on server1.

If so then the last test shows the operation works when run standalone which suggests that there’s something funny going on somewhere between Jupyter Hub and the routine in question which is causing the problem. This is very confusing.

As a further confirmation, you could try running cylc scan --states=all which calls the same scan routine that the cylc hubapp does which will help narrow down the search a little further (it’s much easier to debug issues with Cylc than issues with JupyterHub so worth double checking that this is definitely a hub/hubapp problem before going down that route).

If it’s a Hub/Hubapp problem, here’s my best guess on how to proceed…

I’m presuming you’re running Jupyter Hub under a privileged account so that it can spawn servers on behalf of users. The cylc hubapp process is launched by the configured Jupyter Hub spawner. If you haven’t configured a spawner for this deployment, the default is jupyterhub.spawner.LocalProcessSpawner (if you have configured a different spawner, you’ll need to check the fine details). As this is the part of the system which spawns servers across accounts, it’s the most likely cause of differing behaviour.

The next step will be to investigate the cylc hubapp process from which the issue was reported to ensure it was spawned as expected. Here are the things I can think of which might differ between a command run directly vs one spawned via Jupyter Hub, in descending order of likelihood:

  • User! Some of the more exotic spawners allow you to run servers under a different user account for use cases where the user accounts on the server you’re spawning on aren’t the same as the user accounts on the server Jupyter Hub is running on. E.g. if you’re using the SSHSpawner, the user could configure a different user account via their SSH config file authenticated via SSH keys.
  • Shell. The LocalProcessSpawner launches the process via a Python Subprocess. This isn’t the same as running the process via Bash (or whichever POSIX shell you’re using). There does appear to be an option to get LocalProcessSpawner to launch a normal shell and get that shell to launch the process if shell startup logic is required.
  • Environment variables. I’m not sure what environment variables could cause a PermissionError to arise from a stat call, but, if there are any special env vars you need to be set, you can use spawner.kwargs["env"] to configure them.
  • Process group maybe? I’m not sure why that would matter though.

Oliver, appreciate all of the help.

• The hubapp runs as root.
• Our user and group configurations are in ldap
• Examining /proc/ for a user vs a hub process running as that user, we can see that the groups that the user belongs to for the hub application is only the primary user group – it doesn’t contain any of the secondary groups needed to access the filesystem
• This posting appears to match what we’re seeing Jupyterhub Not Recognizing Users' Groups · Issue #1107 · jupyterhub/jupyterhub · GitHub
• We are using the default LocalProcessSpawner
• Setting up a custom spawner as a subclass of LocalProcessSpawner, as suggested in that posting, appears to have fixed the problem
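For anyone who wants to repeat the group comparison from inside Python rather than by reading /proc, a minimal sketch follows. It compares the groups the current process actually holds with the groups the system's user database reports for the same user. The helper names are made up for the example; os.getgroups, os.getgrouplist and pwd.getpwuid are standard-library calls available on Unix.

```python
import os
import pwd


def missing_groups(process_gids, expected_gids):
    """Return the expected group ids that the process does not hold."""
    return sorted(set(expected_gids) - set(process_gids))


def expected_gids_for_current_user():
    """Group ids the user database (e.g. LDAP via NSS) lists for this user."""
    entry = pwd.getpwuid(os.getuid())
    return os.getgrouplist(entry.pw_name, entry.pw_gid)


if __name__ == "__main__":
    try:
        expected = expected_gids_for_current_user()
    except KeyError:
        # No passwd entry for this uid (can happen in some containers).
        expected = []
    # Supplementary groups plus the effective gid of this process.
    held = set(os.getgroups()) | {os.getegid()}
    print("groups missing from this process:", missing_groups(held, expected))
```

Run once from a login shell and once from a process spawned by the Hub (as the same user), a non-empty list in the second case would point to the same missing secondary groups described above.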