Cylc 8.0b1: [Errno 11] Resource temporarily unavailable on NCAR Cheyenne

I’m trying to run a pretty basic suite on NCAR Cheyenne that submits a Python job to the queue via PBS, and it generates the following errors:

(cylc8b1) bcash@cheyenne5:~/cylc-run/L96> crun L96/run2
                                                               ._.                                                     
                                                               | |                                                     
                                                   ._____._. ._| |_____.                                               
                                                   | .___| | | | | .___|         The Cylc Workflow Engine [8.0b1]      
                                                   | !___| !_! | | !___.           Copyright (C) 2008-2021 NIWA        
                                                   !_____!___. |_!_____!   & British Crown (Met Office) & Contributors.
                                                         .___! |                                                       
                                                         !_____!                                                       
2021-05-20T15:12:27-06:00 ERROR - [Errno 11] Resource temporarily unavailable: '/glade/u/home/bcash/cylc-run/L96/run2/.service/db' -> '/glade/u/home/bcash/cylc-run/L96/run2/log/dbrsqoeuu2'
        Traceback (most recent call last):
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/site-packages/cylc/flow/scheduler.py", line 621, in run
            await self.configure()
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/site-packages/cylc/flow/scheduler.py", line 429, in configure
            self.suite_db_mgr.on_suite_start(self.is_restart)
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/site-packages/cylc/flow/suite_db_mgr.py", line 196, in on_suite_start
            self.copy_pri_to_pub()
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/site-packages/cylc/flow/suite_db_mgr.py", line 128, in copy_pri_to_pub
            copy(self.pri_dao.db_file_name, temp_pub_db_file_name)
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/shutil.py", line 418, in copy
            copyfile(src, dst, follow_symlinks=follow_symlinks)
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/shutil.py", line 275, in copyfile
            _fastcopy_sendfile(fsrc, fdst)
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/shutil.py", line 172, in _fastcopy_sendfile
            raise err
          File "/glade/u/home/bcash/.conda/envs/cylc8b1/lib/python3.8/shutil.py", line 152, in _fastcopy_sendfile
            sent = os.sendfile(outfd, infd, offset, blocksize)
        BlockingIOError: [Errno 11] Resource temporarily unavailable:
      '/glade/u/home/bcash/cylc-run/L96/run2/.service/db' -> '/glade/u/home/bcash/cylc-run/L96/run2/log/dbrsqoeuu2'
2021-05-20T15:12:27-06:00 CRITICAL - Suite shutting down
   - [Errno 11] Resource temporarily unavailable:
        '/glade/u/home/bcash/cylc-run/L96/run2/.service/db' -> '/glade/u/home/bcash/cylc-run/L96/run2/log/dbrsqoeuu2'
2021-05-20T15:12:27-06:00 INFO - DONE

My flow.cylc is:

[meta]
    title = "Basic cylc suite for executing L96 model"
[scheduling]
    [[dependencies]]
        graph = "L96"
[runtime]
    [[root]]
        platform = cheyenne
    [[L96]]
        script = "python /glade/work/bcash/ssf/L96/L96_roxie.py"
        [[[directives]]]
            -A = UMIN0005

And my global.cylc is:

 [platforms]
     [[cheyenne]]
         job runner = qsub
         job runner command template = qsub %(job)s

I assume I’m making some kind of novice error with 8.0b1 but don’t know what it is. :slight_smile:

Hi @bencash,

The good news: that does not look like user error.

The bad news: it does not look like a Cylc bug either.

Does your platform use the IBM GPFS (Spectrum Scale) filesystem, perchance? I ask because your error message …

...
sent = os.sendfile(outfd, infd, offset, blocksize)
BlockingIOError: [Errno 11] Resource temporarily unavailable:
  '~/cylc-run/L96/run2/.service/db' -> '~/cylc-run/L96/run2/log/dbrsqoeuu2'
2021-05-20T15:12:27-06:00 CRITICAL - Suite shutting down

… looks like this filesystem bug: IJ28891: SENDFILE SYS CALL CAN CAUSE BLOCKINGIOERROR

(It’s occurring when Cylc tries to copy the private workflow DB to a temporary file, which will then be atomically renamed to create the public DB.)
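
For anyone unfamiliar with the copy-then-rename pattern mentioned above, here is a minimal Python sketch of the general idea. It is not Cylc's actual code: the function name publish_copy and the "db" prefix for the temporary file are assumptions made for illustration only.

```python
import os
import shutil
import tempfile


def publish_copy(private_path, public_path):
    """Copy private_path to public_path without exposing a partial file.

    Hypothetical helper for illustration; not taken from Cylc.
    """
    public_dir = os.path.dirname(os.path.abspath(public_path))
    # Create the temporary file in the destination directory so that the
    # final rename stays within a single filesystem.
    fd, temp_path = tempfile.mkstemp(prefix="db", dir=public_dir)
    os.close(fd)
    try:
        # This is the kind of call that raised Errno 11 in the traceback.
        shutil.copy(private_path, temp_path)
        # os.replace() swaps the finished copy into place in one step
        # (atomic on POSIX filesystems).
        os.replace(temp_path, public_path)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
    return public_path
```

The point of the pattern is that readers of the public file only ever see a complete copy. The failure in this thread happens during the copy step, before any rename takes place.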

In which case, time to call in your local HPC system experts :grimacing:

Hilary

Thanks @hilary.j.oliver. @Jim - have you run into this problem on Cheyenne/Casper?

Update: If I start an interactive session on the NCAR Casper system, install a new hello_world suite, and run it there I do not hit this error. Casper is seeing the same file system as Cheyenne - is that consistent with it being the filesystem bug? It seems like it wouldn’t be but I’m definitely out of my depth there.

@hilary.j.oliver - Further update. I reverted to 7.8.3 which is installed centrally on Cheyenne and it ran without this error.

Huh, that is interesting: Cylc 7 and 8 do exactly the same DB copy at start-up.

I think the relevant thing that’s changed between 7.8.3 and 8.0b1 is the Python version (and the Python standard library) … 2.7 to 3.8.

From your traceback above, you can see that Cylc 8.0b1 calls the copy() function from the shutil module. Then, under the hood, shutil calls os.sendfile(), and the Python documentation for the os module says that function was new in Python 3.3.
So presumably shutil.copy() must do things differently at a lower level in Python 2.7.

If that is what’s going on, you might be able to reveal the bug independently of Cylc: a trivial program using shutil.copy() might run successfully at Python 2.7 and fail at 3.8. The IBM bug link above suggests the problem depends on file size, so you might need to use the actual Cylc DB as the source file (~/cylc-run/L96/run2/.service/db).
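
A test along those lines might look like the sketch below. The function name try_copies and the script name copy_test.py are made up for illustration. It is written to run under both Python 2.7 and Python 3, and it compares shutil.copy() with an explicit read/write loop via shutil.copyfileobj(), on the assumption that the latter does not go through os.sendfile(). (The Python 3.8 release notes describe shutil's copy functions switching to platform-specific fast-copy system calls on Linux, which would be the change of interest here.)

```python
import shutil
import sys


def try_copies(src, dst):
    """Copy src to dst two ways and report which attempts succeed."""
    results = {}

    # Attempt 1: shutil.copy(), the call that appears in the traceback.
    try:
        shutil.copy(src, dst)
        results["shutil.copy"] = "ok"
    except (IOError, OSError) as exc:
        results["shutil.copy"] = "failed: %s" % exc

    # Attempt 2: a plain read/write loop over open file objects.
    try:
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            shutil.copyfileobj(fsrc, fdst)
        results["shutil.copyfileobj"] = "ok"
    except (IOError, OSError) as exc:
        results["shutil.copyfileobj"] = "failed: %s" % exc

    return results


if __name__ == "__main__":
    if len(sys.argv) == 3:
        print("Python %s" % sys.version.split()[0])
        outcomes = try_copies(sys.argv[1], sys.argv[2])
        for name, outcome in sorted(outcomes.items()):
            print("%s: %s" % (name, outcome))
    else:
        print("usage: python copy_test.py SOURCE DESTINATION")
```

Run it once per Python version, with the Cylc DB as SOURCE and a DESTINATION on the same filesystem, then repeat with some other source file for comparison.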

@hilary.j.oliver

I got a chance to test this and I can confirm your hypothesis: shutil.copy() fails on the db file when I use Python 3.8 or higher (I didn’t test this exhaustively, but 3.8.10 and 3.9.4 both failed). Using 3.7.9 and 2.7 both worked. Using a different file worked for all versions, and this was true whether I used my cylc8b1 environment or a new clean environment that just had 3.9.4.

Looks like it is time to ping NCAR support…

@hilary.j.oliver - I got a response from NCAR on this, which leads to a new question.

"Hello Ben,

Thanks for checking with us. Indeed, you have stumbled upon a file system bug that is triggered by recent versions of Python. This is actually why I haven’t installed a version of Python newer than 3.7.9 on the systems, as >= 3.8 will trigger this bug on GPFS file systems like GLADE. Obviously IBM is aware of the issue but it does not look like it has been fixed at present.

For the time being, the only fix is to use a version of Python <= 3.7 (by specifying the Python version at the time you create your conda environments).

Regards,
Brian"

My question then - can I run cylc8 using 3.7.9 or is it currently DOA on GPFS? :frowning:

Cylc 8 requires Python >= 3.7 so you should be good to go, after downgrading Python in your conda environment.

Let us know if that works!

I was able to get my hello world to run. :slight_smile: Still can’t get it to submit to the batch scheduler but that is another problem for another day…
