Cylc 7.8.4 starts with a blank GUI and shows status "stopped with submitted"

Hi,

I’m running this tutorial suite with Cylc 7.8.4 on a Cray XC50 box.

I need help with a weird issue: Cylc started with a blank GUI and showed the status “stopped with submitted”.

[cylc]

[scheduling]
    [[dependencies]]
        graph = "foo => BAR"
[runtime]
    [[foo]]
        script = "/usr/bin/sleep 10" # fast
        [[[job]]]
             batch system = background

    [[BAR]]
        [[[job]]]
             batch system = pbs
        [[[remote]]]
             host = $(rose host-select "hpc")
        [[[directives]]]
             -l select=1:ncpus=40:mem=190GB
             -l walltime=00:01:00
             -q = workq
             -W umask=022

    {% for FC in FCTIMES %}
    [[bar_{{FC}}]]
        inherit = BAR
        script = "aprun -n 40 /usr/bin/sleep 10"
        [[[directives]]]
            -N = bar_{{FC}}
    {% endfor %}

This suite used to run successfully.

It recently began failing to run: it starts with a blank GUI, with these symptoms:

  1. The status in the lower-left corner showed “stopped with submitted”.

  2. The status in the lower-right corner showed “not connected”.

  3. ps showed that the Cylc daemon was still running:

    jliu 34801 0.1 0.0 1071716 40324 ? Sl 15:49 0:00 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-run remote -vvv
    jliu 34820 0.3 0.0 588544 74196 pts/12 Sl 15:49 0:01 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-gui remote

  4. netstat showed that it was also listening on the appropriate port:

    jliu@elogin1:remote $ netstat -tulpn | grep 430
    (Not all processes could be identified, non-owned process info
    will not be shown, you would have to be root to see it all.)
    tcp 0 0 10.8.0.5:43054 0.0.0.0:* LISTEN -
    tcp 0 0 10.8.0.5:43032 0.0.0.0:* LISTEN -
    tcp 0 0 10.8.0.5:43041 0.0.0.0:* LISTEN 34801/python2
    tcp 0 0 10.8.0.5:43051 0.0.0.0:* LISTEN -

  5. log/job/1/foo/NN/job.status showed that ‘foo’ succeeded, but job.err contained this error message:

    2020-03-30T15:49:14+08:00 WARNING - Message send failed, try 1 of 7: Cannot connect: https://elogin1:43041/put_messages: HTTPSConnectionPool(host='elogin1', port=43041): Max retries exceeded with url: /put_messages (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')",),))
    retry in 5.0 seconds, timeout is 30.0

  6. The remaining tasks were never triggered.

  7. Running ‘cylc graph remote’ plots the run-time graph without any problem.

  8. Here is the suite log:

    2020-03-30T15:49:11+08:00 DEBUG - Reading file /home/users/gov/nea/jliu/cylc-run/remote/suite.rc
    2020-03-30T15:49:11+08:00 DEBUG - Processing with Jinja2
    2020-03-30T15:49:11+08:00 DEBUG - Processed configuration dumped: /home/users/gov/nea/jliu/cylc-run/remote/suite.rc.processed
    2020-03-30T15:49:11+08:00 DEBUG - Section already encountered: [cylc]
    2020-03-30T15:49:11+08:00 DEBUG - Expanding [runtime] namespace lists and parameters
    2020-03-30T15:49:11+08:00 DEBUG - Parsing the runtime namespace hierarchy
    2020-03-30T15:49:12+08:00 DEBUG - Parsing [special tasks]
    2020-03-30T15:49:12+08:00 DEBUG - Parsing the dependency graph
    2020-03-30T15:49:12+08:00 DEBUG - Configuring internal queues
    2020-03-30T15:49:12+08:00 INFO - Suite server: url=https://elogin1:43041/ pid=34801
    2020-03-30T15:49:12+08:00 INFO - Run: (re)start=0 log=1
    2020-03-30T15:49:12+08:00 INFO - Cylc version: 7.8.4
    2020-03-30T15:49:12+08:00 INFO - Run mode: live
    2020-03-30T15:49:12+08:00 INFO - Initial point: 1
    2020-03-30T15:49:12+08:00 INFO - Final point: 1
    2020-03-30T15:49:12+08:00 INFO - Cold Start 1
    2020-03-30T15:49:12+08:00 DEBUG - [bar_024.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_054.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_060.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_048.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_045.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_042.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_009.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_027.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_000.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_003.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_021.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_006.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_015.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_012.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_051.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_057.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_018.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_039.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_036.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_030.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - [bar_033.1] -released to the task pool
    2020-03-30T15:49:12+08:00 DEBUG - BEGIN TASK PROCESSING
    2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -waiting => queued
    2020-03-30T15:49:12+08:00 DEBUG - 1 task(s) de-queued
    2020-03-30T15:49:12+08:00 INFO - [foo.1] -submit-num=01, owner@host=elogin1
    2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job'] ... # will invoke in batches, sizes=[1]
    2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -queued => ready
    2020-03-30T15:49:12+08:00 DEBUG - END TASK PROCESSING (took 0.0182700157166 seconds)
    2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:12+08:00 DEBUG - Loading site/user global config files
    2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job', '1/foo/01']
    2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:13+08:00 DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/users/gov/nea/jliu/cylc-run/remote/log/job 1/foo/01
    [jobs-submit ret_code] 0
    [jobs-submit out]
    [TASK JOB SUMMARY]2020-03-30T15:49:12+08:00|1/foo/01|0|34853
    [TASK JOB COMMAND]2020-03-30T15:49:12+08:00|1/foo/01|[STDOUT] 34853
    2020-03-30T15:49:13+08:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-03-30T15:49:12+08:00 for job(01)
    2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -ready => submitted
    2020-03-30T15:49:13+08:00 INFO - [foo.1] -health check settings: submission timeout=None
    2020-03-30T15:49:13+08:00 DEBUG - BEGIN TASK PROCESSING
    2020-03-30T15:49:13+08:00 DEBUG - 0 task(s) de-queued
    2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -forced spawning
    2020-03-30T15:49:13+08:00 DEBUG - END TASK PROCESSING (took 0.000676870346069 seconds)
    2020-03-30T15:49:13+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:14+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:15+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:16+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:17+08:00 DEBUG - Performing suite health check
    2020-03-30T15:49:18+08:00 DEBUG - Performing suite health check

  9. Here is the output of cylc check-software:

jliu@elogin1:suite $ cylc check-software
Checking your software...

Individual results:
==========================================================================================
Package (version requirements)                                     Outcome (version found)
==========================================================================================
                               *REQUIRED SOFTWARE*                                   
Python (2.6+, <3)................................FOUND & min. version MET (2.7.17.final.0)
 
         *OPTIONAL SOFTWARE for the GUI & dependency graph visualisation*             
Python:pygraphviz (any)........................................................FOUND (1.5)
graphviz (any)..............................................................FOUND (2.40.1)
Python:pygtk (2.0+)......................................FOUND & min. version MET (2.24.0)

              *OPTIONAL SOFTWARE for the HTTPS communications layer*                  
Python:requests (2.4.2+).................................FOUND & min. version MET (2.22.0)
Python:urllib3 (any)........................................................FOUND (1.25.8)
Python:OpenSSL (any)........................................................FOUND (19.1.0)

               *OPTIONAL SOFTWARE for the configuration templating*                   
Python:EmPy (any)............................................................FOUND (3.3.4)

                  *OPTIONAL SOFTWARE for the HTML documentation*                      
Python:sphinx (1.5.3+)....................................FOUND & min. version MET (1.8.5)
==========================================================================================

Summary:
                           ****************************                               
                              Core requirements: ok                                  
                              Full-functionality: ok                                  
                           ****************************                               

I’m guessing it may be related to OpenSSL or urllib3, but I’m not sure.

Any ideas?

Hi,

Just an update: I’ve confirmed that this issue is caused by OpenSSL.

I tested the SSL connection with openssl like this:

    openssl s_client -connect login1:43041 -CAfile ~/cylc-run/foo/.service/ssl.cert

It just showed ‘CONNECTED(00000003)’ and then hung instead of printing the SSL certificate, as if it had connected to a non-SSL service.
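
For anyone who wants to repeat this check without the openssl command line, here is a minimal sketch in Python 3 using only the standard library. The function name probe_tls is made up for illustration, and this is not how Cylc itself connects. Certificate verification is switched off on purpose, because the only question is whether the port completes a TLS handshake at all.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Try a TLS handshake with host:port and report what happened.

    Returns an (ok, detail) tuple. Certificates are NOT verified here:
    this only checks whether the server speaks TLS on that port.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, "handshake ok, protocol %s" % tls.version()
    except ssl.SSLError as exc:
        # The peer answered, but not with a usable TLS handshake.
        return False, "TLS handshake failed: %s" % exc
    except OSError as exc:
        # Refused, reset, or timed out before the handshake finished.
        return False, "connection problem: %s" % exc
```

With the host and port from the suite log above, the call would be probe_tls("elogin1", 43041). A timeout reported as a connection problem would match the hang seen with openssl s_client.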

Any ideas on how to fix the OpenSSL issue without root privileges? Do I need to rebuild OpenSSL and Python from scratch?

Thanks for your time

Hi @Jerry,

It’s not clear to me what the problem could be, sorry - I haven’t encountered (or heard other reports of) this sort of thing any time recently.

At NIWA we don’t run suites on our XC50 (it’s just used as a job host, for suites running on CS500 nodes) but I logged in to the XC50 just now and tried it, and it worked fine, but we have pyOpenSSL 16.0.0 on the system, not 19.1.0.

Hopefully someone else will have some ideas…

Hilary

Hmm. Interestingly, my work notebook (CentOS 7) has pyOpenSSL 0.13.1 installed on the system via the yum package manager. If I remove that (sudo yum remove pyOpenSSL) and then install it again via pip (sudo pip install pyOpenSSL), I get version 19.1.0 (same as yours) and cylc-7.8.4 no longer works. It seemed that Cylc clients thought the server was using unsecured http, which looks suspiciously like your problem.

We might have to figure out why pyOpenSSL 19.1.0 is not compatible with Cylc 7, if that is in fact the problem.

In the meantime you might have to figure out how to use an older version. (I don’t suppose you have both installed? E.g. an older one on the system, and 19.1 installed via pip locally?)
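
To see which pyOpenSSL a given Python actually picks up (a system package or a local pip install), something like the following sketch may help. It only reads version attributes exposed by pyOpenSSL and the standard ssl module; the function name ssl_stack_versions is hypothetical. It should be run with the same interpreter that Cylc uses (python2, going by the ps output above).

```python
import ssl


def ssl_stack_versions():
    """Return version info for the SSL libraries this interpreter can see."""
    info = {"stdlib ssl linked against": ssl.OPENSSL_VERSION}
    try:
        import OpenSSL  # the pyOpenSSL package
        from OpenSSL import SSL

        info["pyOpenSSL"] = OpenSSL.__version__
        # The import path shows whether a system or a user-local copy won.
        info["pyOpenSSL location"] = OpenSSL.__file__
        linked = SSL.SSLeay_version(SSL.SSLEAY_VERSION)
        if isinstance(linked, bytes):
            linked = linked.decode("ascii", "replace")
        info["pyOpenSSL linked against"] = linked
    except Exception as exc:  # not installed, or a broken install
        info["pyOpenSSL"] = "unavailable (%s)" % exc
    return info


if __name__ == "__main__":
    for name, value in sorted(ssl_stack_versions().items()):
        print("%s: %s" % (name, value))
```

If the reported location is under the home directory rather than the system site-packages, a user-local install is shadowing the system one.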

Hilary

Hi Hilary,

After changing the communication method to http, Cylc works fine.

Here is my guess:

OpenSSL 1.1.x is not backward compatible with 1.0.x. pyOpenSSL 19.1.0 may use the new 1.1.x API while Cylc 7 still uses the old 1.0.x API. That would cause the ‘bad handshake’ error, which stops Cylc clients from talking to the daemon.

Hi @Jerry,

Thanks for raising this, and nice detective work. Your guess sounds reasonable.

I’ve raised an issue on the Cylc repository: https://github.com/cylc/cylc-flow/issues/3546

The Cylc team is pretty strapped for time right now (working on Cylc 8) but this might be an easy fix. Any chance you’d like to take a look at it? If you do, note that this is not relevant to Cylc 8 which is now the master branch of the cylc/cylc-flow repository. You should clone the 7.8.x branch, where we are maintaining Cylc 7, and work on that. (Or you could just use your 7.8.4 release for initial investigations).

Regards,
Hilary

Quick question about this: are the SSL issues on the server side or on the client side?

I’m not sure, @Tim_Whitcomb. (I haven’t had a chance to take a look myself yet).

Hilary