Hi,
I’m running this tutorial suite with Cylc 7.8.4 on a Cray XC50 box.
I need help with a weird issue: Cylc started with a blank GUI and showed the status “stopped with submitted”. Here is the suite.rc:
[cylc]
[scheduling]
    [[dependencies]]
        graph = "foo => BAR"
[runtime]
    [[foo]]
        script = "/usr/bin/sleep 10" # fast
        [[[job]]]
            batch system = background
    [[BAR]]
        [[[job]]]
            batch system = pbs
        [[[remote]]]
            host = $(rose host-select "hpc")
        [[[directives]]]
            -l select=1:ncpus=40:mem=190GB
            -l walltime=00:01:00
            -q = workq
            -W umask=022
{% for FC in FCTIMES %}
    [[bar_{{FC}}]]
        inherit = BAR
        script = "aprun -n 40 /usr/bin/sleep 10"
        [[[directives]]]
            -N = bar_{{FC}}
{% endfor %}
This suite used to run successfully. It only recently started failing, and it now starts with a blank GUI, like this:
1. The status in the lower-left corner showed “stopped with submitted”.
2. The status in the lower-right corner showed “not connected”.
3. ps showed that the Cylc daemon was still running:
jliu 34801 0.1 0.0 1071716 40324 ? Sl 15:49 0:00 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-run remote -vvv
jliu 34820 0.3 0.0 588544 74196 pts/12 Sl 15:49 0:01 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-gui remote
4. netstat showed that it was also listening on the appropriate port:
jliu@elogin1:remote $ netstat -tulpn | grep 430
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 10.8.0.5:43054 0.0.0.0:* LISTEN -
tcp 0 0 10.8.0.5:43032 0.0.0.0:* LISTEN -
tcp 0 0 10.8.0.5:43041 0.0.0.0:* LISTEN 34801/python2
tcp 0 0 10.8.0.5:43051 0.0.0.0:* LISTEN -
5. log/job/1/foo/NN/job.status showed that ‘foo’ succeeded, but job.err contained this error message:
2020-03-30T15:49:14+08:00 WARNING - Message send failed, try 1 of 7: Cannot connect: https://elogin1:43041/put_messages: HTTPSConnectionPool(host='elogin1', port=43041): Max retries exceeded with url: /put_messages (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')",),))
retry in 5.0 seconds, timeout is 30.0
6. The remaining tasks were never triggered.
7. Running ‘cylc graph remote’ plots the run-time graph without any problem.
8. Here is the suite log:
2020-03-30T15:49:11+08:00 DEBUG - Reading file /home/users/gov/nea/jliu/cylc-run/remote/suite.rc
2020-03-30T15:49:11+08:00 DEBUG - Processing with Jinja2
2020-03-30T15:49:11+08:00 DEBUG - Processed configuration dumped: /home/users/gov/nea/jliu/cylc-run/remote/suite.rc.processed
2020-03-30T15:49:11+08:00 DEBUG - Section already encountered: [cylc]
2020-03-30T15:49:11+08:00 DEBUG - Expanding [runtime] namespace lists and parameters
2020-03-30T15:49:11+08:00 DEBUG - Parsing the runtime namespace hierarchy
2020-03-30T15:49:12+08:00 DEBUG - Parsing [special tasks]
2020-03-30T15:49:12+08:00 DEBUG - Parsing the dependency graph
2020-03-30T15:49:12+08:00 DEBUG - Configuring internal queues
2020-03-30T15:49:12+08:00 INFO - Suite server: url=https://elogin1:43041/ pid=34801
2020-03-30T15:49:12+08:00 INFO - Run: (re)start=0 log=1
2020-03-30T15:49:12+08:00 INFO - Cylc version: 7.8.4
2020-03-30T15:49:12+08:00 INFO - Run mode: live
2020-03-30T15:49:12+08:00 INFO - Initial point: 1
2020-03-30T15:49:12+08:00 INFO - Final point: 1
2020-03-30T15:49:12+08:00 INFO - Cold Start 1
2020-03-30T15:49:12+08:00 DEBUG - [bar_024.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_054.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_060.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_048.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_045.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_042.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_009.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_027.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_000.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_003.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_021.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_006.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_015.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_012.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_051.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_057.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_018.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_039.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_036.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_030.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_033.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - BEGIN TASK PROCESSING
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -waiting => queued
2020-03-30T15:49:12+08:00 DEBUG - 1 task(s) de-queued
2020-03-30T15:49:12+08:00 INFO - [foo.1] -submit-num=01, owner@host=elogin1
2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job'] ... # will invoke in batches, sizes=[1]
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -queued => ready
2020-03-30T15:49:12+08:00 DEBUG - END TASK PROCESSING (took 0.0182700157166 seconds)
2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:12+08:00 DEBUG - Loading site/user global config files
2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job', '1/foo/01']
2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:13+08:00 DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/users/gov/nea/jliu/cylc-run/remote/log/job 1/foo/01
[jobs-submit ret_code] 0
[jobs-submit out]
[TASK JOB SUMMARY]2020-03-30T15:49:12+08:00|1/foo/01|0|34853
[TASK JOB COMMAND]2020-03-30T15:49:12+08:00|1/foo/01|[STDOUT] 34853
2020-03-30T15:49:13+08:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-03-30T15:49:12+08:00 for job(01)
2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -ready => submitted
2020-03-30T15:49:13+08:00 INFO - [foo.1] -health check settings: submission timeout=None
2020-03-30T15:49:13+08:00 DEBUG - BEGIN TASK PROCESSING
2020-03-30T15:49:13+08:00 DEBUG - 0 task(s) de-queued
2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -forced spawning
2020-03-30T15:49:13+08:00 DEBUG - END TASK PROCESSING (took 0.000676870346069 seconds)
2020-03-30T15:49:13+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:14+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:15+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:16+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:17+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:18+08:00 DEBUG - Performing suite health check
9. Here is the output of cylc check-software:
jliu@elogin1:suite $ cylc check-software
Checking your software...
Individual results:
==========================================================================================
Package (version requirements) Outcome (version found)
==========================================================================================
*REQUIRED SOFTWARE*
Python (2.6+, <3)................................FOUND & min. version MET (2.7.17.final.0)
*OPTIONAL SOFTWARE for the GUI & dependency graph visualisation*
Python:pygraphviz (any)........................................................FOUND (1.5)
graphviz (any)..............................................................FOUND (2.40.1)
Python:pygtk (2.0+)......................................FOUND & min. version MET (2.24.0)
*OPTIONAL SOFTWARE for the HTTPS communications layer*
Python:requests (2.4.2+).................................FOUND & min. version MET (2.22.0)
Python:urllib3 (any)........................................................FOUND (1.25.8)
Python:OpenSSL (any)........................................................FOUND (19.1.0)
*OPTIONAL SOFTWARE for the configuration templating*
Python:EmPy (any)............................................................FOUND (3.3.4)
*OPTIONAL SOFTWARE for the HTML documentation*
Python:sphinx (1.5.3+)....................................FOUND & min. version MET (1.8.5)
==========================================================================================
Summary:
****************************
Core requirements: ok
Full-functionality: ok
****************************
I’m guessing it may be related to OpenSSL or urllib3, but I’m not sure.
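One check that might help separate a TLS problem from a plain network problem is to attempt a handshake against the suite server port by hand. The following is only a minimal sketch, not anything Cylc provides: it is standalone Python 3 using the standard library (Cylc 7.8.4 itself runs under Python 2), the function name probe_tls is made up for illustration, and it assumes the suite server presents a self-signed certificate, so certificate verification is switched off and only the handshake is exercised.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Attempt one TLS handshake and return (ok, detail).

    ok is True if the handshake completed, in which case detail is the
    negotiated protocol version. Otherwise detail describes the failure.
    """
    # Assumption: the server certificate is self-signed, so verification
    # is disabled. Only the handshake itself is being tested here.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, str(tls.version())
    except ssl.SSLError as exc:
        # TLS-level failure, e.g. a protocol or cipher mismatch.
        return False, "SSL error: %s" % exc
    except OSError as exc:
        # Socket-level failure: refused, reset, timeout or name lookup.
        return False, "socket error: %s" % exc
```

Calling probe_tls("elogin1", 43041) on the host where the job ran, with the host and port taken from the suite log above, would show whether the reset also happens outside of the requests/urllib3/pyOpenSSL stack. A reset here too would point at the server side or the network rather than at the client libraries.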
Any ideas?