I’m running this tutorial suite with Cylc 7.8.4 on a Cray XC50 box.
I need help with a weird issue: Cylc started with a blank GUI and showed the status “stopped with submitted”.
graph = "foo => BAR"
script = "/usr/bin/sleep 10" # fast
batch system = background
batch system = pbs
host = $(rose host-select "hpc")
-l select=1:ncpus=40:mem=190GB
-l walltime=00:01:00
-q = workq
-W umask=022
{% for FC in FCTIMES %}
inherit = BAR
script = "aprun -n 40 /usr/bin/sleep 10"
-N = bar_{{FC}}
{% endfor %}
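FCTIMES is not defined in the excerpt above. Judging only from the task names in the suite log further down (bar_000 up to bar_060), it looks like zero-padded values from 0 to 60 in steps of 3. Under that assumption, the names the Jinja2 loop generates can be sketched in Python as follows; `expand_bar_tasks` and its defaults are made up for illustration and are not part of the suite.

```python
def expand_bar_tasks(start=0, stop=60, step=3):
    """Return the task names the {% for FC in FCTIMES %} loop would create.

    Assumes FCTIMES holds zero-padded values start..stop (inclusive) in
    increments of step, which is inferred from the suite log only.
    """
    return ["bar_%03d" % value for value in range(start, stop + 1, step)]


# One task per FCTIMES entry, each inheriting from the BAR family.
for name in expand_bar_tasks():
    print(name)
```

With those assumed defaults the loop yields 21 tasks, which matches the number of bar_* tasks released to the task pool in the log.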
This suite used to run successfully.
It only recently started failing: it now starts with a blank GUI.
The status in the lower-left corner shows “stopped with submitted”.
The status in the lower-right corner shows “not connected”.
ps shows that the Cylc daemon is still running:
jliu 34801 0.1 0.0 1071716 40324 ? Sl 15:49 0:00 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-run remote -vvv
jliu 34820 0.3 0.0 588544 74196 pts/12 Sl 15:49 0:01 python2 /home/projects/17001770/app/cylc/cylc-flow-7.8.4/bin/cylc-gui remote -
netstat shows that it is also listening on the appropriate port:
jliu@elogin1:remote $ netstat -tulpn | grep 430
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0* LISTEN -
tcp 0 0* LISTEN -
tcp 0 0* LISTEN 34801/python2
tcp 0 0* LISTEN - -
log/job/1/foo/NN/job.status shows that ‘foo’ succeeded, but job.err contains this error message:
2020-03-30T15:49:14+08:00 WARNING - Message send failed, try 1 of 7: Cannot connect: https://elogin1:43041/put_messages: HTTPSConnectionPool(host='elogin1', port=43041): Max retries exceeded with url: /put_messages (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')",),))
retry in 5.0 seconds, timeout is 30.0 -
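To check whether the handshake also fails outside Cylc, something like the following could be run on the host where the job executed. It is only a sketch under a few assumptions: it needs Python 3 (not the Python 2 that Cylc 7 runs on), `probe_tls` is a name invented here, and certificate verification is switched off because, as far as I understand, Cylc 7 suites use self-signed certificates and only the handshake itself is of interest.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Attempt a TLS handshake with host:port and return (ok, detail)."""
    # Verification is disabled on purpose: this probe only tests whether
    # the handshake completes, not whether the certificate is trusted.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, "handshake ok: %s" % tls.version()
    except OSError as exc:
        # ssl.SSLError, connection resets, refusals and timeouts are all
        # subclasses of OSError, so they end up here.
        return False, "%s: %s" % (type(exc).__name__, exc)
```

Calling `probe_tls("elogin1", 43041)` with the host and port from the message above would show whether the reset happens for any TLS client or only for the job's own message call.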
The remaining tasks were never triggered.
Running ‘cylc graph remote’ plots the run-time graph without any problem.
Here is the suite log:
2020-03-30T15:49:11+08:00 DEBUG - Reading file /home/users/gov/nea/jliu/cylc-run/remote/suite.rc
2020-03-30T15:49:11+08:00 DEBUG - Processing with Jinja2
2020-03-30T15:49:11+08:00 DEBUG - Processed configuration dumped: /home/users/gov/nea/jliu/cylc-run/remote/suite.rc.processed
2020-03-30T15:49:11+08:00 DEBUG - Section already encountered: [cylc]
2020-03-30T15:49:11+08:00 DEBUG - Expanding [runtime] namespace lists and parameters
2020-03-30T15:49:11+08:00 DEBUG - Parsing the runtime namespace hierarchy
2020-03-30T15:49:12+08:00 DEBUG - Parsing [special tasks]
2020-03-30T15:49:12+08:00 DEBUG - Parsing the dependency graph
2020-03-30T15:49:12+08:00 DEBUG - Configuring internal queues
2020-03-30T15:49:12+08:00 INFO - Suite server: url=https://elogin1:43041/ pid=34801
2020-03-30T15:49:12+08:00 INFO - Run: (re)start=0 log=1
2020-03-30T15:49:12+08:00 INFO - Cylc version: 7.8.4
2020-03-30T15:49:12+08:00 INFO - Run mode: live
2020-03-30T15:49:12+08:00 INFO - Initial point: 1
2020-03-30T15:49:12+08:00 INFO - Final point: 1
2020-03-30T15:49:12+08:00 INFO - Cold Start 1
2020-03-30T15:49:12+08:00 DEBUG - [bar_024.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_054.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_060.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_048.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_045.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_042.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_009.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_027.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_000.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_003.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_021.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_006.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_015.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_012.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_051.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_057.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_018.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_039.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_036.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_030.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - [bar_033.1] -released to the task pool
2020-03-30T15:49:12+08:00 DEBUG - BEGIN TASK PROCESSING
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -waiting => queued
2020-03-30T15:49:12+08:00 DEBUG - 1 task(s) de-queued
2020-03-30T15:49:12+08:00 INFO - [foo.1] -submit-num=01, owner@host=elogin1
2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job'] … # will invoke in batches, sizes=[1]
2020-03-30T15:49:12+08:00 DEBUG - [foo.1] -queued => ready
2020-03-30T15:49:12+08:00 DEBUG - END TASK PROCESSING (took 0.0182700157166 seconds)
2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:12+08:00 DEBUG - Loading site/user global config files
2020-03-30T15:49:12+08:00 DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/users/gov/nea/jliu/cylc-run/remote/log/job', '1/foo/01']
2020-03-30T15:49:12+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:13+08:00 DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/users/gov/nea/jliu/cylc-run/remote/log/job 1/foo/01
[jobs-submit ret_code] 0
[jobs-submit out]
[TASK JOB SUMMARY]2020-03-30T15:49:12+08:00|1/foo/01|0|34853
[TASK JOB COMMAND]2020-03-30T15:49:12+08:00|1/foo/01|[STDOUT] 34853
2020-03-30T15:49:13+08:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-03-30T15:49:12+08:00 for job(01)
2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -ready => submitted
2020-03-30T15:49:13+08:00 INFO - [foo.1] -health check settings: submission timeout=None
2020-03-30T15:49:13+08:00 DEBUG - BEGIN TASK PROCESSING
2020-03-30T15:49:13+08:00 DEBUG - 0 task(s) de-queued
2020-03-30T15:49:13+08:00 DEBUG - [foo.1] -forced spawning
2020-03-30T15:49:13+08:00 DEBUG - END TASK PROCESSING (took 0.000676870346069 seconds)
2020-03-30T15:49:13+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:14+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:15+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:16+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:17+08:00 DEBUG - Performing suite health check
2020-03-30T15:49:18+08:00 DEBUG - Performing suite health check
Here is the output of cylc check-software:
jliu@elogin1:suite $ cylc check-software
Checking your software...
Individual results:
Package (version requirements) Outcome (version found)
Python (2.6+, <3)................................FOUND & min. version MET (2.7.17.final.0)
*OPTIONAL SOFTWARE for the GUI & dependency graph visualisation*
Python:pygraphviz (any)........................................................FOUND (1.5)
graphviz (any)..............................................................FOUND (2.40.1)
Python:pygtk (2.0+)......................................FOUND & min. version MET (2.24.0)
*OPTIONAL SOFTWARE for the HTTPS communications layer*
Python:requests (2.4.2+).................................FOUND & min. version MET (2.22.0)
Python:urllib3 (any)........................................................FOUND (1.25.8)
Python:OpenSSL (any)........................................................FOUND (19.1.0)
*OPTIONAL SOFTWARE for the configuration templating*
Python:EmPy (any)............................................................FOUND (3.3.4)
*OPTIONAL SOFTWARE for the HTML documentation*
Python:sphinx (1.5.3+)....................................FOUND & min. version MET (1.8.5)
Core requirements: ok
Full-functionality: ok
I’m guessing it may be related to OpenSSL or urllib3, but I’m not sure.
Any ideas?
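The “Python:OpenSSL” entry reported by check-software appears to be the pyOpenSSL wrapper version, not the OpenSSL library the interpreter is linked against. A small sketch like the one below could list both, and would help when comparing the login node with the compute nodes. `library_versions` is a hypothetical helper name; the code uses only the standard library and should work under both Python 2.7 and Python 3.

```python
import importlib
import ssl


def library_versions(names=("requests", "urllib3", "OpenSSL")):
    """Return a dict of version strings for the TLS-related libraries."""
    # ssl.OPENSSL_VERSION is the OpenSSL build the interpreter is linked to.
    versions = {"ssl (stdlib)": ssl.OPENSSL_VERSION}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for library, version in sorted(library_versions().items()):
    print("%s: %s" % (library, version))
```

Running it with the same python2 that launches cylc-run, on both elogin1 and a compute node, would show whether the two environments differ.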