[Swift-devel] reproducible problem running under coasters on ranger from communicado

Glen Max Hocky hockyg at uchicago.edu
Tue Apr 28 17:19:31 CDT 2009


I had the following problem this morning and have just reproduced it under 
Mike's login. (I was showing him how to run the latest code and wanted to see 
whether the problem could be recreated.)

This is all with the latest svn version.
Hundreds of jobs were in the active state, running on an equivalent number of 
CPUs on Ranger. All of a sudden, all but about 100 switched to a failed state. 
The run then proceeded fairly normally until it crashed with a "coaster failed 
to start" error.

Clips of the errors are below.

All logs are in 
/home/wilde/oops/swift/output/rangeroutdir.20
and the coaster logs are in 
/home/wilde/oops/swift/output/rangeroutdir.20/coaster_logs

------------------------------------

Progress:  Selecting site:3  Submitted:784  Active:113  Finished successfully:9
Progress:  Selecting site:3  Submitted:512  Active:385  Finished successfully:9
Progress:  Selecting site:3  Submitted:379  Active:518  Finished successfully:9
Progress:  Selecting site:3  Submitted:337  Active:560  Finished successfully:9
Progress:  Selecting site:3  Submitted:337  Active:560  Finished successfully:9
Progress:  Selecting site:3  Submitted:337  Active:560  Finished successfully:9
Progress:  Selecting site:3  Submitted:337  Active:559  Finished successfully:9 Failed but can retry:1
Progress:  Selecting site:3  Submitted:335  Active:559  Finished successfully:9 Failed but can retry:3
Progress:  Selecting site:3  Submitted:335  Active:543  Finished successfully:9 Failed but can retry:19
Progress:  Selecting site:3  Submitted:333  Active:543  Finished successfully:9 Failed but can retry:21
Progress:  Selecting site:3  Submitted:333  Active:527  Finished successfully:9 Failed but can retry:37
Progress:  Selecting site:3  Submitted:333  Active:495  Finished successfully:9 Failed but can retry:69
Progress:  Selecting site:3  Submitted:332  Active:481  Finished successfully:9 Failed but can retry:84
Progress:  Selecting site:3  Submitted:332  Active:479  Finished successfully:9 Failed but can retry:86
Progress:  Selecting site:3  Submitted:331  Active:465  Finished successfully:9 Failed but can retry:101
Progress:  Selecting site:3  Submitted:331  Active:463  Finished successfully:9 Failed but can retry:103
Progress:  Selecting site:3  Submitted:330  Active:447  Finished successfully:9 Failed but can retry:120
Progress:  Selecting site:3  Submitted:329  Active:433  Finished successfully:9 Failed but can retry:135
Progress:  Selecting site:3  Submitted:329  Active:415  Finished successfully:9 Failed but can retry:153
Progress:  Selecting site:3  Submitted:329  Active:399  Finished successfully:9 Failed but can retry:169
Progress:  Selecting site:3  Submitted:329  Active:383  Finished successfully:9 Failed but can retry:185
Progress:  Selecting site:3  Submitted:328  Active:367  Finished successfully:9 Failed but can retry:202
Progress:  Selecting site:3  Submitted:327  Active:351  Finished successfully:9 Failed but can retry:219
Progress:  Selecting site:3  Submitted:326  Active:336  Finished successfully:9 Failed but can retry:235
Progress:  Selecting site:3  Submitted:326  Active:319  Finished successfully:9 Failed but can retry:252
Progress:  Selecting site:3  Submitted:220  Active:408  Finished successfully:9 Failed but can retry:269
Progress:  Selecting site:3  Submitted:219  Active:363  Finished successfully:9 Failed but can retry:315
Progress:  Selecting site:3  Submitted:216  Active:334  Finished successfully:9 Failed but can retry:347
Progress:  Selecting site:3  Submitted:214  Active:303  Finished successfully:9 Failed but can retry:380
Progress:  Selecting site:3  Submitted:214  Active:287  Finished successfully:9 Failed but can retry:396
Progress:  Selecting site:3  Submitted:214  Active:271  Finished successfully:9 Failed but can retry:412
Progress:  Selecting site:3  Submitted:213  Active:255  Finished successfully:9 Failed but can retry:429
Progress:  Selecting site:3  Submitted:213  Active:239  Finished successfully:9 Failed but can retry:445
Progress:  Selecting site:3  Submitted:213  Active:223  Finished successfully:9 Failed but can retry:461
Progress:  Selecting site:3  Submitted:213  Active:207  Finished successfully:9 Failed but can retry:477
Progress:  Selecting site:3  Submitted:212  Active:207  Finished successfully:9 Failed but can retry:478
Progress:  Selecting site:3  Submitted:212  Active:175  Finished successfully:9 Failed but can retry:510
Progress:  Selecting site:3  Submitted:211  Active:143  Finished successfully:9 Failed but can retry:543
Progress:  Selecting site:3  Submitted:211  Active:112  Finished successfully:9 Failed but can retry:574
Progress:  Selecting site:3  Submitted:211  Active:111  Finished successfully:9 Failed but can retry:575
Progress:  Selecting site:3  Submitted:211  Active:96  Finished successfully:9 Failed but can retry:590
Progress:  Selecting site:3  Submitted:211  Active:96  Finished successfully:9 Failed but can retry:590
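
For anyone who wants to quantify the drop rather than eyeball it, here is a 
rough Python sketch that turns "Progress:" lines into per-sample counts. It is 
not part of Swift; the names parse_progress and active_drops are made up for 
this illustration, and the parser only assumes the "State name:count" layout 
visible in the clip above. The sample text is four lines picked from that clip 
(not consecutive samples). Mail-wrapped lines are rejoined before parsing.

```python
import re

# One "State name:count" pair, e.g. "Failed but can retry:590".
# Assumption: state names contain only letters and spaces.
FIELD = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(\d+)")

# Four lines selected from the progress output quoted in this message.
SAMPLE = """\
Progress:  Selecting site:3  Submitted:337  Active:560  Finished successfully:9
Progress:  Selecting site:3  Submitted:337  Active:559  Finished successfully:9
Failed but can retry:1
Progress:  Selecting site:3  Submitted:326  Active:319  Finished successfully:9
Failed but can retry:252
Progress:  Selecting site:3  Submitted:211  Active:96  Finished successfully:9
Failed but can retry:590
"""


def parse_progress(text):
    """Return one {state: count} dict per 'Progress:' record in text."""
    flat = " ".join(text.split())  # undo line wrapping added by the mailer
    records = []
    for chunk in flat.split("Progress:")[1:]:
        records.append({name.strip(): int(n) for name, n in FIELD.findall(chunk)})
    return records


def active_drops(records):
    """Return the decrease in Active between consecutive records."""
    actives = [r.get("Active", 0) for r in records]
    return [prev - cur for prev, cur in zip(actives, actives[1:])]


records = parse_progress(SAMPLE)
for r in records:
    # A state that is absent from a line is treated as zero.
    print(r.get("Active", 0), r.get("Failed but can retry", 0))
print("drops in Active:", active_drops(records))
```

Feeding it the full progress output instead of SAMPLE gives the whole 
trajectory, which makes it easier to line up the start of the failures with 
timestamps in the coaster logs.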



-----------------------------------
Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/a on ranger
Progress:  Submitted:801  Active:44  Finished successfully:61 Failed but can retry:3
Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/x on ranger
Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/l on ranger
Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/v on ranger
Progress:  Stage in:1  Submitted:802  Active:42  Checking status:1  Finished successfully:61 Failed but can retry:2
Failed to transfer wrapper log from oops-20090428-1642-ils1yrj8/info/3 on ranger
Execution failed:
        Exception in runramaSpeed:
Arguments: [input/fasta/T1af7.fasta, home/wilde/oops/swift/output/rangeroutdir.20/T1af7/T1af7.ST25.TU200.0000.secseq, input/native/T1af7.pdb, input/rama/T1af7.rama_map, home/wilde/oops/swift/output/rangeroutdir.20/T1af7//ST25.TU200/0000/01/64/T1af7.ST25.TU200.0000.0164.pdt, home/wilde/oops/swift/output/rangeroutdir.20/T1af7//ST25.TU200/0000/01/64/T1af7.ST25.TU200.0000.0164.rmsd, 164, DEFAULT_INIT_TEMP_=_25, TEMP_UPDATE_INTERVAL_=_200, MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30]
Host: ranger
Directory: oops-20090428-1642-ils1yrj8/jobs/3/runramaSpeed-383qd2aj
stderr.txt: 

stdout.txt: 

----

Caused by:
        Failed to start worker: Worker ended prematurely
Cleaning up...
Shutting down service at https://129.114.50.163:49375
Got channel MetaChannel: 3994917 -> GSSSChannel-null(1)
- Done
------------------------------------------


