[Swift-devel] problem with max num jobs, array entries?

Mihael Hategan hategan at mcs.anl.gov
Mon Mar 7 12:34:51 CST 2011


Usually, hangs occur for one of two major reasons:
1. A Java deadlock. You can find out whether this is the case by getting
a stack dump of the Swift process with jstack.
2. A Swift/Karajan deadlock. I added some code to trunk yesterday to
detect these and dump the state to the log.
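
For the first case, the check can be scripted. What follows is only a
minimal sketch, not part of Swift: it assumes jstack (from the JDK) is on
the PATH, that you already know the PID of the Swift JVM (for example from
jps or ps), and that jstack announces a detected lock cycle with a line
containing "Java-level deadlock". The function names and the script name
are made up for illustration.

```python
import subprocess
import sys

# Assumption: jstack reports a detected lock cycle with a line that
# contains this phrase (e.g. "Found one Java-level deadlock:").
DEADLOCK_MARKER = "Java-level deadlock"


def has_java_deadlock(dump_text):
    """Return True if a jstack thread dump reports a Java-level deadlock."""
    return DEADLOCK_MARKER in dump_text


def dump_threads(pid):
    """Run `jstack <pid>` and return its standard output as text."""
    result = subprocess.run(
        ["jstack", str(pid)],
        capture_output=True,
        text=True,
        check=True,  # raise if jstack cannot attach to the process
    )
    return result.stdout


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_deadlock.py <swift-jvm-pid>
    dump = dump_threads(int(sys.argv[1]))
    if has_java_deadlock(dump):
        print("deadlock reported")
    else:
        print("no deadlock reported")
```

A negative result only rules out the first case. A Swift/Karajan deadlock
would not show up as a Java-level deadlock in the thread dump, so the log
still has to be checked for that.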

Apart from that, there is also the possibility of a silent OOM (the JVM
running out of memory without a visible error), but I doubt that is the
case with 100 jobs.

Mihael

On Mon, 2011-03-07 at 12:31 -0500, Glen Hocky wrote:
> Hey Mike, devs,
> 
> I was wondering if you could help me track something down. I may not
> have noticed this before because of the way I was running my jobs, but
> I'm having a problem running more than ~100 jobs with my Swift script
> (with PBS or PBS+coasters). It just hangs with
> "Progress:                            "
> "Progress:                            "
> "Progress:                            "
> 
> 
> In the Swift log it just stalls at this point:
> ...
> 2011-03-07 10:26:53,280-0600 INFO  SetFieldValue Set: force=FALSE
> 2011-03-07 10:26:53,284-0600 INFO  VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,285-0600 INFO  VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,285-0600 INFO  SetFieldValue Set: printfreq=500
> 2011-03-07 10:26:53,285-0600 INFO  VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,285-0600 INFO  VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,285-0600 INFO  SetFieldValue Set: nmodels=5
> 2011-03-07 10:26:53,286-0600 INFO  VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,286-0600 INFO  VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,286-0600 INFO  SetFieldValue Set: nsub=20
> 
> 
> 
> 
> whereas when I decrease the total number of jobs, it gets to
> ...
> 2011-03-07 11:04:34,001-0600 INFO  SetFieldValue Set: nmodels=4
> 2011-03-07 11:04:34,001-0600 INFO  SetFieldValue Set: temperature=0.9
> 2011-03-07 11:04:34,002-0600 INFO  SetFieldValue Set: rundir=/home/hockyg/reichman/glassy_dynamics/code/runs/overlaps/replica_exchange/code/swift/run_beagle
> 2011-03-07 11:04:34,001-0600 INFO  SetFieldValue Set: label=1
> 2011-03-07 11:04:34,001-0600 INFO  SetFieldValue Set: radii=unnamed SwiftScript value.$[]/1
> 2011-03-07 11:04:34,002-0600 INFO  SetFieldValue Set: nsub=24
> 2011-03-07 11:04:39,581-0600 INFO  AbstractDataNode Found data modelIn.$[]/1.[0][3][19].inputstructure
> 2011-03-07 11:04:39,581-0600 INFO  AbstractDataNode Found data modelIn.$[]/1.[0][4][17].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO  AbstractDataNode Found data modelIn.$[]/1.[0][4][18].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO  AbstractDataNode Found data modelIn.$[]/1.[0][4][19].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO  AbstractDataNode Found data modelIn.$[]/1.[0][0][20].inputstructure
> 
> 
> Any ideas on where to look to troubleshoot this?