[Swift-devel] problem with max num jobs, array entries?
Mihael Hategan
hategan at mcs.anl.gov
Mon Mar 7 12:34:51 CST 2011
Hangs usually have one of two causes:
1. A Java deadlock. You can check whether this is the case by getting
a stack dump of the swift process with jstack.
2. A swift/karajan deadlock. I added some code to trunk yesterday to
detect these and dump the state to the log.
Apart from those, a silent OOM (out of memory) is also possible, but I
doubt that is the case with 100 jobs.
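
The jstack check in item 1 could be scripted roughly as below. This is only a sketch under stated assumptions: it assumes a HotSpot-style jstack on the PATH, whose dump contains the phrase "Java-level deadlock" when the JVM detects a monitor deadlock. The helper names (jstack_dump, has_java_deadlock) are made up for illustration and are not part of Swift. A swift/karajan deadlock (item 2) would not show up this way.

```python
import subprocess
import sys

# Assumption: HotSpot's jstack prints a line such as
# "Found one Java-level deadlock:" when it detects a deadlock.
DEADLOCK_MARKER = "Java-level deadlock"


def jstack_dump(pid: int) -> str:
    """Return the thread dump of the JVM with the given pid (hypothetical helper)."""
    # "-l" asks jstack for the long listing, which includes lock information.
    result = subprocess.run(
        ["jstack", "-l", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_java_deadlock(dump: str) -> bool:
    """Return True if the thread dump text reports a Java-level deadlock."""
    return DEADLOCK_MARKER in dump


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_deadlock.py <pid of the swift java process>
    dump = jstack_dump(int(sys.argv[1]))
    if has_java_deadlock(dump):
        print("jstack reports a Java-level deadlock")
    else:
        print("no Java-level deadlock reported")
```

If no deadlock is reported, the dump is still worth reading by hand to see which threads are blocked and on what.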
Mihael
On Mon, 2011-03-07 at 12:31 -0500, Glen Hocky wrote:
> hey Mike, devs
>
> i was wondering if you could help me track something down. i may not
> have noticed this before because of the way I was running my jobs but
> i'm having a problem running more than ~100 jobs w/ my swift script
> (with pbs or pbs+coasters). it just hangs with
> "Progress: "
> "Progress: "
> "Progress: "
>
>
> in the swift log it just stalls at this point
> ...
> 2011-03-07 10:26:53,280-0600 INFO SetFieldValue Set: force=FALSE
> 2011-03-07 10:26:53,284-0600 INFO VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,285-0600 INFO VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,285-0600 INFO SetFieldValue Set: printfreq=500
> 2011-03-07 10:26:53,285-0600 INFO VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,285-0600 INFO VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,285-0600 INFO SetFieldValue Set: nmodels=5
> 2011-03-07 10:26:53,286-0600 INFO VDLFunction FUNCTION: arg()
> 2011-03-07 10:26:53,286-0600 INFO VDLFunction FUNCTION: toint()
> 2011-03-07 10:26:53,286-0600 INFO SetFieldValue Set: nsub=20
>
> whereas when i decrease the number of total jobs it goes to
> ...
> 2011-03-07 11:04:34,001-0600 INFO SetFieldValue Set: nmodels=4
> 2011-03-07 11:04:34,001-0600 INFO SetFieldValue Set: temperature=0.9
> 2011-03-07 11:04:34,002-0600 INFO SetFieldValue Set: rundir=/home/hockyg/reichman/glassy_dynamics/code/runs/overlaps/replica_exchange/code/swift/run_beagle
> 2011-03-07 11:04:34,001-0600 INFO SetFieldValue Set: label=1
> 2011-03-07 11:04:34,001-0600 INFO SetFieldValue Set: radii=unnamed SwiftScript value.$[]/1
> 2011-03-07 11:04:34,002-0600 INFO SetFieldValue Set: nsub=24
> 2011-03-07 11:04:39,581-0600 INFO AbstractDataNode Found data modelIn.$[]/1.[0][3][19].inputstructure
> 2011-03-07 11:04:39,581-0600 INFO AbstractDataNode Found data modelIn.$[]/1.[0][4][17].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO AbstractDataNode Found data modelIn.$[]/1.[0][4][18].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO AbstractDataNode Found data modelIn.$[]/1.[0][4][19].inputstructure
> 2011-03-07 11:04:39,582-0600 INFO AbstractDataNode Found data modelIn.$[]/1.[0][0][20].inputstructure
>
>
> any ideas of where to look to troubleshoot this?