[Swift-devel] Hangchecker tweak
Ketan Maheshwari
ketancmaheshwari at gmail.com
Tue Jul 5 11:38:46 CDT 2011
On Wed, Jun 29, 2011 at 1:32 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> I strongly suspect that the hangs are not due to the hang checker.
>
This may be right. After many experiments (about 20) with large-scale (up to
60-slot, 4-node) submissions with trunk, it seems that jobs simply stop
being submitted after a small, arbitrary number of submissions.
Things that I observe with trunk on Beagle:
1. A disproportionate number of stage-ins happen compared to the intended
number of jobs: for a 10-slot, 4-node setup, 4980 stage-ins.
2. The submit file created contained "node=" lines for 4-node jobs and not
for 2-node ones. I changed the use.mppwidth=false entry in
provider-pbs.properties to true. However, I do not know why this was
happening for 4-node jobs and not for the 2-node ones.
3. I see intermittent write failures from PBS to the swift.workdir with
"failed to transfer wrapper log messages".
Debugging more.
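For reference, the change described in item 2 amounts to a one-line edit in
provider-pbs.properties. This is only a sketch of that single entry; any
other entries in the file are left as they were:

```
# provider-pbs.properties
# was: use.mppwidth=false
use.mppwidth=true
```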
Ketan
> On Wed, 2011-06-29 at 13:02 -0500, Ketan Maheshwari wrote:
> >
> > I built Swift with this change and submitted a run with throttle value
> > of 3600 app tasks. It seems to be working. I see 3600 PBS jobs have
> > been submitted to Beagle.
> >
> >
> > On Wed, Jun 29, 2011 at 11:48 AM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> > On Wed, 2011-06-29 at 11:36 -0500, Ketan Maheshwari wrote:
> >
> > > To confirm the hypothesis, could you indicate how could I
> > disable the
> > > hangchecker or increase the time period before it gets
> > invoked.
> >
> >
> > in Loader.main(), comment out the 'new
> > HangChecker(stack).start()' line.
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
>
>
>
--
Ketan