[Swift-devel] Hangchecker tweak

Ketan Maheshwari ketancmaheshwari at gmail.com
Tue Jul 5 11:38:46 CDT 2011


On Wed, Jun 29, 2011 at 1:32 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> I strongly suspect that the hangs are not due to the hang checker.
>

This may be right. After many experiments (about 20) with large-scale (up to
60-slot, 4-node) submissions with trunk, it seems that jobs simply stop being
submitted after a small, arbitrary number of submissions.

Things that I observe with trunk on Beagle:

1. A disproportionate number of stage-ins happens compared to the intended
number of jobs: for a 10-slot, 4-node setup, 4980 stage-ins.

2. The submit file created contained "node=" lines for 4-node jobs and not
for 2-node ones. I changed the use.mppwidth=false entry in
provider-pbs.properties to true. However, I do not know why this was
happening for 4-node jobs and not for the 2-node ones.

3. I see intermittent write failures from PBS to the swift.workdir with
"failed to transfer wrapper log messages".

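For reference, the change described in item 2 is a one-line edit to
provider-pbs.properties. This is only a sketch of that edit, assuming the
file uses plain key=value properties syntax; the comment lines are mine and
nothing else in the file was touched:

```properties
# provider-pbs.properties
# was: use.mppwidth=false
use.mppwidth=true
```
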
Debugging more.

Ketan



> On Wed, 2011-06-29 at 13:02 -0500, Ketan Maheshwari wrote:
> >
> > I built Swift with this change and submitted a run with throttle value
> > of 3600 app tasks. It seems to be working. I see 3600 PBS jobs have
> > been submitted to Beagle.
> >
> >
> > On Wed, Jun 29, 2011 at 11:48 AM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         On Wed, 2011-06-29 at 11:36 -0500, Ketan Maheshwari wrote:
> >
> >         > To confirm the hypothesis, could you indicate how could I
> >         disable the
> >         > hangchecker or increase the time period before it gets
> >         invoked.
> >
> >
> >         in Loader.main(), comment out the 'new
> >         HangChecker(stack).start()' line.
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
>
>
>


-- 
Ketan