[Swift-user] Edison: pbs job starts but Swift unresponsive

Ketan Maheshwari ketan at mcs.anl.gov
Tue Jan 6 13:30:00 CST 2015


Thanks. Indeed, I picked the address from the Swift log, thinking it was the
one the worker was able to connect to, but maybe that was not the case.

This time I picked the one that the worker log showed as connected, and
that worked.

I found from the worker log that the worker tries to connect too many
times, e.g.:

$ grep 'Trying other addresses' /scratch2/scratchdirs/ketan/workerlogs/worker-0106-5710080-000000.log | wc -l
6334
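
In case it helps anyone looking at a similar log, here is a rough sketch for
seeing which addresses show up in a worker log and how often. It assumes only
that addresses appear as dotted quads (optionally with a port); the pattern
knows nothing about the real worker log format, and tally_addresses is just a
name I made up. It also relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Tally IPv4-looking strings (optionally followed by :port) found in a log.
# Assumption: addresses are written as dotted quads; nothing else about the
# worker log layout is assumed.
tally_addresses() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]+)?' "$1" \
        | sort | uniq -c | sort -rn
}
```

Usage would be along the lines of:

$ tally_addresses /scratch2/scratchdirs/ketan/workerlogs/worker-0106-5710080-000000.log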

I wonder whether the worker logic could be amended to detect the
responsive address sooner.
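
To illustrate what I mean, here is a small sketch of probing candidate
addresses with a short per-attempt timeout. This is not the worker's actual
code (the worker is a Perl script); first_reachable is a hypothetical helper,
and it assumes bash (for /dev/tcp) and the coreutils timeout command are
available on the node.

```shell
#!/bin/bash
# first_reachable PORT TIMEOUT ADDR...
# Prints the first address that accepts a TCP connection on PORT within
# TIMEOUT seconds and returns 0; returns 1 if none of them does.
first_reachable() {
    local port=$1 limit=$2 addr
    shift 2
    for addr in "$@"; do
        # /dev/tcp/HOST/PORT is a bash redirection feature; timeout bounds
        # how long a single connection attempt may block.
        if timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
                "$addr" "$port" 2>/dev/null; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1
}
```

For example, with a port and candidate addresses taken from the Swift log
(the port changes from run to run):

$ first_reachable 52189 2 128.55.34.27 128.55.34.2

Bounding each attempt at a couple of seconds would keep the total wait
proportional to the number of addresses rather than to the system's default
connect timeout.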

Thanks,
Ketan


On Tue, Jan 6, 2015 at 12:36 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:

> Well, clearly 128.55.34.27 is not working. You need to find the correct
> one. I would suggest a combination of ifconfig and looking at worker
> logs without an internalHostname set to see what the last IP tried is.
>
> Mihael
>
> On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:
> > So, I tried with this line in sites file but the run crashes with
> > following error messages:
> >
> > Execution failed:
> > Exception in wrf:
> >     Arguments: []
> >     Host: edison2
> >     Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m
> > exception @ swift-int.k, line: 530
> > Caused by: Block task failed: 0106-1110110-000000 Block task ended prematurely
> > Application 9450632 exit codes: 101, 111
> > Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks ~425450, outblocks ~28500
> >
> >  + --------------------------------------------------------------------------
> >  +        Job name: B0106-1110110-0
> >  +          Job Id: 2247186.edique02
> >  +          System: edison
> >  +     Queued Time: Tue Jan  6 10:11:12 2015
> >  +      Start Time: Tue Jan  6 10:12:20 2015
> >  + Completion Time: Tue Jan  6 10:12:32 2015
> >  +            User: ketan
> >  +        MOM Host: nid02819
> >  +           Queue: debug
> >  +  Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00
> >  +  Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12
> >  +     Acct String: m1540
> >  +   PBS_O_WORKDIR: /global/u2/k/ketan/wrf
> >  +     Submit Args: /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit
> >  + --------------------------------------------------------------------------
> >
> >
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > ....
> > ....
> > <<many more such messages>>
> >
> > This is a different ip that I found from the previous run's logs in a
> > line like this (due to a different login host):
> >
> > 2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context: worker-6, boundTo: null] binding to TCPChannel [type: server, contact: 128.55.34.27:52189]
> >
> > The rundir is attached.
> >
> > --Ketan
> >
> > On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Ok, so I was looking for problems after the first batch of jobs, but
> > > there aren't any here.
> > >
> > > The 9 minute delay is because workers try all IP addresses that the
> > > head node has, and it may take a long time to time-out through all of
> > > them until a good one is found.
> > >
> > > You could force a specific IP address (in your case it's probably
> > > 128.55.34.2) using:
> > >
> > > <profile namespace="globus" key="internalHostname">128.55.34.2</profile>
> > >
> > > Mihael
> > >
> > > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:
> > > > Yes, this was a different run.
> > > >
> > > > Here is the run directory and worker log for a fresh run where I see
> > > > the job in running state for ~9 minutes before Swift status shows
> > > > the task active.
> > > >
> > > > Thanks,
> > > > Ketan
> > > >
> > > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > >
> > > > > Is this from the same run? I don't see delays between the jobs
> > > > > completing and the worker being shut down. Can you also post the
> > > > > swift log that corresponds to this run and confirm that you see the
> > > > > problem in this run?
> > > > >
> > > > > Mihael
> > > > >
> > > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:
> > > > > > Please find the workerlog attached.
> > > > > >
> > > > > > Thanks,
> > > > > > Ketan
> > > > > >
> > > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > >
> > > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:
> > > > > > > > Hi Mihael,
> > > > > > > >
> > > > > > > > It takes about 8-9 minutes after the worker starts (i.e. the
> > > > > > > > queue showing running status) before the Swift progress text
> > > > > > > > shows active status. In the active status, one wave of tasks
> > > > > > > > finishes and the status goes back to submit state, but now no
> > > > > > > > job shows up in the queue.
> > > > > > >
> > > > > > > I see the problem, but I'm not sure what causes it. Can you
> > > > > > > enable worker logging and send a worker log?
> > > > > > >
> > > > > > > Mihael
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Swift-user mailing list
> > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >
> > >
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >
>
>
>