[Swift-user] Edison: pbs job starts but Swift unresponsive

Tue Jan 6 12:56:01 CST 2015

This could be a more general solution:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1393

But it probably won't make it into 0.95.

Mihael

On Tue, 2015-01-06 at 10:36 -0800, Mihael Hategan wrote:
> Well, clearly 128.55.34.27 is not working. You need to find the correct
> one. I would suggest a combination of ifconfig and looking at worker
> logs without an internalHostname set to see what the last IP tried is.
> 
> Mihael
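
A rough sketch of the log-scanning step, in Python. It assumes the address appears in the "contact: IP:port" form seen in the MetaChannel log line quoted further down; actual worker logs may use a different format, so the pattern is an assumption to adjust. The names CONTACT_RE and last_contact are made up for illustration.

```python
import re

# Assumed pattern: "contact: <IPv4>:<port>", as in the quoted MetaChannel line.
# \s* tolerates a line wrap between "contact:" and the address.
CONTACT_RE = re.compile(r"contact:\s*(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")

def last_contact(log_text):
    """Return (ip, port) of the last 'contact:' entry in log_text, or None."""
    matches = CONTACT_RE.findall(log_text)
    if not matches:
        return None
    host, port = matches[-1]
    return host, int(port)
```

Reading a log file and passing its text to last_contact() gives one candidate address to compare against the ifconfig output on the login host.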
> 
> On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:
> > So, I tried this line in the sites file, but the run crashes with the following
> > error messages:
> > 
> > Execution failed:
> > Exception in wrf:
> >     Arguments: []
> >     Host: edison2
> >     Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m
> > exception @ swift-int.k, line: 530
> > Caused by: Block task failed: 0106-1110110-000000 Block task ended prematurely
> > Application 9450632 exit codes: 101, 111
> > Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks ~425450, outblocks ~28500
> > 
> >  + --------------------------------------------------------------------------
> >  +        Job name: B0106-1110110-0
> >  +          Job Id: 2247186.edique02
> >  +          System: edison
> >  +     Queued Time: Tue Jan  6 10:11:12 2015
> >  +      Start Time: Tue Jan  6 10:12:20 2015
> >  + Completion Time: Tue Jan  6 10:12:32 2015
> >  +            User: ketan
> >  +        MOM Host: nid02819
> >  +           Queue: debug
> >  +  Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00
> >  +  Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12
> >  +     Acct String: m1540
> >  +   PBS_O_WORKDIR: /global/u2/k/ketan/wrf
> >  +     Submit Args: /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit
> >  + --------------------------------------------------------------------------
> > 
> > 
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
> > ....
> > ....
> > <<many more such messages>>
> > 
> > This is a different IP (due to a different login host), which I found in the
> > previous run's logs in a line like this:
> > 
> > 2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context: worker-6, boundTo: null] binding to TCPChannel [type: server, contact: 128.55.34.27:52189]
> > 
> > The rundir is attached.
> > 
> > --Ketan
> > 
> > On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > 
> > > Ok, so I was looking for problems after the first batch of jobs, but
> > > there aren't any here.
> > >
> > > The 9 minute delay is because workers try all IP addresses that the head
> > > node has, and it may take a long time to time-out through all of them
> > > until a good one is found.
> > >
> > > You could force a specific IP address (in your case it's probably
> > > 128.55.34.2) using:
> > >
> > > <profile namespace="globus" key="internalHostname">128.55.34.2</profile>
> > >
> > > Mihael
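
To picture why trying every head-node address can take minutes, here is a minimal Python sketch of a sequential probe with a per-address timeout. It is not the coaster worker's actual code (that is a Perl script); first_reachable, candidates, and the timeout value are hypothetical names and settings chosen for illustration.

```python
import socket

def first_reachable(candidates, port, timeout=2.0):
    """Return the first address in candidates that accepts a TCP
    connection on port within timeout seconds, or None if none do."""
    for addr in candidates:
        try:
            # Each unreachable address can cost up to `timeout` seconds.
            with socket.create_connection((addr, port), timeout=timeout):
                return addr
        except OSError:
            # Covers timeouts, refused connections and unreachable networks.
            continue
    return None
```

With N unreachable addresses ahead of the good one, the wait can approach N times the timeout, which is why pinning a single address via internalHostname avoids the delay.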
> > >
> > > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:
> > > > Yes, this was a different run.
> > > >
> > > > > Here is the run directory and worker log for a fresh run where I see the job
> > > > > in the running state for ~9 minutes before Swift status shows the task as active.
> > > >
> > > > Thanks,
> > > > Ketan
> > > >
> > > > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > >
> > > > > > Is this from the same run? I don't see delays between the jobs
> > > > > > completing and the worker being shut down. Can you also post the swift
> > > > > > log that corresponds to this run and confirm that you see the problem in
> > > > > > this run?
> > > > >
> > > > > Mihael
> > > > >
> > > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:
> > > > > > Please find the workerlog attached.
> > > > > >
> > > > > > Thanks,
> > > > > > Ketan
> > > > > >
> > > > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > > >
> > > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:
> > > > > > > > Hi Mihael,
> > > > > > > >
> > > > > > > > > It takes about 8-9 minutes after the worker starts (i.e., the queue
> > > > > > > > > shows running status) before the Swift progress text shows active
> > > > > > > > > status. In the active status, one wave of tasks finishes and the status
> > > > > > > > > goes back to the submit state, but now no job shows up in the queue.
> > > > > > >
> > > > > > > I see the problem, but I'm not sure what causes it. Can you enable
> > > > > > > worker logging and send a worker log?
> > > > > > >
> > > > > > > Mihael
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Swift-user mailing list
> > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >
> > >
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user