[Swift-user] Edison: pbs job starts but Swift unresponsive

Tue Jan 6 12:36:05 CST 2015

Well, clearly 128.55.34.27 is not working. You need to find the correct
one. I would suggest combining the output of ifconfig with the worker
logs from a run without internalHostname set, to see which IP was tried
last.
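
As a rough aid for the log-scanning step, here is a minimal Python sketch
that pulls "contact: IP:port" pairs out of a log and reports the last one.
It assumes the entries look like the MetaChannel line quoted further down
in this thread; the actual worker log format may differ, and the function
names (contact_addresses, last_contact) are made up for illustration.

```python
import re

# Assumption: addresses appear as "contact: <IPv4>:<port>", as in the
# MetaChannel line quoted in this thread. \s* also tolerates a line wrap
# between "contact:" and the address.
CONTACT_RE = re.compile(r"contact:\s*(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def contact_addresses(log_text):
    """Return (ip, port) pairs in the order they appear in log_text."""
    return [(ip, int(port)) for ip, port in CONTACT_RE.findall(log_text)]


def last_contact(path):
    """Return the last (ip, port) pair found in the file, or None."""
    with open(path, errors="replace") as fh:
        pairs = contact_addresses(fh.read())
    return pairs[-1] if pairs else None
```

Calling last_contact() on a log file gives one candidate address to compare
against what ifconfig reports on the login node.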

Mihael

On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:
> So, I tried with this line in the sites file, but the run crashes with the
> following error messages:
> 
> Execution failed:
> Exception in wrf:
>     Arguments: []
>     Host: edison2
>     Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m
> exception @ swift-int.k, line: 530
> Caused by: Block task failed: 0106-1110110-000000 Block task ended
> prematurely
> Application 9450632 exit codes: 101, 111
> Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks
> ~425450, outblocks ~28500
> 
>  +
> --------------------------------------------------------------------------
>  +        Job name: B0106-1110110-0
>  +          Job Id: 2247186.edique02
>  +          System: edison
>  +     Queued Time: Tue Jan  6 10:11:12 2015
>  +      Start Time: Tue Jan  6 10:12:20 2015
>  + Completion Time: Tue Jan  6 10:12:32 2015
>  +            User: ketan
>  +        MOM Host: nid02819
>  +           Queue: debug
>  +  Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00
>  +  Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12
>  +     Acct String: m1540
>  +   PBS_O_WORKDIR: /global/u2/k/ketan/wrf
>  +     Submit Args:
> /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit
>  +
> --------------------------------------------------------------------------
> 
> 
> Failed to connect: Network is unreachable at
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line
> 1101.
> Failed to connect: Network is unreachable at
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line
> 1101.
> Failed to connect: Network is unreachable at
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line
> 1101.
> ....
> ....
> <<many more such messages>>
> 
> This is a different IP, which I found in the previous run's logs in a line
> like this (it differs because of a different login host):
> 
> 2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context:
> worker-6, boundTo: null] binding to TCPChannel [type: server, contact:
> 128.55.34.27:52189]
> 
> The rundir is attached.
> 
> --Ketan
> 
> On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Ok, so I was looking for problems after the first batch of jobs, but
> > there aren't any here.
> >
> > The 9 minute delay is because workers try all IP addresses that the head
> > node has, and it may take a long time to time-out through all of them
> > until a good one is found.
> >
> > You could force a specific IP address (in your case it's probably
> > 128.55.34.2) using:
> >
> > <profile namespace="globus" key="internalHostname">128.55.34.2</profile>
> >
> > Mihael
> >
> > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:
> > > Yes, this was a different run.
> > >
> > > Here is the run directory and worker log for a fresh run where I see the
> > > job in running state for ~9 minutes before Swift status shows the task
> > > active.
> > >
> > > Thanks,
> > > Ketan
> > >
> > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >
> > > > Is this from the same run? I don't see delays between the jobs
> > > > completing and the worker being shut down. Can you also post the swift
> > > > log that corresponds to this run and confirm that you see the problem
> > > > in this run?
> > > >
> > > > Mihael
> > > >
> > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:
> > > > > Please find the workerlog attached.
> > > > >
> > > > > Thanks,
> > > > > Ketan
> > > > >
> > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > > >
> > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:
> > > > > > > Hi Mihael,
> > > > > > >
> > > > > > > It takes about 8-9 minutes after the worker starts (i.e., the
> > > > > > > queue shows running status) before the Swift progress text shows
> > > > > > > active status. In the active status, one wave of tasks finishes
> > > > > > > and the status goes back to the submit state, but now no job
> > > > > > > shows up in the queue.
> > > > > >
> > > > > > I see the problem, but I'm not sure what causes it. Can you enable
> > > > > > worker logging and send a worker log?
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > Swift-user mailing list
> > > > > > Swift-user at ci.uchicago.edu
> > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > >