[Swift-user] Edison: pbs job starts but Swift unresponsive

Ketan Maheshwari ketan at mcs.anl.gov
Tue Jan 6 12:21:26 CST 2015


So, I tried this line in the sites file, but the run crashes with the
following error messages:

Execution failed:
Exception in wrf:
    Arguments: []
    Host: edison2
    Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m
exception @ swift-int.k, line: 530
Caused by: Block task failed: 0106-1110110-000000 Block task ended prematurely
Application 9450632 exit codes: 101, 111
Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks ~425450, outblocks ~28500

 + --------------------------------------------------------------------------
 +        Job name: B0106-1110110-0
 +          Job Id: 2247186.edique02
 +          System: edison
 +     Queued Time: Tue Jan  6 10:11:12 2015
 +      Start Time: Tue Jan  6 10:12:20 2015
 + Completion Time: Tue Jan  6 10:12:32 2015
 +            User: ketan
 +        MOM Host: nid02819
 +           Queue: debug
 +  Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00
 +  Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12
 +     Acct String: m1540
 +   PBS_O_WORKDIR: /global/u2/k/ketan/wrf
 +     Submit Args: /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit
 + --------------------------------------------------------------------------


Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
Failed to connect: Network is unreachable at /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line 1101.
....
....
<<many more such messages>>
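
For anyone reproducing this, the "Failed to connect" messages can be checked by hand from a compute node. The sketch below is only an illustration in Python, not the worker script's actual Perl logic. The helper names (can_connect, first_reachable), the candidate list and the 5-second timeout are assumptions made for this example.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Network is unreachable", connection refused, and timeouts.
        return False

def first_reachable(candidates, timeout=5.0):
    """Try each (host, port) in order; return the first that accepts, else None."""
    for host, port in candidates:
        if can_connect(host, port, timeout):
            return (host, port)
    return None
```

With N candidate addresses, the worst case takes roughly N times the timeout before a usable address is found, which would be consistent with the multi-minute startup delay Mihael describes in the quoted message below.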

The IP I used is a different one (due to a different login host). I found it
in the previous run's logs in a line like this:

2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context: worker-6, boundTo: null] binding to TCPChannel [type: server, contact: 128.55.34.27:52189]
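
In case it helps others find the address for their own login host, here is a rough Python sketch that pulls the "contact:" address out of log lines shaped like the one above. The regular expression is based only on this single sample line, and the function name find_contacts is made up for illustration. Other Swift versions may format the line differently.

```python
import re

# Matches the "contact: <ip>:<port>" part of a "binding to TCPChannel" line.
# \s* also tolerates a line break between "contact:" and the address.
CONTACT_RE = re.compile(r"contact:\s*(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")

def find_contacts(log_text):
    """Return (ip, port) pairs in order of first appearance, without duplicates."""
    seen = []
    for ip, port in CONTACT_RE.findall(log_text):
        pair = (ip, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen

# Sample taken from the log line quoted in this message.
sample = (
    "2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context: "
    "worker-6, boundTo: null] binding to TCPChannel [type: server, contact: "
    "128.55.34.27:52189]"
)
print(find_contacts(sample))  # [('128.55.34.27', 52189)]
```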

The rundir is attached.

--Ketan

On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Ok, so I was looking for problems after the first batch of jobs, but
> there aren't any here.
>
> The 9 minute delay is because workers try all IP addresses that the head
> node has, and it may take a long time to time-out through all of them
> until a good one is found.
>
> You could force a specific IP address (in your case it's probably
> 128.55.34.2) using:
>
> <profile namespace="globus" key="internalHostname">128.55.34.2</profile>
>
> Mihael
>
> On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:
> > Yes, this was a different run.
> >
> > Here is the run directory and worker log for a fresh run where I see the
> > job in the running state for ~9 minutes before Swift status shows the task
> > as active.
> >
> > Thanks,
> > Ketan
> >
> > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Is this from the same run? I don't see delays between the jobs
> > > completing and the worker being shut down. Can you also post the swift
> > > log that corresponds to this run and confirm that you see the problem
> > > in this run?
> > >
> > > Mihael
> > >
> > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:
> > > > Please find the workerlog attached.
> > > >
> > > > Thanks,
> > > > Ketan
> > > >
> > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > >
> > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:
> > > > > > Hi Mihael,
> > > > > >
> > > > > > It takes about 8-9 minutes after the worker starts (i.e., the queue
> > > > > > shows running status) before the Swift progress text shows active
> > > > > > status. In the active status, one wave of tasks finishes and the
> > > > > > status goes back to the submit state, but now no job shows up in the
> > > > > > queue.
> > > > >
> > > > > I see the problem, but I'm not sure what causes it. Can you enable
> > > > > worker logging and send a worker log?
> > > > >
> > > > > Mihael
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run006.tgz
Type: application/x-gzip
Size: 34828 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20150106/26be30ab/attachment.bin>

