<div dir="ltr">Thanks. Indeed, I picked the address from the Swift log, thinking it was the one the worker was able to connect to, but maybe that was not the case.<div><br></div><div>This time I picked the one that, according to the worker log, seemed to have connected, and that worked.</div><div><br></div><div>I found from the worker log that the worker tries to connect far too many times, e.g.:</div><div><br></div><div><div>$ grep 'Trying other addresses' /scratch2/scratchdirs/ketan/workerlogs/worker-0106-5710080-000000.log | wc -l</div><div>6334</div></div><div><br></div><div>I wonder whether the worker logic could be amended to detect the responsive address sooner.</div><div><br></div><div>Thanks,</div><div>Ketan</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 6, 2015 at 12:36 PM, Hategan-Marandiuc, Philip M. <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Well, clearly 128.55.34.27 is not working. You need to find the correct<br>

one. I would suggest a combination of ifconfig and looking at worker<br>

logs without an internalHostname set to see what the last IP tried is.<br>

<br>

Mihael<br>

<div class="HOEnZb"><div class="h5"><br>

On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:<br>

> So, I tried with this line in the sites file, but the run crashes with the following<br>

> error messages:<br>

><br>

> Execution failed:<br>

> Exception in wrf:<br>

>     Arguments: []<br>

>     Host: edison2<br>

>     Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m<br>

> exception @ swift-int.k, line: 530<br>

> Caused by: Block task failed: 0106-1110110-000000 Block task ended<br>

> prematurely<br>

> Application 9450632 exit codes: 101, 111<br>

> Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks<br>

> ~425450, outblocks ~28500<br>

><br>

>  +<br>

> --------------------------------------------------------------------------<br>

>  +        Job name: B0106-1110110-0<br>

>  +          Job Id: 2247186.edique02<br>

>  +          System: edison<br>

>  +     Queued Time: Tue Jan  6 10:11:12 2015<br>

>  +      Start Time: Tue Jan  6 10:12:20 2015<br>

>  + Completion Time: Tue Jan  6 10:12:32 2015<br>

>  +            User: ketan<br>

>  +        MOM Host: nid02819<br>

>  +           Queue: debug<br>

>  +  Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00<br>

>  +  Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12<br>

>  +     Acct String: m1540<br>

>  +   PBS_O_WORKDIR: /global/u2/k/ketan/wrf<br>

>  +     Submit Args:<br>

> /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit<br>

>  +<br>

> --------------------------------------------------------------------------<br>

><br>

><br>

> Failed to connect: Network is unreachable at<br>

> /global/homes/k/ketan/.globus/coasters/<a href="http://cscript3816651147061795773.pl" target="_blank">cscript3816651147061795773.pl</a> line<br>

> 1101.<br>

> Failed to connect: Network is unreachable at<br>

> /global/homes/k/ketan/.globus/coasters/<a href="http://cscript3816651147061795773.pl" target="_blank">cscript3816651147061795773.pl</a> line<br>

> 1101.<br>

> Failed to connect: Network is unreachable at<br>

> /global/homes/k/ketan/.globus/coasters/<a href="http://cscript3816651147061795773.pl" target="_blank">cscript3816651147061795773.pl</a> line<br>

> 1101.<br>

> ....<br>

> ....<br>

> <<many more such messages>><br>

><br>

> This is a different ip that I found from the previous run's logs in a line<br>

> like this (due to a different login host) :<br>

><br>

> 2015-01-06 10:01:27,985-0800 INFO  MetaChannel MetaChannel [context:<br>

> worker-6, boundTo: null] binding to TCPChannel [type: server, contact:<br>

> <a href="http://128.55.34.27:52189" target="_blank">128.55.34.27:52189</a>]<br>

><br>

> The rundir is attached.<br>

><br>

> --Ketan<br>

><br>

> On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>> wrote:<br>

><br>

> > Ok, so I was looking for problems after the first batch of jobs, but<br>

> > there aren't any here.<br>

> ><br>

> > The 9 minute delay is because workers try all IP addresses that the head<br>

> > node has, and it may take a long time to time-out through all of them<br>

> > until a good one is found.<br>

> ><br>

> > You could force a specific IP address (in your case it's probably<br>

> > 128.55.34.2) using:<br>

> ><br>

> > <profile namespace="globus" key="internalHostname">128.55.34.2</profile><br>

> ><br>

> > Mihael<br>

> ><br>

> > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:<br>

> > > Yes, this was a different run.<br>

> > ><br>

> > > Here is the run directory and worker log for a fresh run where I see the job<br>

> > > in the running state for ~9 minutes before the Swift status shows the task active.<br>

> > ><br>

> > > Thanks,<br>

> > > Ketan<br>

> > ><br>

> > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>

> > wrote:<br>

> > ><br>

> > > > Is this from the same run? I don't see delays between the jobs<br>

> > > > completing and the worker being shut down. Can you also post the swift<br>

> > > > log that corresponds to this run and confirm that you see the problem<br>

> > > > in this run?<br>

> > > ><br>

> > > > Mihael<br>

> > > ><br>

> > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:<br>

> > > > > Please find the workerlog attached.<br>

> > > > ><br>

> > > > > Thanks,<br>

> > > > > Ketan<br>

> > > > ><br>

> > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>

> > > > wrote:<br>

> > > > ><br>

> > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:<br>

> > > > > > > Hi Mihael,<br>

> > > > > > ><br>

> > > > > > > It takes about 8-9 minutes after the worker starts (i.e. the queue shows<br>

> > > > > > > running status) before the Swift progress text shows active status. In the<br>

> > > > > > > active state, one wave of tasks finishes and the status goes back to the<br>

> > > > > > > submit state, but now no job shows up in the queue.<br>

> > > > > ><br>

> > > > > > I see the problem, but I'm not sure what causes it. Can you enable<br>

> > > > > > worker logging and send a worker log?<br>

> > > > > ><br>

> > > > > > Mihael<br>

> > > > > ><br>

> > > > > ><br>

> > > > > > _______________________________________________<br>

> > > > > > Swift-user mailing list<br>

> > > > > > <a href="mailto:Swift-user@ci.uchicago.edu">Swift-user@ci.uchicago.edu</a><br>

> > > > > > <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a><br>

> > > > > ><br>

> > > ><br>

> > > ><br>

> > > ><br>

> ><br>

> ><br>

> ><br>

<br>

<br>

</div></div></blockquote></div><br></div>
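<div>On the question at the top of this message about detecting the responsive address sooner: one possible approach is to probe all candidate addresses in parallel with a short timeout, instead of timing out on them one after another. The Python sketch below only illustrates that idea. It is not the coaster worker's actual logic (the worker is a Perl script), and the names <code>try_connect</code> and <code>first_responsive</code> and the default timeout are assumptions made up for this example.</div>

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def try_connect(host, port, timeout):
    """Return a connected socket, or None if the attempt fails or times out."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return None


def first_responsive(hosts, port, timeout=2.0):
    """Probe every host at once; return (host, socket) for the first success.

    Returns None when no host accepts the connection. The total wait is
    bounded by roughly one timeout, not one timeout per address.
    """
    if not hosts:
        return None
    pool = ThreadPoolExecutor(max_workers=len(hosts))
    futures = {pool.submit(try_connect, h, port, timeout): h for h in hosts}
    winner = None
    for fut in as_completed(futures):
        sock = fut.result()
        if sock is None:
            continue
        if winner is None:
            winner = (futures[fut], sock)  # keep the first address that answered
        else:
            sock.close()  # a later success is not needed
    pool.shutdown()
    return winner
```

<div>With the head node's addresses collected in a list, a call such as <code>first_responsive(candidate_ips, port)</code> would give up on unreachable addresses after a single timeout period overall, rather than accumulating a delay for each bad address.</div>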