<div dir="ltr">Thanks. Indeed, I picked the address from the Swift log, thinking it was the one the worker was able to connect to, but maybe that was not the case.<div><br></div><div>This time I picked the address that, according to the worker log, seemed to have connected, and that worked.</div><div><br></div><div>The worker log shows that the worker makes a very large number of connection attempts, e.g.:</div><div><br></div><div><div>$ grep 'Trying other addresses' /scratch2/scratchdirs/ketan/workerlogs/worker-0106-5710080-000000.log | wc -l</div><div>6334</div></div><div><br></div><div>I wonder whether the worker logic could be amended to detect the responsive address sooner.</div><div><br></div><div>Thanks,</div><div>Ketan </div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 6, 2015 at 12:36 PM, Hategan-Marandiuc, Philip M. <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Well, clearly 128.55.34.27 is not working. You need to find the correct<br>
one. I would suggest a combination of ifconfig and looking at worker<br>
logs without an internalHostname set to see what the last IP tried is.<br>
<br>
Mihael<br>
<div class="HOEnZb"><div class="h5"><br>
On Tue, 2015-01-06 at 12:21 -0600, Ketan Maheshwari wrote:<br>
> So, I tried with this line in the sites file, but the run crashes with the<br>
> following error messages:<br>
><br>
> Execution failed:<br>
> Exception in wrf:<br>
> Arguments: []<br>
> Host: edison2<br>
> Directory: wf.edison-run006/jobs/i/wrf-iuas5r2m<br>
> exception @ swift-int.k, line: 530<br>
> Caused by: Block task failed: 0106-1110110-000000 Block task ended<br>
> prematurely<br>
> Application 9450632 exit codes: 101, 111<br>
> Application 9450632 resources: utime ~25s, stime ~30s, Rss ~8260, inblocks<br>
> ~425450, outblocks ~28500<br>
><br>
> +<br>
> --------------------------------------------------------------------------<br>
> + Job name: B0106-1110110-0<br>
> + Job Id: 2247186.edique02<br>
> + System: edison<br>
> + Queued Time: Tue Jan 6 10:11:12 2015<br>
> + Start Time: Tue Jan 6 10:12:20 2015<br>
> + Completion Time: Tue Jan 6 10:12:32 2015<br>
> + User: ketan<br>
> + MOM Host: nid02819<br>
> + Queue: debug<br>
> + Req. Resources: mppnodect=25,mppnppn=24,mppwidth=600,walltime=00:29:00<br>
> + Used Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:12<br>
> + Acct String: m1540<br>
> + PBS_O_WORKDIR: /global/u2/k/ketan/wrf<br>
> + Submit Args:<br>
> /global/u2/k/ketan/wrf/run006/scripts/PBS4838165627827831510.submit<br>
> +<br>
> --------------------------------------------------------------------------<br>
><br>
><br>
> Failed to connect: Network is unreachable at<br>
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line<br>
> 1101.<br>
> Failed to connect: Network is unreachable at<br>
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line<br>
> 1101.<br>
> Failed to connect: Network is unreachable at<br>
> /global/homes/k/ketan/.globus/coasters/cscript3816651147061795773.pl line<br>
> 1101.<br>
> ....<br>
> ....<br>
> <<many more such messages>><br>
><br>
> This is a different IP, which I found in the previous run's logs in a line<br>
> like this (due to a different login host):<br>
><br>
> 2015-01-06 10:01:27,985-0800 INFO MetaChannel MetaChannel [context:<br>
> worker-6, boundTo: null] binding to TCPChannel [type: server, contact:<br>
> <a href="http://128.55.34.27:52189" target="_blank">128.55.34.27:52189</a>]<br>
><br>
> The rundir is attached.<br>
><br>
> --Ketan<br>
><br>
> On Sat, Jan 3, 2015 at 6:09 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>> wrote:<br>
><br>
> > Ok, so I was looking for problems after the first batch of jobs, but<br>
> > there aren't any here.<br>
> ><br>
> > The 9 minute delay is because workers try all IP addresses that the head<br>
> > node has, and it may take a long time to time-out through all of them<br>
> > until a good one is found.<br>
> ><br>
> > You could force a specific IP address (in your case it's probably<br>
> > 128.55.34.2) using:<br>
> ><br>
> > <profile namespace="globus" key="internalHostname">128.55.34.2</profile><br>
> ><br>
> > Mihael<br>
> ><br>
> > On Sat, 2015-01-03 at 16:29 -0600, Ketan Maheshwari wrote:<br>
> > > Yes, this was a different run.<br>
> > ><br>
> > > Here is the run directory and worker log for a fresh run where I see the<br>
> > > job in the running state for ~9 minutes before Swift status shows the<br>
> > > task as active.<br>
> > ><br>
> > > Thanks,<br>
> > > Ketan<br>
> > ><br>
> > > On Sat, Jan 3, 2015 at 3:53 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > wrote:<br>
> > ><br>
> > > > Is this from the same run? I don't see delays between the jobs<br>
> > > > completing and the worker being shut down. Can you also post the swift<br>
> > > > log that corresponds to this run and confirm that you see the problem<br>
> > > > in this run?<br>
> > > ><br>
> > > > Mihael<br>
> > > ><br>
> > > > On Sat, 2015-01-03 at 15:22 -0600, Ketan Maheshwari wrote:<br>
> > > > > Please find the workerlog attached.<br>
> > > > ><br>
> > > > > Thanks,<br>
> > > > > Ketan<br>
> > > > ><br>
> > > > > On Sat, Jan 3, 2015 at 2:39 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> > > > wrote:<br>
> > > > ><br>
> > > > > > On Tue, 2014-12-30 at 12:49 -0600, Ketan Maheshwari wrote:<br>
> > > > > > > Hi Mihael,<br>
> > > > > > ><br>
> > > > > > > It takes about 8-9 minutes after the worker starts (i.e. the queue<br>
> > > > > > > shows running status) before the Swift progress text shows active<br>
> > > > > > > status. In the active status, one wave of tasks finishes and the<br>
> > > > > > > status goes back to the submit state, but now no job shows up in<br>
> > > > > > > the queue.<br>
> > > > > ><br>
> > > > > > I see the problem, but I'm not sure what causes it. Can you enable<br>
> > > > > > worker logging and send a worker log?<br>
> > > > > ><br>
> > > > > > Mihael<br>
> > > > > ><br>
> > > > > ><br>
> > > > > > _______________________________________________<br>
> > > > > > Swift-user mailing list<br>
> > > > > > <a href="mailto:Swift-user@ci.uchicago.edu">Swift-user@ci.uchicago.edu</a><br>
> > > > > > <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a><br>
> > > > > ><br>
<br>
<br>
</div></div></blockquote></div><br></div>
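<div>To illustrate the idea of detecting the responsive address sooner, here is a minimal sketch in Python. It is not the coaster worker code (the worker in this thread is a Perl script); the function name <code>first_responsive</code>, the timeout value, and the candidate list are all hypothetical. The assumption is that probing every candidate address concurrently with a short timeout could bound the wait to roughly one timeout, instead of timing out through the addresses one after another.</div>

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_responsive(candidates, timeout=2.0):
    """Return the first (host, port) that accepts a TCP connection, or None.

    All candidates are probed concurrently, so the total wait is bounded by
    roughly one timeout rather than the sum of the per-address timeouts.
    """
    def probe(addr):
        # create_connection raises OSError if the connection is refused,
        # the network is unreachable, or the attempt times out.
        with socket.create_connection(addr, timeout=timeout):
            return addr

    if not candidates:
        return None

    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = [pool.submit(probe, addr) for addr in candidates]
        for fut in as_completed(futures):
            try:
                return fut.result()
            except OSError:
                continue  # this address did not answer; wait for the others
    return None
```

<div>A call would look something like <code>first_responsive([("128.55.34.2", port), ("128.55.34.27", port)])</code>, where <code>port</code> stands for whatever port the service is listening on. The address returned could then be reused for later connections rather than retrying the whole list.</div>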