Mihael: thanks, I appreciate it, sorry to bug you.<br><br>Mike: I was hitting this problem on PADS (the thread was originally about a similar problem on Beagle).  I haven't made any progress debugging the issue on Beagle, beyond coming up with a minimal example to replicate it.  I managed to pare down the example even more: it deadlocks if the pthread library is linked dynamically, even if no pthreads functions are actually used. I.e., the deadlock happens at the time the shared library is loaded.  I tried several workarounds without success.  I'm pretty much out of ideas on how to make progress on this, so getting Cray in on the problem might be best at this point.<br>
<br>- Tim<br><br><br><div class="gmail_quote">On Thu, Jun 2, 2011 at 3:39 PM, Mihael Hategan <span dir="ltr"><<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Yes. Sorry about the delay. The word is that I need to backport the<br>
patch from trunk to 0.92 and then make a patch release. I was waiting<br>
for word from other folks, and I got it yesterday. I will be doing<br>
this as soon as I have some time, which is probably somewhere between<br>
today and next Tuesday.<br>
<font color="#888888"><br>
Mihael<br>
</font><div><div></div><div class="h5"><br>
On Thu, 2011-06-02 at 15:24 -0500, Tim Armstrong wrote:<br>
> Any word on this bug?  I have a nice use-case for SwiftR where it<br>
> would be very handy to take advantage of Swift's dynamic resource<br>
> procurement.<br>
><br>
> - Tim<br>
><br>
> On Thu, May 26, 2011 at 3:41 PM, Mihael Hategan <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
> wrote:<br>
>         Given that this has now been reported a number of times, it<br>
>         may make<br>
>         sense to backport the fix from trunk and make a patch release<br>
>         for 0.92.<br>
><br>
>         Objections?<br>
><br>
><br>
>         On Thu, 2011-05-26 at 14:59 -0500, Tim Armstrong wrote:<br>
>         > Hi,<br>
>         >   I've encountered this issue with SwiftR, running release<br>
>         0.92 from<br>
>         > the svn repository.  The issue occurs when<br>
>         > GLOBUS::maxWallTime="03:55:00" in tc and maxTime is 4 hours<br>
>         in<br>
>         > sites.xml.  After 5 minutes (or whatever the difference is<br>
>         between the<br>
>         > two times), I get the exception copied below.  A tarball is<br>
>         attached<br>
>         > with the logs, script, etc.  replicate.sh shows how to<br>
>         replicate the<br>
>         > issue on PADS.<br>
>         ><br>
>         > Assuming that my problem is the same as the others, it would<br>
>         be good<br>
>         > if the fix could be merged to release 0.92, as I'm trying to<br>
>         bundle<br>
>         > stable swift releases with SwiftR.<br>
>         ><br>
>         > - Tim<br>
>         ><br>
>         ><br>
>         > Swift svn swift-r4336 cog-r3096 (cog modified locally)<br>
>         ><br>
>         > RunID: 20110526-1317-2c8ybi10<br>
>         > Progress:<br>
>         > SwiftScript trace: top of loop: rserver waiting for input<br>
>         > on, /tmp/nbest/SwiftR/swift.0827/requestpipe<br>
>         > Progress:  Active:1<br>
>         > Progress:  Finished successfully:1<br>
>         > SwiftScript trace: rserver: got<br>
>         > dir, /tmp/nbest/SwiftR/requests.P09626/R0000007<br>
>         > Progress:  uninitialized:1  Finished successfully:1<br>
>         > Progress:  Submitted:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > Progress:  Active:1  Finished successfully:1<br>
>         > queuedsize > 0 but no job dequeued. Queued: {}<br>
>         > java.lang.Throwable<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)<br>
>         > queuedsize > 0 but no job dequeued. Queued: {}<br>
>         > java.lang.Throwable<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)<br>
>         >         at<br>
>         ><br>
>         org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)<br>
>         > Progress:  Finished successfully:1 Failed but can retry:1<br>
>         ><br>
>         ><br>
>         > On Sun, May 22, 2011 at 1:51 PM, Mihael Hategan<br>
>         <<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>><br>
>         > wrote:<br>
>         >         The second one looks to me like a coaster problem.<br>
>         Can't say<br>
>         >         much about<br>
>         >         the first issue.<br>
>         ><br>
>         >         Can you try with plain pbs if you want to test the<br>
>         pbs<br>
>         >         provider?<br>
>         ><br>
>         >         Mihael<br>
>         ><br>
>         ><br>
>         >         On Sun, 2011-05-22 at 08:39 -0500, ketan wrote:<br>
>         >         > I can confirm that the trunk is not usable for pbs<br>
>         provider.<br>
>         >         I am using<br>
>         >         > trunk for submitting jobs on beagle and I see a<br>
>         few<br>
>         >         unexpected things:<br>
>         >         ><br>
>         >         > 1. The stderr is showing inconsistent messages:<br>
>         The results<br>
>         >         are getting<br>
>         >         > written to the output even though stderr doesn't<br>
>         report any.<br>
>         >         > 2. qsub jobs being cancelled inadvertently: I<br>
>         submitted 40<br>
>         >         of them<br>
>         >         > yesterday, however, only 2 survived today. The log<br>
>         is here:<br>
>         >         ><br>
>         >         ><br>
>         ><br>
>         <a href="http://www.ci.uchicago.edu/%7Eketan/files/ftdock-20110521-0337-pokpgg89.log" target="_blank">http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-0337-pokpgg89.log</a><br>
>         >         ><br>
>         >         > In addition, the ssh-pbs provider does not seem to<br>
>         be<br>
>         >         working for large<br>
>         >         > runs (it worked for a small number of test runs):<br>
>         Getting<br>
>         >         unexpected<br>
>         >         > stdouts. Following is the stdout:<br>
>         >         ><br>
>         >         ><br>
>         <a href="http://www.ci.uchicago.edu/%7Eketan/files/ssh-pbs.stdout" target="_blank">http://www.ci.uchicago.edu/~ketan/files/ssh-pbs.stdout</a><br>
>         >         ><br>
>         >         > Following is the log file for the above run:<br>
>         >         ><br>
>         >         ><br>
>         ><br>
>         <a href="http://www.ci.uchicago.edu/%7Eketan/files/ftdock-20110521-1750-b0cot9sa.log" target="_blank">http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-1750-b0cot9sa.log</a><br>
>         >         ><br>
>         >         ><br>
>         >         > Ketan<br>
>         >         ><br>
>         >         > On 5/21/11 5:12 PM, Michael Wilde wrote:<br>
>         >         > ><br>
>         >         > > ----- Original Message -----<br>
>         >         > >> On Sat, 2011-05-21 at 17:06 -0400, Glen Hocky<br>
>         wrote:<br>
>         >         > >>> as I mentioned, I've been running with Mike's<br>
>         swift<br>
>         >         which was<br>
>         >         > >>> patched<br>
>         >         > >>> for beagle. are all the things that make<br>
>         running on<br>
>         >         beagle work in<br>
>         >         > >>> trunk?<br>
>         >         > >> No idea.<br>
>         >         > >><br>
>         >         > >> Mike?<br>
>         >         > > Justin, working with Ketan, just applied changes<br>
>         to trunk<br>
>         >         which should make it work now on Beagle (or any Cray<br>
>         XT5+ or<br>
>         >         XE).  This uses a different set of sites.xml tags<br>
>         than the<br>
>         >         prototype in the current Beagle swift 0.92.1 module.<br>
>         Justin<br>
>         >         has a note on this at:<br>
>         >         > ><br>
>          <a href="https://sites.google.com/site/swiftdevel/sites/pbs/cray" target="_blank">https://sites.google.com/site/swiftdevel/sites/pbs/cray</a><br>
>         >         > ><br>
>         >         > > It was working before for one-node worker jobs;<br>
>         now it<br>
>         >         should work for multi-node worker jobs as well.<br>
>         >         > ><br>
>         >         > > Justin and Ketan should comment on the state of<br>
>         testing<br>
>         >         and readiness of this trunk feature.  Don't try<br>
>         trunk on<br>
>         >         Beagle till they give the go-ahead.<br>
>         >         > ><br>
>         >         > > - Mike<br>
>         >         > ><br>
>         >         > >>>   If so i'll update to the latest and test. I<br>
>         don't<br>
>         >         think I'm<br>
>         >         > >>> using stable...<br>
>         >         > >> Ok<br>
>         >         > >><br>
>         >         > >> Mihael<br>
>         >         > _______________________________________________<br>
>         >         > Swift-devel mailing list<br>
>         >         > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
>         >         ><br>
>         <a href="http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel" target="_blank">http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel</a><br>
>         ><br>
>         ><br>
>         ><br>
>         ><br>
><br>
><br>
><br>
><br>
<br>
<br>
</div></div></blockquote></div><br>