Mike, all,

I've done a few more experiments and realized that while I had added the
-XX:-UseLargePages command line argument to the main Java invocation, each app
also included a post-process Java app to which I hadn't added the
-XX:-UseLargePages argument. Perhaps because that step runs at the end of an
app, the warnings I'd previously seen before the crashes weren't making it to
stdout this time. Purely speculation.

In any case, I put tasksPerWorker back to 16 and was able to successfully run
all the jobs. One bit of consolation in all this is that if there are any
future "mystery" worker failures, this will be one issue to check.

Thank you all for helping with tracking this down.

Jonathan
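For reference, a sketch of the kind of change involved. The script layout, jar,
and class names below are made up for illustration, not the actual apps; the
only real piece is the -XX:-UseLargePages flag itself:

  # Hypothetical launch script: make sure every JVM in the pipeline gets the
  # flag, not just the main invocation.
  JVM_OPTS="-XX:-UseLargePages -Xmx1536m"

  # main model run
  java $JVM_OPTS -cp model.jar ModelMain "$@"

  # post-process step (the invocation that had been missing the flag)
  java $JVM_OPTS -cp model.jar PostProcess output/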
  
    <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
  
  <div text="#000000" bgcolor="#FFFFFF">
It's odd that no errors like OOM or OOT would be logged to stdout of
the PBS job.

Jonathan, can you check with the Blues sysadmins whether they have any
other error logs (e.g. dmesg/syslogs) for your jobs, or for the nodes
on which they ran?

Thanks,

- Mike
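If node access is available, a rough way to check for OOM-killer activity
directly (a sketch; it assumes the kernel log and syslog are readable on the
compute nodes, which may well require the sysadmins anyway):

  # Look for OOM-killer messages around the time the workers died
  dmesg | grep -iE "out of memory|oom-killer|killed process"

  # Same search in the persistent logs, if present and readable
  grep -iE "oom-killer|killed process" /var/log/messages /var/log/syslog 2>/dev/null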
    <div class="moz-cite-prefix">On 7/31/14, 1:55 PM, David Kelly wrote:<br>
    </div>
    <blockquote cite="mid:CA+_+Ey-QVO_wsw0V-ViFnb9kiYPs7Va6rhZvrWtebgajtZoNdA@mail.gmail.com" type="cite">
      <meta http-equiv="Content-Type" content="text/html;
        charset=ISO-8859-1">
      <div dir="ltr">Is each invocation of the Java app creating a large
        number of threads? I ran into an issue like that (on another
        cluster) where I was hitting the maximum number of processes per
        node, and the scheduler ended up killing my jobs.</div>
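One quick way to test that hypothesis (a sketch; run it from inside a job on a
compute node, since the limits there can differ from the login nodes):

  # Per-user process limit in effect on the node (threads count against it)
  ulimit -u

  # Rough counts of what the user currently owns on that node
  ps -u "$USER" --no-headers | wc -l      # processes
  ps -u "$USER" -L --no-headers | wc -l   # threads (one line per LWP)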
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Thu, Jul 31, 2014 at 1:42 PM,
          Jonathan Ozik <span dir="ltr"><<a moz-do-not-send="true" href="mailto:jozik@uchicago.edu" target="_blank">jozik@uchicago.edu</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto;">Okay, I’ve
            launched a new job, with tasksPerWorker=8. This is running
            on the sSerial queue rather than the shared queue for the
            other runs. Just wanted to note that for comparison. Each of
            the java processes that are launched with -Xmx1536m. I
            believe that Blues advertises each node having access to
            64GB (<a moz-do-not-send="true" href="http://www.lcrc.anl.gov/about/Blues" target="_blank">http://www.lcrc.anl.gov/about/Blues</a>),
            so at least at first glance the memory issue doesn’t seem
            like it could be an issue.<br>
            <span class="HOEnZb"><font color="#888888"><br>
                Jonathan<br>
              </font></span>
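For what it's worth, the back-of-the-envelope numbers, plus a check of the
node's large-page pool (a sketch; -Xmx only caps the heap, so the real per-JVM
footprint is somewhat higher than 1.5GB):

  # 16 tasks/worker x 1536 MB heap ceiling = 24576 MB, comfortably under 64 GB
  echo $((16 * 1536)) MB

  # But JVM large pages typically come from a separate, preconfigured pool;
  # if that pool is small or exhausted, os::commit_memory can fail even
  # though plenty of ordinary RAM is free.
  grep -i huge /proc/meminfo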
            <div class="HOEnZb">
              <div class="h5"><br>
                On Jul 31, 2014, at 1:18 PM, Mihael Hategan <<a moz-do-not-send="true" href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>>
                wrote:<br>
                <br>
> Ok, so the workers die while the jobs are running and not much else is
> happening.
> My money is on the apps eating up all RAM and the kernel killing the
> worker.
>
> The question is how we check whether this is true or not. Ideas?
>
> Yadu, can you do me a favor and package all the PBS output files from
> this run?
>
> Jonathan, can you see if you get the same errors with tasksPerWorker=8?
>
> Mihael
>
> On Thu, 2014-07-31 at 12:34 -0500, Jonathan Ozik wrote:
>> Sure thing, it's attached below.
>>
>> Jonathan
>>
>> On Jul 31, 2014, at 12:09 PM, Mihael Hategan <hategan@mcs.anl.gov>
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> I can't see anything obvious in the worker logs, but they are pretty
>>> large. Can you also post the swift log from this run? It would make it
>>> easier to focus on the right time frame.
>>>
>>> Mihael
>>>
>>> On Thu, 2014-07-31 at 08:35 -0500, Jonathan Ozik wrote:
>>>> Hi all,
>>>>
>>>> I'm attaching the stdout and the worker logs below.
>>>>
>>>> Thanks for looking at these!
>>>>
>>>> Jonathan
>>>>
>>>> On Jul 31, 2014, at 12:37 AM, Jonathan Ozik <jozik@uchicago.edu>
>>>> wrote:
>>>>
>>>>> Woops, sorry about that. It's running now and the logs are being
>>>>> generated. Once the run is done I'll send you log files.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Jonathan
>>>>>
>>>>> On Jul 31, 2014, at 12:12 AM, Mihael Hategan <hategan@mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> Right. This isn't your fault. We should, though, probably talk
>>>>>> about addressing the issue.
>>>>>>
>>>>>> Mihael
>>>>>>
>>>>>> On Thu, 2014-07-31 at 00:10 -0500, Yadu Nand Babuji wrote:
>>>>>>> Mihael, thanks for spotting that. I added the comments to
>>>>>>> highlight the changes in the email.
>>>>>>>
>>>>>>> -Yadu
>>>>>>>
>>>>>>> On 07/31/2014 12:04 AM, Mihael Hategan wrote:
>>>>>>>> Hi Jonathan,
>>>>>>>>
>>>>>>>> I suspect that the site config is considering the comment to be
>>>>>>>> part of the value of the workerLogLevel property. We could
>>>>>>>> confirm this if you send us the swift log from this particular
>>>>>>>> run.
>>>>>>>>
>>>>>>>> To fix it, you could try to remove everything after DEBUG
>>>>>>>> (including all horizontal white space). In other words:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> workerloglevel=DEBUG
>>>>>>>> workerlogdirectory=/home/$USER/
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Mihael
>>>>>>>>
>>>>>>>> On Wed, 2014-07-30 at 23:50 -0500, Jonathan Ozik wrote:
>>>>>>>>> Hi Yadu,
>>>>>>>>>
>>>>>>>>> I'm getting errors indicating that DEBUG is an invalid worker
>>>>>>>>> logging level. I'm attaching the stdout below. Let me know if
>>>>>>>>> I'm doing something silly.
>>>>>>>>>
>>>>>>>>> Jonathan
>>>>>>>>>
>>>>>>>>> On Jul 30, 2014, at 8:10 PM, Yadu Nand Babuji
>>>>>>>>> <yadunand@uchicago.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jonathan,
>>>>>>>>>>
>>>>>>>>>> I ran a couple of tests on Blues with swift-0.95-RC6 and do
>>>>>>>>>> not see anything unusual.
>>>>>>>>>>
>>>>>>>>>> From your logs, it looks like workers are failing, so getting
>>>>>>>>>> worker logs would help.
>>>>>>>>>> Could you try running on Blues with the following
>>>>>>>>>> swift.properties and get us the worker*logs that would show
>>>>>>>>>> up in the workerlogdirectory?
>>>>>>>>>>
>>>>>>>>>> site=blues
>>>>>>>>>>
>>>>>>>>>> site.blues {
>>>>>>>>>>   jobManager=pbs
>>>>>>>>>>   jobQueue=shared
>>>>>>>>>>   maxJobs=4
>>>>>>>>>>   jobGranularity=1
>>>>>>>>>>   maxNodesPerJob=1
>>>>>>>>>>   tasksPerWorker=16
>>>>>>>>>>   taskThrottle=64
>>>>>>>>>>   initialScore=10000
>>>>>>>>>>   jobWalltime=00:48:00
>>>>>>>>>>   taskWalltime=00:40:00
>>>>>>>>>>   workerloglevel=DEBUG              # Adding debug for workers
>>>>>>>>>>   workerlogdirectory=/home/$USER/   # Logging directory on SFS
>>>>>>>>>>   workdir=$RUNDIRECTORY
>>>>>>>>>>   filesystem=local
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yadu
>>>>>>>>>>
>>>>>>>>>> On 07/30/2014 05:13 PM, Jonathan Ozik wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I figured there was some busy-ness involved!
>>>>>>>>>>> I ran the same test on Midway (using swift/0.95-RC5) and I didn't
>>>>>>>>>>> get the same issue. That is, the model run completed successfully.
>>>>>>>>>>> For the Blues run, I used a trunk distribution (as of May 29,
>>>>>>>>>>> 2014). I'm including one of the log files below. I'm also
>>>>>>>>>>> including the swift.properties file that was used for the blues
>>>>>>>>>>> runs below.
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>> Jonathan
>>>>>>>>>>>
>>>>>>>>>>> On Jul 30, 2014, at 4:44 PM, Michael Wilde <wilde@anl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Jonathan,
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, very sorry - we've been swamped. Thanks for the ping!
>>>>>>>>>>>>
>>>>>>>>>>>> I or one of the team will answer soon, on swift-user.
>>>>>>>>>>>>
>>>>>>>>>>>> (But the first question is: which Swift release, and can you
>>>>>>>>>>>> point us to, or send, the full log file?)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>
>>>>>>>>>>>> - Mike
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/14, 3:48 PM, Jonathan Ozik wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mike,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I didn't get a response yet so just wanted to make sure that
>>>>>>>>>>>>> the message came across.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jonathan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Begin forwarded message:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Jonathan Ozik <jozik@uchicago.edu>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Subject: [Swift-user] exception @ swift-int.k, line: 511,
>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Date: July 29, 2014 at 8:56:28 PM CDT
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To: Mihael Hategan <hategan@mcs.anl.gov>,
>>>>>>>>>>>>>> "swift-user@ci.uchicago.edu" <swift-user@ci.uchicago.edu>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm getting spurious errors in the jobs that I'm running on
>>>>>>>>>>>>>> Blues. The stdout includes exceptions like:
>>>>>>>>>>>>>> exception @ swift-int.k, line: 511
>>>>>>>>>>>>>> Caused by: Block task failed: Connection to worker lost
>>>>>>>>>>>>>> java.io.IOException: Broken pipe
>>>>>>>>>>>>>>   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>>>>>>>>>>>>>>   at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>>>>>>>>>>>>>   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>>>>>>>>>>>>>   at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>>>>>>>>>>>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
>>>>>>>>>>>>>>   at org.globus.cog.coaster.channels.NIOSender.write(NIOSender.java:168)
>>>>>>>>>>>>>>   at org.globus.cog.coaster.channels.NIOSender.run(NIOSender.java:133)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These seem to occur at different parts of the submitted
>>>>>>>>>>>>>> jobs. Let me know if there's a log file that you'd like to
>>>>>>>>>>>>>> look at.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In earlier attempts I was getting these warnings followed by
>>>>>>>>>>>>>> broken pipe errors:
>>>>>>>>>>>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>>>>>>>>>>>>> os::commit_memory(0x00000000a0000000, 704643072, 2097152, 0)
>>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
>>>>>>>>>>>>>> allocate large pages, falling back to regular pages
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apparently that's a known precursor of crashes on Java 7 as
>>>>>>>>>>>>>> described here
>>>>>>>>>>>>>> (http://www.oracle.com/technetwork/java/javase/7u51-relnotes-2085002.html):
>>>>>>>>>>>>>> Area: hotspot/gc
>>>>>>>>>>>>>> Synopsis: Crashes due to failure to allocate large pages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Linux, failures when allocating large pages can lead to
>>>>>>>>>>>>>> crashes. When running JDK 7u51 or later versions, the issue
>>>>>>>>>>>>>> can be recognized in two ways:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> • Before the crash happens one or more lines similar to this
>>>>>>>>>>>>>> will have been printed to the log:
>>>>>>>>>>>>>> os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0)
>>>>>>>>>>>>>> failed; error='Cannot allocate memory' (errno=12); Cannot
>>>>>>>>>>>>>> allocate large pages, falling back to regular pages
>>>>>>>>>>>>>> • If a hs_err file is generated it will contain a line
>>>>>>>>>>>>>> similar to this:
>>>>>>>>>>>>>> Large page allocation failures have occurred 3 times
>>>>>>>>>>>>>> The problem can be avoided by running with large page
>>>>>>>>>>>>>> support turned off, for example by passing the
>>>>>>>>>>>>>> "-XX:-UseLargePages" option to the java binary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See 8007074 (not public).
>>>>>>>>>>>>>>
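A quick way to look for that hs_err signature after a failed run (a sketch; by
default the JVM writes hs_err_pid<pid>.log in the process working directory
unless -XX:ErrorFile says otherwise, so point the search at your run
directories — the path below is a placeholder):

  find /path/to/run/dirs -name 'hs_err_pid*.log' 2>/dev/null \
    | xargs -r grep -l "Large page allocation failures"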
>>>>>>>>>>>>>> So I added the -XX:-UseLargePages option in the invocations
>>>>>>>>>>>>>> of Java code that I was responsible for. That seemed to get
>>>>>>>>>>>>>> rid of the warning and the crashes for a while, but perhaps
>>>>>>>>>>>>>> that was just a coincidence.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jonathan
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Michael Wilde
>>>>>>>>>>>> Mathematics and Computer Science          Computation Institute
>>>>>>>>>>>> Argonne National Laboratory               The University of Chicago
      <pre wrap="">_______________________________________________
Swift-user mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Swift-user@ci.uchicago.edu">Swift-user@ci.uchicago.edu</a>
<a class="moz-txt-link-freetext" href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user</a></pre>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
  </div>

_______________________________________________
Swift-user mailing list
Swift-user@ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user