<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix"><br>
      I'm working on it.  I'm having trouble getting the modules-based
      cc wrapper to compile simple things.  I'm reading posts like this
      one from people with similar issues: <br>
      <br>
      <a class="moz-txt-link-freetext" href="http://public.kitware.com/pipermail/paraview/2013-June/028679.html">http://public.kitware.com/pipermail/paraview/2013-June/028679.html</a><br>
      <br>
      On 10/16/2014 01:21 PM, Michael Wilde wrote:<br>
    </div>
    <blockquote cite="mid:54400C9B.2020109@anl.gov" type="cite">
      Tim, Justin, have you done the static build on Titan, and if so,
      has it resolved the problem?<br>
      <br>
      - Mike<br>
      <br>
      <div class="moz-cite-prefix">On 10/15/14 5:22 PM, Justin M Wozniak
        wrote:<br>
      </div>
      <blockquote
        cite="mid:65cdff0a-1ef6-4d8b-a579-a1fbd762dc60@email.android.com"
        type="cite"> <br>
        Bisection showed that loading the Turbine .so (shared library)
        from plain Tcl breaks the OS. <br>
        <br>
        I had OLCF reset my shell to a blank bash so I can make pure use
        of modules. I added support for this in the configuration.<br>
        <br>
        I have been in touch with Tim today about doing a static build
        on the Cray.<br>
        <br>
        <br>
        <div class="gmail_quote">On October 15, 2014 4:31:50 PM CDT,
          Michael Wilde <a moz-do-not-send="true"
            class="moz-txt-link-rfc2396E" href="mailto:wilde@anl.gov">&lt;wilde@anl.gov&gt;</a>
          wrote:
          <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
            0.8ex; border-left: 1px solid rgb(204, 204, 204);
            padding-left: 1ex;">
            <div class="moz-forward-container">Status as of Monday (but
              I know Justin did a few commits since...)<br>
              <br>
              -------- Forwarded Message --------
              <table class="moz-email-headers-table" border="0"
                cellpadding="0" cellspacing="0">
                <tbody>
                  <tr>
                    <th valign="BASELINE" align="RIGHT" nowrap="nowrap">Subject:


                    </th>
                    <td>Re: Latest on the Swift/T Titan problem?</td>
                  </tr>
                  <tr>
                    <th valign="BASELINE" align="RIGHT" nowrap="nowrap">Date:

                    </th>
                    <td>Mon, 13 Oct 2014 16:11:31 -0500</td>
                  </tr>
                  <tr>
                    <th valign="BASELINE" align="RIGHT" nowrap="nowrap">From:

                    </th>
                    <td>Justin M Wozniak <a moz-do-not-send="true"
                        class="moz-txt-link-rfc2396E"
                        href="mailto:wozniak@mcs.anl.gov">&lt;wozniak@mcs.anl.gov&gt;</a></td>
                  </tr>
                  <tr>
                    <th valign="BASELINE" align="RIGHT" nowrap="nowrap">To:

                    </th>
                    <td>Michael Wilde <a moz-do-not-send="true"
                        class="moz-txt-link-rfc2396E"
                        href="mailto:wilde@anl.gov">&lt;wilde@anl.gov&gt;</a></td>
                  </tr>
                </tbody>
              </table>
              <br>
              <br>
              <pre>Pure shell: works
Pure Tcl: works
ADLB batcher: works
&lt;-------&gt;
Swift/T: fails

Something in Turbine breaks the OS.  I have been bisecting over a 
stripped-down Turbine to identify exactly what this is.  If I strip down 
the Turbine link step, I can get it to work.  So I think it is something 
in our use of Cray modules.  (I try to not use modules/softenv but it is 
affecting the link.)

This may fix the Swift/NAMD issue, which would be a complementary 
positive outcome.

On 10/13/2014 12:10 PM, Michael Wilde wrote:
> One avenue I was wondering about for debugging:  What PrgEnv is the 
> Swift/T on Titan built for?
> I'm assuming gnu, but if so, how does that interact with the 
> PrgEnv-intel needed for the DISCUS app?
>
> I would think to handle such cases we need to run Turbine under the 
> PrgEnv it needs, and then run any app( ) calls that need a different 
> PrgEnv under a wrapper shell script that provides it, just for that 
> process.
>
> Which may in turn raise the issue of whether "module load 
> PrgEnv-intel" for example is scalable to hundreds to tens of 
> thousands of concurrent invocations.
>
> - Mike
>
> On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>>
>> Yes, I am testing launching tasks with a plain shell script and a 
>> plain Tcl script.  I will then move to the ADLB batcher program.
>>
>> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>>> Hi Justin,
>>>
>>> I did not get to spend any time on this over the weekend.
>>>
>>> What is your latest assessment of the issue?
>>>
>>> I did some runs Fri night that were disturbing: two runs of 256 
>>> sleeps on 128 cores, each yielding a different number of output 
>>> files (but missing ~ 5 in one case and ~ 10 in the other).
>>>
>>> I was going to do a test of a simple MPI app forking tasks with 
>>> system( ), to see if same failure pattern is seen or not, as #CPUs 
>>> increases from 128 to 1024.
>>>
...

-- 
Justin M Wozniak

</pre>
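              <br>
              The per-app wrapper idea in the quoted message above (run Turbine under its own PrgEnv, and give each app( ) call the PrgEnv it needs) might be sketched as follows. This is only an illustration, not how Swift/T launches apps: the helper name wrap_with_module is hypothetical, and it assumes a login bash on the compute node defines the "module" command. Depending on what is already loaded, a "module swap" may be needed instead of "module load".<br>
              <br>

```python
import shlex


def wrap_with_module(module_name, argv):
    """Build a bash command line that loads one module, then execs argv.

    Hypothetical helper: assumes a login bash provides the `module` command.
    """
    # Quote every piece so arguments with spaces survive the extra shell.
    inner = "module load %s || exit 1; exec %s" % (
        shlex.quote(module_name),
        shlex.join(argv),
    )
    return ["bash", "-lc", inner]


# Example: wrap a DISCUS call so only this one process sees PrgEnv-intel.
cmd = wrap_with_module("PrgEnv-intel", ["discus", "--macro", "../stacking-faults3.mac"])
print(cmd)
```

              <br>
              The returned list can be handed to whatever launches the app, so the module load cost is paid once per invocation, which is exactly the scalability question raised above.<br>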
              <br>
            </div>
            <div class="moz-cite-prefix">On 10/15/14 4:29 PM, Michael
              Wilde wrote:<br>
            </div>
            <blockquote cite="mid:543EE736.5020301@anl.gov" type="cite">
              Justin, can you report on your latest findings on this?<br>
              <br>
              (I will fwd Justin's latest notes from Monday...)<br>
              <br>
              - Mike<br>
              <br>
              <div class="moz-cite-prefix">On 10/15/14 4:27 PM, Tim
                Armstrong wrote:<br>
              </div>
              <blockquote
cite="mid:CAC0jiV4q7HbZfSHf0_-UasZFhTLeWHWeOVoaY=d04GMs4Ss48Q@mail.gmail.com"
                type="cite">
                <div dir="ltr">Did you reach a resolution on this?  I'm
                  back in town now and could look at adding the ability
                  to retrieve the exit code.<br>
                  <br>
                  - Tim<br>
                </div>
                <div class="gmail_extra"><br>
                  <div class="gmail_quote">On Fri, Oct 3, 2014 at 12:07
                    AM, Michael Wilde <span dir="ltr">&lt;<a
                        moz-do-not-send="true"
                        href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>&gt;</span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div bgcolor="#FFFFFF" text="#000000"> Also:
                        identical code seems to run fine at this
                        setting:<br>
                          PPN=16 turbine-cray-run.zsh -s titan.settings
                        -n 64 $tcl  # (4 nodes)<br>
                        but fails at this setting:<br>
                          PPN=16 turbine-cray-run.zsh -s titan.settings
                        -n 128 $tcl # (8 nodes)<br>
                        <br>
                        Changing the app that calls DISCUS directly back
                        to a shell wrapper around DISCUS shows that when
                        the job fails, the app (shell wrapper) is indeed
                        executed and logs a message to the app stdout
                        file, which I see. But I don't see a message
                        after DISCUS.<br>
                        <br>
                        To test if perhaps many concurrent DISCUS
                        invocations (127) are causing a problem which is
                        fatal to the whole job, I added a random sleep
                        before DISCUS.  This had no effect on the
                        problem.<br>
                        <br>
                        Next I eliminated the actual DISCUS call, and
                        just did a random sleep from 0 to 9 seconds.
                        Here I discovered that all the jobs with sleep
                        > 0 failed in the same manner as with DISCUS,
                        while the sleep 0 cases all printed all their
                        echoes and exited.<br>
                        <br>
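                        A minimal sketch of a wrapper for this kind of test, assuming nothing about the actual script used here: the function name run_logged and the stand-in child command are hypothetical. It logs before and after the command, so a missing "returned" line shows that the process died between the two messages.<br>
                        <br>

```python
import random
import subprocess
import sys
import time


def run_logged(argv, max_sleep=9):
    """Sleep a random 0..max_sleep seconds, run argv, and log around it."""
    delay = random.randint(0, max_sleep)
    print("wrapper: sleeping %d s before launch" % delay, flush=True)
    time.sleep(delay)
    # returncode is the exit status, or -N if the child died from signal N (POSIX).
    result = subprocess.run(argv)
    print("wrapper: command returned %d" % result.returncode, flush=True)
    return result.returncode


# Stand-in for the real app: a child process that exits with status 3.
status = run_logged([sys.executable, "-c", "raise SystemExit(3)"], max_sleep=0)
print("status:", status)
```

                        <br>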
                        So there is something more deeply wrong going
                        on, having nothing to do with the DISCUS app.
                        I'll try to narrow this down, and see if perhaps
                        it's happening on other non-Cray systems like
                        Midway.<br>
                        <br>
                        - Mike
                        <div>
                          <div class="h5"><br>
                            <br>
                            <div>On 10/2/14 10:52 PM, Michael Wilde
                              wrote:<br>
                            </div>
                            <blockquote type="cite"> Thanks, Tim. But,
                              stderr and stdout were both redirected to
                              files, and all .err and .out files are
                              empty.<br>
                              <br>
                              I'll double-check that these redirects are
                              working, but in prior runs, I did indeed
                              get app error and output in those files.<br>
                              <br>
                              It looks more to me like the app is
                              failing to launch.<br>
                              <br>
                              - Mike<br>
                              <br>
                              <div>On 10/2/14 10:22 PM, Tim Armstrong
                                wrote:<br>
                              </div>
                              <blockquote type="cite">
                                <div dir="ltr">
                                  <div>The app stderr/stdout should go
                                    to Swift/T stderr/stdout unless
                                    redirected.  The problem is most
                                    likely DISCUS returning a non-zero
                                    exit code.<br>
                                    <br>
                                  </div>
                                  - Tim<br>
                                </div>
                                <div class="gmail_extra"><br>
                                  <div class="gmail_quote">On Thu, Oct
                                    2, 2014 at 8:42 PM, Michael Wilde <span
                                      dir="ltr">&lt;<a
                                        moz-do-not-send="true"
                                        href="mailto:wilde@anl.gov"
                                        target="_blank">wilde@anl.gov</a>&gt;</span>
                                    wrote:<br>
                                    <blockquote class="gmail_quote"
                                      style="margin:0 0 0
                                      .8ex;border-left:1px #ccc
                                      solid;padding-left:1ex"><br>
                                      I'm getting a failure from a
                                      Swift/T run on Titan that I can't
                                      diagnose.<br>
                                      <br>
                                      The script is running 16K DISCUS
                                      simulations, about 30 secs each,
                                      16 per node, on 8 nodes: 127
                                      workers, 1 server. It seems to run
                                      well at smaller scale (e.g., 1K
                                      simulations).<br>
                                      <br>
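                                      As a back-of-the-envelope check of the layout just described, here is a small sketch. It assumes "16K" means 16384 and "1K" means 1024, that every task takes exactly 30 seconds, and that each worker runs one task at a time; the function names are made up for illustration.<br>
                                      <br>

```python
import math


def layout(nodes, ppn, servers=1):
    """Return (total ranks, worker ranks) for a nodes x ppn job."""
    procs = nodes * ppn
    return procs, procs - servers


def min_wall_seconds(tasks, workers, secs_per_task):
    """Lower bound on run time: full rounds of tasks, one task per worker."""
    return math.ceil(tasks / workers) * secs_per_task


procs, workers = layout(nodes=8, ppn=16)
print("ranks:", procs, "workers:", workers)
print("16K tasks, seconds:", min_wall_seconds(16384, workers, 30))
print("1K tasks, seconds:", min_wall_seconds(1024, workers, 30))
```

                                      <br>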
                                      My PBS output file shows that some
                                      of the initial round of
                                      simulations fail.  It seems that
                                      when this happens, the Swift
                                      script exits within the first
                                      round of simulations, and none
                                      seem to even start.<br>
                                      <br>
                                      Some of the logs are below.  I'll
                                      continue to experiment and add
                                      more logging to try to isolate the
                                      cause.<br>
                                      <br>
                                      I was initially suspicious of an
                                      OOM problem, but I don't see any
                                      sign of that.<br>
                                      <br>
                                      In the line below, "Swift: external
                                      command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.264 1 0.75 ../Si.cll
                                      0.004_0.264.inte 0.004_0.264.stru",
                                      is there a way to tell what Swift
                                      thinks the exit status of these
                                      failing apps is?  What is causing
                                      the whole run to abort?<br>
                                      <br>
                                      It almost looks like one app, or a
                                      small number of them, is
                                      generating a SIGABRT, but that's
                                      not clear, and I have no
                                      stdout/err info from the app
                                      processes.<br>
                                      <br>
                                      I'll look for some more detailed
                                      debug options, and/or add some
                                      wrappers around the apps.<br>
                                      <br>
                                      Thanks,<br>
                                      <br>
                                      - Mike<br>
                                      <br>
                                      ---<br>
                                      <br>
                                      T$ cat output.txt.2115044.out<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.264 1 0.75 ../Si.cll
                                      0.004_0.264.inte 0.004_0.264.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.272 1 0.75 ../Si.cll
                                      0.004_0.272.inte 0.004_0.272.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.26 0.052000000000000005 1 0.75
                                      ../Si.cll 0.260_0.052.inte
                                      0.260_0.052.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.084 1 0.75 ../Si.cll
                                      0.004_0.084.inte 0.004_0.084.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.26 0.024 1 0.75 ../Si.cll
                                      0.260_0.024.inte 0.260_0.024.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.056 1 0.75 ../Si.cll
                                      0.004_0.056.inte 0.004_0.056.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.364 1 0.75 ../Si.cll
                                      0.004_0.364.inte 0.004_0.364.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.004 0.07200000000000001 1 0.75
                                      ../Si.cll 0.004_0.072.inte
                                      0.004_0.072.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      <br>
                                      Swift: external command failed:
                                      /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                                      --macro ../stacking-faults3.mac
                                      0.26 0.044 1 0.75 ../Si.cll
                                      0.260_0.044.inte 0.260_0.044.stru<br>
                                      <br>
                                      Swift: killing MPI job...<br>
                                      ADLB_Abort(1)<br>
                                      MPI_Abort(1)<br>
                                      Application 7706378 exit codes:
                                      134<br>
                                      Application 7706378 exit signals:
                                      Killed<br>
                                      Application 7706378 resources:
                                      utime ~5s, stime ~19s, Rss ~11016,
                                      inblocks ~48986, outblocks ~205<br>
                                      T$<br>
                                      <br>
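                                      The "exit codes: 134" line above fits the SIGABRT guess earlier in this message, if one assumes the common shell convention that a status above 128 means 128 plus a signal number. Whether aprun reports statuses that way is an assumption here, and the helper name describe_exit_code is made up for illustration.<br>
                                      <br>

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit code (128 + N conventionally means signal N)."""
    if code > 128:
        signum = code - 128
        try:
            return "signal " + signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return "signal %d (unknown)" % signum
    return "exit status %d" % code


print(describe_exit_code(134))
print(describe_exit_code(1))
```

                                      <br>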
                                      ---<br>
                                      <br>
                                      T$ cat output.txt<br>
                                      bash_profile: loading modules<br>
                                      DB /ccs/home/wildemj/.modules:
                                      loading modules...<br>
                                      Turbine: turbine-aprun.sh<br>
                                      10/02/2014 09:15PM<br>
                                      <br>
                                      TURBINE_HOME:
                                      /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine<br>
                                      SCRIPT:
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl<br>
                                      PROCS:        128<br>
                                      NODES:        8<br>
                                      PPN:          16<br>
                                      WALLTIME:     00:15:00<br>
                                      <br>
                                      TURBINE_WORKERS: 127<br>
                                      ADLB_SERVERS:    1<br>
                                      <br>
                                      TCLSH:
                                      /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6<br>
                                      <br>
                                      JOB OUTPUT:<br>
                                      <br>
                                      Rank 2 [Thu Oct  2 21:15:32 2014]
                                      [c12-0c1s1n3] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 2<br>
                                      Rank 4 [Thu Oct  2 21:15:32 2014]
                                      [c14-0c1s1n3] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 4<br>
                                      Rank 6 [Thu Oct  2 21:15:32 2014]
                                      [c14-0c1s1n1] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 6<br>
                                      Rank 3 [Thu Oct  2 21:15:32 2014]
                                      [c14-0c1s1n2] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 3<br>
                                      Rank 5 [Thu Oct  2 21:15:32 2014]
                                      [c14-0c1s1n0] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 5<br>
                                      Rank 1 [Thu Oct  2 21:15:32 2014]
                                      [c12-0c1s1n2] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 1<br>
                                      Rank 7 [Thu Oct  2 21:15:32 2014]
                                      [c14-0c1s0n0] application called
                                      MPI_Abort(MPI_COMM_WORLD, 1) -
                                      process 7<br>
                                      _pmiu_daemon(SIGCHLD): [NID 06114]
                                      [c14-0c1s1n0] [Thu Oct  2 21:15:32
                                      2014] PE RANK 5 exit signal
                                      Aborted<br>
                                      _pmiu_daemon(SIGCHLD): [NID 06076]
                                      [c14-0c1s1n2] [Thu Oct  2 21:15:32
                                      2014] PE RANK 3 exit signal
                                      Aborted<br>
                                      _pmiu_daemon(SIGCHLD): [NID 06077]
                                      [c14-0c1s1n3] [Thu Oct  2 21:15:32
                                      2014] PE RANK 4 exit signal
                                      Aborted<br>
                                      _pmiu_daemon(SIGCHLD): [NID 04675]
                                      [c12-0c1s1n3] [Thu Oct  2 21:15:32
                                      2014] PE RANK 2 exit signal
                                      Aborted<br>
                                      [NID 04675] 2014-10-02 21:15:32
                                      Apid 7706378: initiated
                                      application termination<br>
                                      _pmiu_daemon(SIGCHLD): [NID 06115]
                                      [c14-0c1s1n1] [Thu Oct  2 21:15:32
                                      2014] PE RANK 6 exit signal
                                      Aborted<br>
                                      T$<br>
                                      T$<br>
                                      T$<br>
                                      T$ cat turbine.log<br>
                                      JOB:               2115044<br>
                                      COMMAND:         
                                       stacking-faults3.tcl<br>
                                      HOSTNAME:          <a
                                        moz-do-not-send="true"
                                        href="http://ccs.ornl.gov"
                                        target="_blank">ccs.ornl.gov</a><br>
                                      SUBMITTED:         10/02/2014
                                      09:14PM<br>
                                      PROCS:             128<br>
                                      PPN:               16<br>
                                      NODES:             8<br>
                                      TURBINE_WORKERS:<br>
                                      ADLB_SERVERS:<br>
                                      WALLTIME:          00:15:00<br>
                                      ADLB_EXHAUST_TIME:<br>
                                      T$<span><font color="#888888"><br>
                                          <br>
                                          <br>
                                          <br>
                                          <br>
                                          -- <br>
                                          Michael Wilde<br>
                                          Mathematics and Computer
                                          Science          Computation
                                          Institute<br>
                                          Argonne National Laboratory   
                                                     The University of
                                          Chicago<br>
                                          <br>
_______________________________________________<br>
                                          ExM-user mailing list<br>
                                          <a moz-do-not-send="true"
                                            href="mailto:ExM-user@lists.mcs.anl.gov"
                                            target="_blank">ExM-user@lists.mcs.anl.gov</a><br>
                                          <a moz-do-not-send="true"
                                            href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user"
                                            target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br>
                                        </font></span></blockquote>
                                  </div>
                                  <br>
                                </div>
                                <br>
                              </blockquote>
                              <br>
                            </blockquote>
                            <br>
                          </div>
                        </div>
                      </div>
                      <br>
                      <br>
                    </blockquote>
                  </div>
                  <br>
                </div>
              </blockquote>
              <br>
            </blockquote>
            <br>
          </blockquote>
        </div>
        <br>
        -- <br>
        Justin M Wozniak (via phone) </blockquote>
      <br>
    </blockquote>
    <br>
    <br>
    <pre class="moz-signature" cols="72">-- 
Justin M Wozniak</pre>
  </body>
</html>