<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Justin, can you report on your latest findings on this?<br>
    <br>
    (I will fwd Justin's latest notes from Monday...)<br>
    <br>
    - Mike<br>
    <br>
    <div class="moz-cite-prefix">On 10/15/14 4:27 PM, Tim Armstrong
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAC0jiV4q7HbZfSHf0_-UasZFhTLeWHWeOVoaY=d04GMs4Ss48Q@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div dir="ltr">Did you reach a resolution on this?  I'm back in
        town now and could look at adding the ability to retrieve the
        exit code.<br>
        <br>
        - Tim<br>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Fri, Oct 3, 2014 at 12:07 AM,
          Michael Wilde <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> Also: identical code
              seems to run fine at this setting:<br>
                PPN=16 turbine-cray-run.zsh -s titan.settings -n 64
              $tcl  # (4 nodes)<br>
              but fails at this setting:<br>
                PPN=16 turbine-cray-run.zsh -s titan.settings -n 128
              $tcl # (8 nodes)<br>
              <br>
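              For reference, the node and worker counts implied by these
              two settings can be worked out as below. This is only a
              sketch: the function name layout is hypothetical, and it
              assumes a single ADLB server, which the logs confirm for
              the 128-process run but not for the 64-process one.<br>
              <pre>
```python
def layout(procs, ppn, servers=1):
    """Return (nodes, workers) for a run with `procs` MPI ranks at `ppn` ranks per node."""
    nodes = -(-procs // ppn)   # ceiling division: a partly filled node still counts
    workers = procs - servers  # assumption: every rank that is not a server is a worker
    return nodes, workers


print(layout(64, 16))   # the setting that runs:  (4, 63)
print(layout(128, 16))  # the setting that fails: (8, 127)
```
</pre>
              The second result matches the NODES: 8 and TURBINE_WORKERS:
              127 lines in the job output.<br>
              <br>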
              Changing the app that calls DISCUS directly back to a
              shell wrapper around DISCUS shows that when the job fails,
              the app (the shell wrapper) is indeed executed and logs a
              message to the app stdout file, which I can see. But I don't
              see the message that should follow the DISCUS call.<br>
              <br>
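              A minimal sketch of this kind of wrapper follows, written
              in Python rather than as the shell script actually used.
              The name run_logged and the message text are hypothetical.
              The idea is to emit one flushed line before the launch and
              one after it that records the exit status, so a missing
              second line shows the process never returned.<br>
              <pre>
```python
import subprocess
import sys
import time


def run_logged(cmd, log=sys.stdout):
    """Run cmd, logging one line before launch and one line with the exit status after."""
    log.write("wrapper: starting %s\n" % " ".join(cmd))
    log.flush()  # flush right away so the line survives if the job is killed
    start = time.time()
    status = subprocess.call(cmd)  # negative value means killed by that signal number
    elapsed = time.time() - start
    log.write("wrapper: finished status=%d elapsed=%.1fs\n" % (status, elapsed))
    log.flush()
    return status


# Hypothetical usage: run_logged(["/path/to/discus", "--macro", "some.mac"])
```
</pre>
              If only the "starting" line appears in the app stdout file,
              the wrapper was most likely killed while the child was
              still running.<br>
              <br>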
              To test if perhaps many concurrent DISCUS invocations
              (127) are causing a problem which is fatal to the whole
              job, I added a random sleep before DISCUS.  This had no
              effect on the problem.<br>
              <br>
              Next I eliminated the actual DISCUS call and just did a
              random sleep from 0 to 9 seconds. Here I discovered that
              all the jobs with sleep &gt; 0 failed in the same manner
              as with DISCUS, while the sleep 0 cases all printed all
              their echoes and exited.<br>
              <br>
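              The sleep-only stand-in described above could look roughly
              like the sketch below. It is in Python for illustration
              (the real test was a shell wrapper); the name standin_task
              and the message text are hypothetical.<br>
              <pre>
```python
import random
import time


def standin_task(max_seconds=9, rng=random, sleep=time.sleep):
    """Sleep a random whole number of seconds in [0, max_seconds], echoing before and after."""
    delay = rng.randint(0, max_seconds)  # randint includes both end points
    print("standin: sleeping %d s" % delay, flush=True)
    sleep(delay)
    print("standin: done after %d s" % delay, flush=True)
    return delay
```
</pre>
              Comparing which tasks print the second line against their
              delay is what separates the sleep 0 cases from the rest.<br>
              <br>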
              So there is something more deeply wrong going on, having
              nothing to do with the DISCUS app. I'll try to narrow this
              down and see if perhaps it's happening on other, non-Cray
              systems like Midway.<br>
              <br>
              - Mike
              <div>
                <div class="h5"><br>
                  <br>
                  <div>On 10/2/14 10:52 PM, Michael Wilde wrote:<br>
                  </div>
                  <blockquote type="cite"> Thanks, Tim. But, stderr and
                    stdout were both redirected to files, and all .err
                    and .out files are empty.<br>
                    <br>
                    I'll double-check that these redirects are working,
                    but in prior runs, I did indeed get app error and
                    output in those files.<br>
                    <br>
                    It looks more to me like the app is failing to
                    launch.<br>
                    <br>
                    - Mike<br>
                    <br>
                    <div>On 10/2/14 10:22 PM, Tim Armstrong wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>The app stderr/stdout should go to Swift/T
                          stderr/stdout unless redirected.  The problem
                          is most likely discus returning a non-zero
                          error code.<br>
                          <br>
                        </div>
                        - Tim<br>
                      </div>
                      <div class="gmail_extra"><br>
                        <div class="gmail_quote">On Thu, Oct 2, 2014 at
                          8:42 PM, Michael Wilde <span dir="ltr">&lt;<a
                              moz-do-not-send="true"
                              href="mailto:wilde@anl.gov"
                              target="_blank">wilde@anl.gov</a>&gt;</span>
                          wrote:<br>
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex"><br>
                            I'm getting a failure from a Swift/T run on
                            Titan that I can't diagnose.<br>
                            <br>
                            The script is running 16K DISCUS
                            simulations, about 30 secs each, 16 per
                            node, on 8 nodes: 127 workers, 1 server. It
                            seems to run well at smaller scale (e.g., 1K
                            simulations).<br>
                            <br>
                            My PBS output file shows that some of the
                            initial round of simulations fail.  It seems
                            that when this happens, the Swift script
                            exits within the first round of simulations,
                            and none seem to even start.<br>
                            <br>
                            Some of the logs are below.  I'll continue
                            to experiment and add more logging to try to
                            isolate the cause.<br>
                            <br>
                            I was initially suspicious of an OOM
                            problem, but I don't see any sign of that.<br>
                            <br>
                            In the line below "Swift: external command
                            failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
                            --macro ../stacking-faults3.mac 0.004 0.264
                            1 0.75 ../Si.cll 0.004_0.264.inte
                            0.004_0.264.stru" - is there a way to tell
                            what Swift thinks the exit status of each of
                            these failing apps is?  What is causing the
                            whole run to abort?<br>
                            <br>
                            It almost looks like one app, or a small
                            number of them, is generating a SIGABRT,
                            but that's not clear, and I have no
                            stdout/err info from the app processes.<br>
                            <br>
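                            As a cross-check on the SIGABRT reading: the
                            PBS output below reports "exit codes: 134".
                            If the launcher follows the common shell
                            convention of 128 + signal number, 134 is
                            signal 6, which is SIGABRT on Linux. The
                            sketch below decodes such codes; the function
                            name describe_exit_code is hypothetical, and
                            the 128 + N convention is an assumption about
                            how aprun reports the code.<br>
                            <pre>
```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit code; codes above 128 conventionally mean 128 + signal."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"  # number is not a signal on this platform
        return "killed by signal %d (%s)" % (signum, name)
    if code == 0:
        return "success"
    return "exited with status %d" % code


print(describe_exit_code(134))
```
</pre>
                            Since the job output also shows each rank
                            calling MPI_Abort just before "exit signal
                            Aborted", the abort signal may come from the
                            Swift/T ranks shutting down rather than from
                            the app itself.<br>
                            <br>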
                            I'll look for some more detailed debug
                            options, and/or add some wrappers around the
                            apps.<br>
                            <br>
                            Thanks,<br>
                            <br>
                            - Mike<br>
                            <br>
                            ---<br>
                            <br>
                            T$ cat output.txt.2115044.out<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004 0.264
                            1 0.75 ../Si.cll 0.004_0.264.inte
                            0.004_0.264.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004 0.272
                            1 0.75 ../Si.cll 0.004_0.272.inte
                            0.004_0.272.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.26
                            0.052000000000000005 1 0.75 ../Si.cll
                            0.260_0.052.inte 0.260_0.052.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004 0.084
                            1 0.75 ../Si.cll 0.004_0.084.inte
                            0.004_0.084.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.26 0.024 1
                            0.75 ../Si.cll 0.260_0.024.inte
                            0.260_0.024.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004 0.056
                            1 0.75 ../Si.cll 0.004_0.056.inte
                            0.004_0.056.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004 0.364
                            1 0.75 ../Si.cll 0.004_0.364.inte
                            0.004_0.364.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.004
                            0.07200000000000001 1 0.75 ../Si.cll
                            0.004_0.072.inte 0.004_0.072.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            <br>
                            Swift: external command failed:
                            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus



                            --macro ../stacking-faults3.mac 0.26 0.044 1
                            0.75 ../Si.cll 0.260_0.044.inte
                            0.260_0.044.stru<br>
                            <br>
                            Swift: killing MPI job...<br>
                            ADLB_Abort(1)<br>
                            MPI_Abort(1)<br>
                            Application 7706378 exit codes: 134<br>
                            Application 7706378 exit signals: Killed<br>
                            Application 7706378 resources: utime ~5s,
                            stime ~19s, Rss ~11016, inblocks ~48986,
                            outblocks ~205<br>
                            T$<br>
                            <br>
                            ---<br>
                            <br>
                            T$ cat output.txt<br>
                            bash_profile: loading modules<br>
                            DB /ccs/home/wildemj/.modules: loading
                            modules...<br>
                            Turbine: turbine-aprun.sh<br>
                            10/02/2014 09:15PM<br>
                            <br>
                            TURBINE_HOME:
                            /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine<br>
                            SCRIPT:
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl<br>
                            PROCS:        128<br>
                            NODES:        8<br>
                            PPN:          16<br>
                            WALLTIME:     00:15:00<br>
                            <br>
                            TURBINE_WORKERS: 127<br>
                            ADLB_SERVERS:    1<br>
                            <br>
                            TCLSH:
                            /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6<br>
                            <br>
                            JOB OUTPUT:<br>
                            <br>
                            Rank 2 [Thu Oct  2 21:15:32 2014]
                            [c12-0c1s1n3] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 2<br>
                            Rank 4 [Thu Oct  2 21:15:32 2014]
                            [c14-0c1s1n3] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 4<br>
                            Rank 6 [Thu Oct  2 21:15:32 2014]
                            [c14-0c1s1n1] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 6<br>
                            Rank 3 [Thu Oct  2 21:15:32 2014]
                            [c14-0c1s1n2] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 3<br>
                            Rank 5 [Thu Oct  2 21:15:32 2014]
                            [c14-0c1s1n0] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 5<br>
                            Rank 1 [Thu Oct  2 21:15:32 2014]
                            [c12-0c1s1n2] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 1<br>
                            Rank 7 [Thu Oct  2 21:15:32 2014]
                            [c14-0c1s0n0] application called
                            MPI_Abort(MPI_COMM_WORLD, 1) - process 7<br>
                            _pmiu_daemon(SIGCHLD): [NID 06114]
                            [c14-0c1s1n0] [Thu Oct  2 21:15:32 2014] PE
                            RANK 5 exit signal Aborted<br>
                            _pmiu_daemon(SIGCHLD): [NID 06076]
                            [c14-0c1s1n2] [Thu Oct  2 21:15:32 2014] PE
                            RANK 3 exit signal Aborted<br>
                            _pmiu_daemon(SIGCHLD): [NID 06077]
                            [c14-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE
                            RANK 4 exit signal Aborted<br>
                            _pmiu_daemon(SIGCHLD): [NID 04675]
                            [c12-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE
                            RANK 2 exit signal Aborted<br>
                            [NID 04675] 2014-10-02 21:15:32 Apid
                            7706378: initiated application termination<br>
                            _pmiu_daemon(SIGCHLD): [NID 06115]
                            [c14-0c1s1n1] [Thu Oct  2 21:15:32 2014] PE
                            RANK 6 exit signal Aborted<br>
                            T$<br>
                            T$<br>
                            T$<br>
                            T$ cat turbine.log<br>
                            JOB:               2115044<br>
                            COMMAND:           stacking-faults3.tcl<br>
                            HOSTNAME:          <a
                              moz-do-not-send="true"
                              href="http://ccs.ornl.gov" target="_blank">ccs.ornl.gov</a><br>
                            SUBMITTED:         10/02/2014 09:14PM<br>
                            PROCS:             128<br>
                            PPN:               16<br>
                            NODES:             8<br>
                            TURBINE_WORKERS:<br>
                            ADLB_SERVERS:<br>
                            WALLTIME:          00:15:00<br>
                            ADLB_EXHAUST_TIME:<br>
                            T$<span><font color="#888888"><br>
                                <br>
                                <br>
                                <br>
                                <br>
                                -- <br>
                                Michael Wilde<br>
                                Mathematics and Computer Science       
                                  Computation Institute<br>
                                Argonne National Laboratory             
                                 The University of Chicago<br>
                                <br>
_______________________________________________<br>
                                ExM-user mailing list<br>
                                <a moz-do-not-send="true"
                                  href="mailto:ExM-user@lists.mcs.anl.gov"
                                  target="_blank">ExM-user@lists.mcs.anl.gov</a><br>
                                <a moz-do-not-send="true"
                                  href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user"
                                  target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br>
                              </font></span></blockquote>
                        </div>
                        <br>
                      </div>
                      <br>
                    </blockquote>
                    <br>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
            <br>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
  </body>
</html>