<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Thanks, Tim. But stderr and stdout were both redirected to files,
    and all of the .err and .out files are empty.<br>
    <br>
    I'll double-check that these redirects are working, but in prior
    runs I did get the app's stderr and stdout in those files.<br>
    <br>
    It looks more to me like the app is failing to launch.<br>
    <br>
    - Mike<br>
    <br>
    <div class="moz-cite-prefix">On 10/2/14 10:22 PM, Tim Armstrong
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAC0jiV7NMkFCiiCdrO5FKH+4rdibKRosHsW5Peq8-3Hu4pdDhg@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <div dir="ltr">
        <div>The app stderr/stdout should go to Swift/T stderr/stdout
          unless redirected. The problem is most likely discus
          returning a non-zero exit code.<br>
          <br>
        </div>
        - Tim<br>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Thu, Oct 2, 2014 at 8:42 PM, Michael
          Wilde <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
            I'm getting a failure from a Swift/T run on Titan that I can't
            diagnose.<br>
            <br>
            The script is running 16K DISCUS simulations, about 30 seconds
            each, 16 per node, on 8 nodes: 127 workers and 1 server. It
            seems to run well at smaller scale (e.g., 1K simulations).<br>
            <br>
            My PBS output file shows that some of the initial round of
            simulations fail.  It seems that when this happens, the
            Swift script exits within the first round of simulations,
            and none seem to even start.<br>
            <br>
            Some of the logs are below.  I'll continue to experiment and
            add more logging to try to isolate the cause.<br>
            <br>
            I was initially suspicious of an OOM problem, but I don't see
            any sign of that.<br>
            <br>
            In the line below "Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
            0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell
            what Swift thinks the exit status of these failing apps
            is?  What is causing the whole run to abort?<br>
            <br>
            It almost looks like one app, or a small number of them, is
            generating a SIGABRT, but that's not clear, and I have no
            stdout/err info from the app processes.<br>
            <br>
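            For what it's worth, the exit code 134 that aprun reports in
            the output below matches the common shell convention of 128
            plus a signal number, which would make it signal 6 (SIGABRT
            on Linux). Here is a minimal Python sketch of that decoding.
            It assumes aprun follows the convention, and
            describe_exit_status is only an illustrative helper name,
            not part of Swift/T:<br>
<pre>
```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status in the range 0-255."""
    if status == 0:
        return "success"
    if status > 128:
        # Assumed convention: 128 + N means "terminated by signal N".
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            # Not a known signal number, so treat it as a plain exit code.
            return f"exit code {status}"
        return f"killed by {name} (128 + {status - 128})"
    return f"exit code {status}"


print(describe_exit_status(134))
```
</pre>
            Since the 134 is reported for the whole aprun application,
            it may only reflect the MPI_Abort calls rather than the
            status of any one discus process.<br>
            <br>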
            I'll look for some more detailed debug options, and/or add
            some wrappers around the apps.<br>
            <br>
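            A wrapper along these lines might do it. This is only a
            sketch under my own assumptions: run_and_report and the
            output file names are made up for illustration, and nothing
            here is a Swift/T feature. It runs one command, sends its
            stdout and stderr to files, and reports whether the program
            failed to launch, exited with a code, or was killed by a
            signal:<br>
<pre>
```python
import subprocess
import sys


def run_and_report(argv, out_path, err_path):
    """Run argv with output captured to files and describe how it ended."""
    with open(out_path, "wb") as out, open(err_path, "wb") as err:
        try:
            proc = subprocess.run(argv, stdout=out, stderr=err)
        except OSError as exc:
            # The program never started (missing binary, bad permissions, ...).
            return f"launch failed: {exc}"
    # subprocess reports termination by signal N as a return code of -N.
    signum = -proc.returncode
    if signum > 0:
        return f"killed by signal {signum}"
    return f"exit code {proc.returncode}"


if __name__ == "__main__":
    # Hypothetical usage: python report_exit.py PROGRAM [ARGS...]
    if sys.argv[1:]:
        print(run_and_report(sys.argv[1:], "app.out", "app.err"))
```
</pre>
            Note that the output files are created before the program
            starts, so empty files are consistent with either a launch
            failure or an app that died before writing anything.<br>
            <br>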
            Thanks,<br>
            <br>
            - Mike<br>
            <br>
            ---<br>
            <br>
            T$ cat output.txt.2115044.out<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
            0.004_0.264.inte 0.004_0.264.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
            0.004_0.272.inte 0.004_0.272.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1
            0.75 ../Si.cll 0.260_0.052.inte 0.260_0.052.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
            0.004_0.084.inte 0.004_0.084.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
            0.260_0.024.inte 0.260_0.024.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
            0.004_0.056.inte 0.004_0.056.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
            0.004_0.364.inte 0.004_0.364.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1
            0.75 ../Si.cll 0.004_0.072.inte 0.004_0.072.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            <br>
            Swift: external command failed:
            /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus

            --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
            0.260_0.044.inte 0.260_0.044.stru<br>
            <br>
            Swift: killing MPI job...<br>
            ADLB_Abort(1)<br>
            MPI_Abort(1)<br>
            Application 7706378 exit codes: 134<br>
            Application 7706378 exit signals: Killed<br>
            Application 7706378 resources: utime ~5s, stime ~19s, Rss
            ~11016, inblocks ~48986, outblocks ~205<br>
            T$<br>
            <br>
            ---<br>
            <br>
            T$ cat output.txt<br>
            bash_profile: loading modules<br>
            DB /ccs/home/wildemj/.modules: loading modules...<br>
            Turbine: turbine-aprun.sh<br>
            10/02/2014 09:15PM<br>
            <br>
            TURBINE_HOME:
            /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine<br>
            SCRIPT:
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl<br>
            PROCS:        128<br>
            NODES:        8<br>
            PPN:          16<br>
            WALLTIME:     00:15:00<br>
            <br>
            TURBINE_WORKERS: 127<br>
            ADLB_SERVERS:    1<br>
            <br>
            TCLSH:
            /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6<br>
            <br>
            JOB OUTPUT:<br>
            <br>
            Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 2<br>
            Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 4<br>
            Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 6<br>
            Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 3<br>
            Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 5<br>
            Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 1<br>
            Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application
            called MPI_Abort(MPI_COMM_WORLD, 1) - process 7<br>
            _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2
            21:15:32 2014] PE RANK 5 exit signal Aborted<br>
            _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2
            21:15:32 2014] PE RANK 3 exit signal Aborted<br>
            _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2
            21:15:32 2014] PE RANK 4 exit signal Aborted<br>
            _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2
            21:15:32 2014] PE RANK 2 exit signal Aborted<br>
            [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated
            application termination<br>
            _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2
            21:15:32 2014] PE RANK 6 exit signal Aborted<br>
            T$<br>
            T$<br>
            T$<br>
            T$ cat turbine.log<br>
            JOB:               2115044<br>
            COMMAND:           stacking-faults3.tcl<br>
            HOSTNAME:          <a moz-do-not-send="true"
              href="http://ccs.ornl.gov" target="_blank">ccs.ornl.gov</a><br>
            SUBMITTED:         10/02/2014 09:14PM<br>
            PROCS:             128<br>
            PPN:               16<br>
            NODES:             8<br>
            TURBINE_WORKERS:<br>
            ADLB_SERVERS:<br>
            WALLTIME:          00:15:00<br>
            ADLB_EXHAUST_TIME:<br>
            T$<span class="HOEnZb"><font color="#888888"><br>
                <br>
                <br>
                <br>
                <br>
                -- <br>
                Michael Wilde<br>
                Mathematics and Computer Science          Computation
                Institute<br>
                Argonne National Laboratory               The University
                of Chicago<br>
                <br>
                _______________________________________________<br>
                ExM-user mailing list<br>
                <a moz-do-not-send="true"
                  href="mailto:ExM-user@lists.mcs.anl.gov"
                  target="_blank">ExM-user@lists.mcs.anl.gov</a><br>
                <a moz-do-not-send="true"
                  href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user"
                  target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br>
              </font></span></blockquote>
        </div>
        <br>
      </div>
      <br>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
  </body>
</html>