<div dir="ltr">Did you reach a resolution on this?  I'm back in town now and could look at adding the ability to retrieve the exit code.<br><br>- Tim<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <span dir="ltr"><<a href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    Also: identical code seems to run fine at this setting:<br>
      PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl   # (4 nodes)<br>
    but fails at this setting:<br>
      PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl  # (8 nodes)<br>
    <br>
    Changing the app from calling DISCUS directly back to a shell wrapper
    around DISCUS shows that when the job fails, the app (shell wrapper)
    is indeed executed and logs a message to the app stdout file, which I
    see. But I don't see a message after DISCUS.<br>
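    <br>
    In outline, the wrapper is along these lines (the DISCUS path is the
    real binary; the marker text and structure are only an illustrative
    sketch, not the exact script):<br>
    <pre>#!/bin/bash
# Illustrative wrapper around DISCUS: print a marker to stdout before and
# after the call so the app stdout file shows how far execution got.
DISCUS=/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
echo "wrapper: starting DISCUS on $(hostname) at $(date)"
"$DISCUS" "$@"
rc=$?
echo "wrapper: DISCUS exited with status $rc"
exit $rc</pre>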
    <br>
    To test whether many concurrent DISCUS invocations (127) were causing
    a problem fatal to the whole job, I added a random sleep before
    DISCUS. This had no effect on the problem.<br>
    <br>
    Next I eliminated the actual DISCUS call and just did a random sleep
    of 0 to 9 seconds. Here I discovered that all the jobs with sleep
    > 0 failed in the same manner as with DISCUS, while the sleep-0
    cases all printed all their echoes and exited.<br>
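    <br>
    The sleep-only test app was essentially this (a sketch; the marker
    text is illustrative):<br>
    <pre>#!/bin/bash
# Illustrative sleep-only test app: replace the DISCUS call with a random
# 0-9 second sleep to see whether concurrent long-running apps alone
# trigger the failure.
n=$(( RANDOM % 10 ))
echo "wrapper: sleeping $n seconds instead of calling DISCUS"
sleep $n
echo "wrapper: sleep done, exiting 0"
exit 0</pre>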
    <br>
    So there is something more deeply wrong going on, having nothing to
    do with the DISCUS app. I'll try to narrow this down, and see whether
    it also happens on non-Cray systems like Midway.<br>
    <br>
    - Mike<div><div class="h5"><br>
    <br>
    <div>On 10/2/14 10:52 PM, Michael Wilde
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      Thanks, Tim. But stderr and stdout were both redirected to files,
      and all .err and .out files are empty.<br>
      <br>
      I'll double-check that these redirects are working, but in prior
      runs, I did indeed get app error and output in those files.<br>
      <br>
      It looks more to me like the app is failing to launch.<br>
      <br>
      - Mike<br>
      <br>
      <div>On 10/2/14 10:22 PM, Tim Armstrong
        wrote:<br>
      </div>
      <blockquote type="cite">
        
        <div dir="ltr">
          <div>The app stderr/stdout should go to Swift/T stderr/stdout
            unless redirected.  The problem is most likely discus
            returning a non-zero error code.<br>
            <br>
          </div>
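          <div>For example (this is just a sketch, reusing one failing
            command line from the log below), you could rerun a single
            invocation by hand from its working directory and check the
            shell's $? to see what discus returns:<br>
            <pre># Rerun one failing DISCUS invocation by hand (from the per-task work
# directory, so the ../ paths resolve) and print its exit status.
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus \
    --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 \
    ../Si.cll 0.004_0.264.inte 0.004_0.264.stru
echo "discus exit status: $?"</pre>
          </div>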
          - Tim<br>
        </div>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On Thu, Oct 2, 2014 at 8:42 PM,
            Michael Wilde <span dir="ltr"><<a href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
              I'm getting a failure from a Swift/T run on Titan that I
              can't diagnose.<br>
              <br>
              The script is running 16K DISCUS simulations, about 30
              seconds each, 16 per node, on 8 nodes: 127 workers, 1
              server. It seems to run well at smaller scale (e.g., 1K
              simulations).<br>
              <br>
              My PBS output file shows that some of the initial round of
              simulations fail.  It seems that when this happens, the
              Swift script exits within the first round of simulations,
              and none seem to even start.<br>
              <br>
              Some of the logs are below.  I'll continue to experiment
              and add more logging to try to isolate the cause.<br>
              <br>
              I was initially suspicious of an OOM problem, but I don't
              see any sign of that.<br>
              <br>
              In the line below, "Swift: external command failed:
              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
              --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
              ../Si.cll 0.004_0.264.inte 0.004_0.264.stru" - is there a
              way to tell what Swift thinks the exit status of these
              failing apps is?  What is causing the whole run to abort?<br>
              <br>
              It almost looks like one app, or a small number of them,
              are generating a SIGABRT, but that's not clear, and I have
              no stdout/err info from the app processes.<br>
              <br>
              I'll look for some more detailed debug options, and/or add
              some wrappers around the apps.<br>
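              <br>
              For instance, a minimal wrapper along these lines (the log
              directory and file names are just placeholders, a sketch
              rather than a finished script) would at least record each
              invocation's output and exit status:<br>
              <pre>#!/bin/bash
# Hypothetical debugging wrapper: capture DISCUS stdout/stderr and its
# exit status per invocation. LOGDIR is a placeholder path.
LOGDIR=${LOGDIR:-$PWD/app-logs}
mkdir -p "$LOGDIR"
tag="$(hostname).$$"
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus "$@" \
    > "$LOGDIR/discus.$tag.out" 2> "$LOGDIR/discus.$tag.err"
rc=$?
echo "discus $* -> exit status $rc" >> "$LOGDIR/discus.$tag.rc"
exit $rc</pre>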
              <br>
              Thanks,<br>
              <br>
              - Mike<br>
              <br>
              ---<br>
              <br>
              T$ cat output.txt.2115044.out<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll 0.004_0.264.inte 0.004_0.264.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll 0.004_0.272.inte 0.004_0.272.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75 ../Si.cll 0.260_0.052.inte 0.260_0.052.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll 0.004_0.084.inte 0.004_0.084.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll 0.260_0.024.inte 0.260_0.024.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll 0.004_0.056.inte 0.004_0.056.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll 0.004_0.364.inte 0.004_0.364.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75 ../Si.cll 0.004_0.072.inte 0.004_0.072.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              <br>
              Swift: external command failed: /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll 0.260_0.044.inte 0.260_0.044.stru<br>
              <br>
              Swift: killing MPI job...<br>
              ADLB_Abort(1)<br>
              MPI_Abort(1)<br>
              Application 7706378 exit codes: 134<br>
              Application 7706378 exit signals: Killed<br>
              Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016, inblocks ~48986, outblocks ~205<br>
              T$<br>
              <br>
              ---<br>
              <br>
              T$ cat output.txt<br>
              bash_profile: loading modules<br>
              DB /ccs/home/wildemj/.modules: loading modules...<br>
              Turbine: turbine-aprun.sh<br>
              10/02/2014 09:15PM<br>
              <br>
              TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine<br>
              SCRIPT:       /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl<br>
              PROCS:        128<br>
              NODES:        8<br>
              PPN:          16<br>
              WALLTIME:     00:15:00<br>
              <br>
              TURBINE_WORKERS: 127<br>
              ADLB_SERVERS:    1<br>
              <br>
              TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6<br>
              <br>
              JOB OUTPUT:<br>
              <br>
              Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2<br>
              Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4<br>
              Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6<br>
              Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3<br>
              Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5<br>
              Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1<br>
              Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7<br>
              _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2 21:15:32 2014] PE RANK 5 exit signal Aborted<br>
              _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2 21:15:32 2014] PE RANK 3 exit signal Aborted<br>
              _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE RANK 4 exit signal Aborted<br>
              _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE RANK 2 exit signal Aborted<br>
              [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application termination<br>
              _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2 21:15:32 2014] PE RANK 6 exit signal Aborted<br>
              T$<br>
              T$<br>
              T$<br>
              T$ cat turbine.log<br>
              JOB:               2115044<br>
              COMMAND:           stacking-faults3.tcl<br>
              HOSTNAME:          ccs.ornl.gov<br>
              SUBMITTED:         10/02/2014 09:14PM<br>
              PROCS:             128<br>
              PPN:               16<br>
              NODES:             8<br>
              TURBINE_WORKERS:<br>
              ADLB_SERVERS:<br>
              WALLTIME:          00:15:00<br>
              ADLB_EXHAUST_TIME:<br>
              T$<span><font color="#888888"><br>
                  <br>
                  <br>
                  <br>
                  <br>
                  -- <br>
                  Michael Wilde<br>
                  Mathematics and Computer Science          Computation
                  Institute<br>
                  Argonne National Laboratory               The
                  University of Chicago<br>
                  <br>
                </font></span></blockquote>
          </div>
          <br>
        </div>
        <br>
      </blockquote>
      <br>
      <pre cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
    </blockquote>
    <br>
    <pre cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
  </div></div></div>

<br>_______________________________________________<br>
ExM-user mailing list<br>
<a href="mailto:ExM-user@lists.mcs.anl.gov">ExM-user@lists.mcs.anl.gov</a><br>
<a href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br>
<br></blockquote></div><br></div>