<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body bgcolor="#FFFFFF" text="#000000"><br>
Bisection showed that loading the Turbine .so from plain Tcl breaks the OS. <br>
<br>
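A minimal sketch of the kind of plain-Tcl load test used in that bisection (the tclsh path, package name, and aprun arguments here are placeholders/assumptions, not the exact commands):<br>
<pre>
# Hypothetical bisection step: launch a bare tclsh under aprun and load only
# the Turbine Tcl package, with none of the usual Turbine link/launch
# machinery around it.  Paths and the package name are assumptions.
aprun -n 2 /path/to/tclsh8.6 load-test.tcl

# where load-test.tcl contains only:
#   puts "before load"
#   package require turbine   ;# package name is an assumption
#   puts "after load"
</pre>
<br>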
I had OLCF reset my shell to a clean bash so that I can rely purely on modules.  I added support for this in the configuration.<br>
<br>
I have been in touch with Tim today about doing a static build on the Cray.<br>
<br><br><div class="gmail_quote">On October 15, 2014 4:31:50 PM CDT, Michael Wilde <wilde@anl.gov> wrote:<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">


  
  
    <div class="moz-forward-container">Status as of Monday (but I know
      Justin did a few commits since...)<br />
      <br />
      -------- Forwarded Message --------
      <table class="moz-email-headers-table" border="0" cellpadding="0" cellspacing="0"><tbody><tr><th align="RIGHT" nowrap="nowrap" valign="BASELINE">Subject:
            </th><td>Re: Latest on the Swift/T Titan problem?</td></tr><tr><th align="RIGHT" nowrap="nowrap" valign="BASELINE">Date: </th><td>Mon, 13 Oct 2014 16:11:31 -0500</td></tr><tr><th align="RIGHT" nowrap="nowrap" valign="BASELINE">From: </th><td>Justin M Wozniak <a class="moz-txt-link-rfc2396E" href="mailto:wozniak@mcs.anl.gov"><wozniak@mcs.anl.gov></a></td></tr><tr><th align="RIGHT" nowrap="nowrap" valign="BASELINE">To: </th><td>Michael Wilde <a class="moz-txt-link-rfc2396E" href="mailto:wilde@anl.gov"><wilde@anl.gov></a></td></tr></tbody></table>
      <br />
      <br />
      <pre>Pure shell: works
Pure Tcl: works
ADLB batcher: works
-------
Swift/T: fails

Something in Turbine breaks the OS.  I have been bisecting over a 
stripped-down Turbine to identify exactly what this is.  If I strip down 
the Turbine link step, I can get it to work.  So I think it is something 
in our use of Cray modules.  (I try not to use modules/softenv, but it is 
affecting the link.)

This may fix the Swift/NAMD issue, which would be a complementary 
positive outcome.

On 10/13/2014 12:10 PM, Michael Wilde wrote:
> One avenue I was wondering about for debugging:  What PrgEnv is the 
> Swift/T on Titan built for?
> I'm assuming gnu, but if so, how does that interact with the 
> PrgEnv-intel needed for the DISCUS app?
>
> I would think to handle such cases we need to run Turbine under the 
> PrgEnv it needs, and then run any app() calls that need a different 
> PrgEnv under a wrapper shell script that provides it, just for that 
> process (see the sketch after this message).
>
> Which may in turn raise the issue of whether "module load 
> PrgEnv-intel", for example, is scalable to hundreds or tens of 
> thousands of concurrent invocations.
>
> - Mike
>
> On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>>
>> Yes, I am testing launching tasks with a plain shell script and a 
>> plain Tcl script.  I will then move to the ADLB batcher program.
>>
>> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>>> Hi Justin,
>>>
>>> I did not get to spend any time on this over the weekend.
>>>
>>> What is your latest assessment of the issue?
>>>
>>> I did some runs Fri night that were disturbing: two runs of 256 
>>> sleeps on 128 cores, each yielding a different number of output 
>>> files (but missing ~ 5 in one case and ~ 10 in the other).
>>>
> I was going to do a test of a simple MPI app forking tasks with 
> system(), to see if the same failure pattern is seen or not, as #CPUs 
> increases from 128 to 1024.
>>>
...

-- 
Justin M Wozniak

</pre>
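Below is a minimal sketch of the per-app PrgEnv wrapper suggested in the quoted message above; the module names, the commented-out init path, and the usage line are assumptions for illustration, not a tested recipe.<br />
<pre>
#!/bin/bash
# Hypothetical per-app wrapper: give one app() invocation its own PrgEnv
# without changing the environment Turbine itself runs under.
# Module names and the commented-out init path are assumptions.
#
# Illustrative usage (paths assumed):
#   prgenv-intel-wrapper.sh .../discus --macro ../stacking-faults3.mac ...

# Non-interactive shells may need the modules init sourced first, e.g.:
#   source /opt/modules/default/init/bash

module swap PrgEnv-gnu PrgEnv-intel   # swap affects only this process tree
exec "$@"                             # run the wrapped app with its arguments
</pre>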
      <br />
    </div>
    <div class="moz-cite-prefix">On 10/15/14 4:29 PM, Michael Wilde
      wrote:<br />
    </div>
    <blockquote cite="mid:543EE736.5020301@anl.gov" type="cite">
      
      Justin, can you report on your latest findings on this?<br />
      <br />
      (I will fwd Justin's latest notes from Monday...)<br />
      <br />
      - Mike<br />
      <br />
      <div class="moz-cite-prefix">On 10/15/14 4:27 PM, Tim Armstrong
        wrote:<br />
      </div>
      <blockquote cite="mid:CAC0jiV4q7HbZfSHf0_-UasZFhTLeWHWeOVoaY=d04GMs4Ss48Q@mail.gmail.com" type="cite">
        
        <div dir="ltr">Did you reach a resolution on this?  I'm back in
          town now and could look at adding the ability to retrieve the
          exit code.<br />
          <br />
          - Tim<br />
        </div>
        <div class="gmail_extra"><br />
          <div class="gmail_quote">On Fri, Oct 3, 2014 at 12:07 AM,
            Michael Wilde <span dir="ltr"><<a moz-do-not-send="true" href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>></span>
            wrote:<br />
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> Also: identical
                code seems to run fine at this setting:<br />
                  PPN=16 turbine-cray-run.zsh -s titan.settings -n 64
                $tcl  # (4 nodes)<br />
                but fails at this setting:<br />
                  PPN=16 turbine-cray-run.zsh -s titan.settings -n 128
                $tcl # (8 nodes)<br />
                <br />
                Changing the app that calls DISCUS directly back to a
                shell wrapper around DISCUS shows that when the job 
fails, the app (shell wrapper) is indeed executed, and
logs a message to the app stdout file, which I see. But I
don't see a message after DISCUS.<br />
                <br />
                To test if perhaps many concurrent DISCUS invocations
                (127) are causing a problem which is fatal to the whole
                job, I added a random sleep before DISCUS.  This had no
                effect on the problem.<br />
                <br />
                Next I eliminated the actual DISCUS call, and just did a
                random sleep from 0 to 9 seconds. Here I discovered that
all the jobs with sleep > 0 failed in the same manner
as with DISCUS, while the sleep 0 cases all printed all
their echoes and exited.<br />
                <br />
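(For reference, a minimal sketch of the kind of stand-in used for that test; the echo markers and the sleep range are illustrative assumptions.)<br />
<pre>
#!/bin/bash
# Hypothetical DISCUS stand-in: print a marker, sleep a random 0-9 seconds,
# print another marker, and exit 0.
echo "task $$ start"
sleep $(( RANDOM % 10 ))
echo "task $$ done"
</pre>
<br />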
                So there is something more deeply wrong going on, having
                nothing to do with the DISCUS app. I'll try to narrow
this down, and see if perhaps it's happening on other
                non-Cray systems like Midway.<br />
                <br />
                - Mike
                <div>
                  <div class="h5"><br />
                    <br />
                    <div>On 10/2/14 10:52 PM, Michael Wilde wrote:<br />
                    </div>
                    <blockquote type="cite"> Thanks, Tim. But, stderr
                      and stdout were both redirected to files, and all
                      .err and .out files are empty.<br />
                      <br />
                      I'll double-check that these redirects are
                      working, but in prior runs, I did indeed get app
                      error and output in those files.<br />
                      <br />
                      It looks more to me like the app is failing to
                      launch.<br />
                      <br />
                      - Mike<br />
                      <br />
                      <div>On 10/2/14 10:22 PM, Tim Armstrong wrote:<br />
                      </div>
                      <blockquote type="cite">
                        <div dir="ltr">
                          <div>The app stderr/stdout should go to
                            Swift/T stderr/stdout unless redirected. 
                            The problem is most likely discus returning
                            a non-zero error code.<br />
                            <br />
                          </div>
                          - Tim<br />
                        </div>
                        <div class="gmail_extra"><br />
                          <div class="gmail_quote">On Thu, Oct 2, 2014
                            at 8:42 PM, Michael Wilde <span dir="ltr"><<a moz-do-not-send="true" href="mailto:wilde@anl.gov" target="_blank">wilde@anl.gov</a>></span>
                            wrote:<br />
                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px
                              #ccc solid;padding-left:1ex"><br />
I'm getting a failure from a Swift/T run on
Titan that I can't diagnose.<br />
                              <br />
                              The script is running 16K DISCUS
                              simulations, about 30 secs each, 16 per
node, on 8 nodes, with 127 workers and 1 server.
It seems to run well at smaller scale (e.g.,
1K simulations).<br />
                              <br />
                              My PBS output file shows that some of the
                              initial round of simulations fail.  It
                              seems that when this happens, the Swift
                              script exits within the first round of
                              simulations, and none seem to even start.<br />
                              <br />
                              Some of the logs are below.  I'll continue
                              to experiment and add more logging to try
                              to isolate the cause.<br />
                              <br />
                              I was initially suspicious of an OOM
problem, but I don't see any sign of that.<br />
                              <br />
                              In the line below "Swift: external command
                              failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.264 1 0.75 ../Si.cll 0.004_0.264.inte
                              0.004_0.264.stru" - is there a way to tell
what Swift thinks the exit status of these
failing apps is?  What is causing the
                              whole run to abort?<br />
                              <br />
                              It almost looks like one app, or a small
                              number of them, are generating a SIGABRT,
but that's not clear, and I have no
                              stdout/err info from the app processes.<br />
                              <br />
                              I'll look for some more detailed debug
                              options, and/or add some wrappers around
                              the apps.<br />
                              <br />
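One possible shape for such a wrapper, capturing per-task stdout/stderr and the exit status (the file naming here is an assumption):<br />
<pre>
#!/bin/bash
# Hypothetical debugging wrapper: run the real command, capture its output
# per task, and record the exit status before passing it back to Swift/T.
log="app.$$"
"$@" > "$log.out" 2> "$log.err"
rc=$?
echo "exit status: $rc  cmd: $*" >> "$log.err"
exit $rc
</pre>
<br />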
                              Thanks,<br />
                              <br />
                              - Mike<br />
                              <br />
                              ---<br />
                              <br />
                              T$ cat output.txt.2115044.out<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.264 1 0.75 ../Si.cll 0.004_0.264.inte
                              0.004_0.264.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.272 1 0.75 ../Si.cll 0.004_0.272.inte
                              0.004_0.272.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.26
                              0.052000000000000005 1 0.75 ../Si.cll
                              0.260_0.052.inte 0.260_0.052.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.084 1 0.75 ../Si.cll 0.004_0.084.inte
                              0.004_0.084.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.26 0.024
                              1 0.75 ../Si.cll 0.260_0.024.inte
                              0.260_0.024.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.056 1 0.75 ../Si.cll 0.004_0.056.inte
                              0.004_0.056.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.364 1 0.75 ../Si.cll 0.004_0.364.inte
                              0.004_0.364.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.004
                              0.07200000000000001 1 0.75 ../Si.cll
                              0.004_0.072.inte 0.004_0.072.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              <br />
                              Swift: external command failed:
                              /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus




                              --macro ../stacking-faults3.mac 0.26 0.044
                              1 0.75 ../Si.cll 0.260_0.044.inte
                              0.260_0.044.stru<br />
                              <br />
                              Swift: killing MPI job...<br />
                              ADLB_Abort(1)<br />
                              MPI_Abort(1)<br />
                              Application 7706378 exit codes: 134<br />
                              Application 7706378 exit signals: Killed<br />
                              Application 7706378 resources: utime ~5s,
                              stime ~19s, Rss ~11016, inblocks ~48986,
                              outblocks ~205<br />
                              T$<br />
                              <br />
                              ---<br />
                              <br />
                              T$ cat output.txt<br />
                              bash_profile: loading modules<br />
                              DB /ccs/home/wildemj/.modules: loading
                              modules...<br />
                              Turbine: turbine-aprun.sh<br />
                              10/02/2014 09:15PM<br />
                              <br />
                              TURBINE_HOME:
                              /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine<br />
                              SCRIPT:
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl<br />
                              PROCS:        128<br />
                              NODES:        8<br />
                              PPN:          16<br />
                              WALLTIME:     00:15:00<br />
                              <br />
                              TURBINE_WORKERS: 127<br />
                              ADLB_SERVERS:    1<br />
                              <br />
                              TCLSH:
                              /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6<br />
                              <br />
                              JOB OUTPUT:<br />
                              <br />
                              Rank 2 [Thu Oct  2 21:15:32 2014]
                              [c12-0c1s1n3] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 2<br />
                              Rank 4 [Thu Oct  2 21:15:32 2014]
                              [c14-0c1s1n3] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 4<br />
                              Rank 6 [Thu Oct  2 21:15:32 2014]
                              [c14-0c1s1n1] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 6<br />
                              Rank 3 [Thu Oct  2 21:15:32 2014]
                              [c14-0c1s1n2] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 3<br />
                              Rank 5 [Thu Oct  2 21:15:32 2014]
                              [c14-0c1s1n0] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 5<br />
                              Rank 1 [Thu Oct  2 21:15:32 2014]
                              [c12-0c1s1n2] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 1<br />
                              Rank 7 [Thu Oct  2 21:15:32 2014]
                              [c14-0c1s0n0] application called
                              MPI_Abort(MPI_COMM_WORLD, 1) - process 7<br />
                              _pmiu_daemon(SIGCHLD): [NID 06114]
                              [c14-0c1s1n0] [Thu Oct  2 21:15:32 2014]
                              PE RANK 5 exit signal Aborted<br />
                              _pmiu_daemon(SIGCHLD): [NID 06076]
                              [c14-0c1s1n2] [Thu Oct  2 21:15:32 2014]
                              PE RANK 3 exit signal Aborted<br />
                              _pmiu_daemon(SIGCHLD): [NID 06077]
                              [c14-0c1s1n3] [Thu Oct  2 21:15:32 2014]
                              PE RANK 4 exit signal Aborted<br />
                              _pmiu_daemon(SIGCHLD): [NID 04675]
                              [c12-0c1s1n3] [Thu Oct  2 21:15:32 2014]
                              PE RANK 2 exit signal Aborted<br />
                              [NID 04675] 2014-10-02 21:15:32 Apid
                              7706378: initiated application termination<br />
                              _pmiu_daemon(SIGCHLD): [NID 06115]
                              [c14-0c1s1n1] [Thu Oct  2 21:15:32 2014]
                              PE RANK 6 exit signal Aborted<br />
                              T$<br />
                              T$<br />
                              T$<br />
                              T$ cat turbine.log<br />
                              JOB:               2115044<br />
                              COMMAND:           stacking-faults3.tcl<br />
                              HOSTNAME:          <a moz-do-not-send="true" href="http://ccs.ornl.gov" target="_blank">ccs.ornl.gov</a><br />
                              SUBMITTED:         10/02/2014 09:14PM<br />
                              PROCS:             128<br />
                              PPN:               16<br />
                              NODES:             8<br />
                              TURBINE_WORKERS:<br />
                              ADLB_SERVERS:<br />
                              WALLTIME:          00:15:00<br />
                              ADLB_EXHAUST_TIME:<br />
                              T$<span><font color="#888888"><br />
                                  <br />
                                  <br />
                                  <br />
                                  <br />
                                  -- <br />
                                  Michael Wilde<br />
                                  Mathematics and Computer Science     
                                      Computation Institute<br />
                                  Argonne National Laboratory           
                                     The University of Chicago<br />
                                  <br />
_______________________________________________<br />
                                  ExM-user mailing list<br />
                                  <a moz-do-not-send="true" href="mailto:ExM-user@lists.mcs.anl.gov" target="_blank">ExM-user@lists.mcs.anl.gov</a><br />
                                  <a moz-do-not-send="true" href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br />
                                </font></span></blockquote>
                          </div>
                          <br />
                        </div>
                        <br />
                        <fieldset></fieldset>
                        <br />
                        <pre>_______________________________________________
ExM-user mailing list
<a moz-do-not-send="true" href="mailto:ExM-user@lists.mcs.anl.gov" target="_blank">ExM-user@lists.mcs.anl.gov</a>
<a moz-do-not-send="true" href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a>
</pre>
                      </blockquote>
                      <br />
                      <pre cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
                    </blockquote>
                    <br />
                    <pre cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
                  </div>
                </div>
              </div>
              <br />
              _______________________________________________<br />
              ExM-user mailing list<br />
              <a moz-do-not-send="true" href="mailto:ExM-user@lists.mcs.anl.gov">ExM-user@lists.mcs.anl.gov</a><br />
              <a moz-do-not-send="true" href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br />
              <br />
            </blockquote>
          </div>
          <br />
        </div>
      </blockquote>
      <br />
      <pre class="moz-signature" cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
    </blockquote>
    <br />
    <pre class="moz-signature" cols="72">-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
</pre>
  

<p style="margin-top: 2.5em; margin-bottom: 1em; border-bottom: 1px solid #000"></p><pre class="k9mail"><hr /><br />ExM-user mailing list<br />ExM-user@lists.mcs.anl.gov<br /><a href="https://lists.mcs.anl.gov/mailman/listinfo/exm-user">https://lists.mcs.anl.gov/mailman/listinfo/exm-user</a><br /></pre></blockquote></div><br>
-- <br>
Justin M Wozniak (via phone)</body></html>