[ExM Users] Can not diagnose Swift/T script failure

Michael Wilde wilde at anl.gov
Thu Oct 16 13:21:15 CDT 2014


Tim, Justin, have you done the static build on Titan, and if so, has it 
resolved the problem?

- Mike

On 10/15/14 5:22 PM, Justin M Wozniak wrote:
>
> Bisection showed that loading the Turbine .so from plain Tcl breaks the
> OS.
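>
> A way to reproduce that outside of Swift/T (a sketch; the package name
> and the tclsh on PATH are assumptions based on the generated user
> scripts) is to load the Turbine Tcl package from a bare tclsh under
> aprun:
>
>   cat > load-test.tcl <<'EOF'
>   puts "before load"
>   package require turbine   ;# assumes TCLLIBPATH points at the Turbine lib dir
>   puts "after load"
>   EOF
>   aprun -n 2 tclsh8.6 load-test.tcl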
>
> I had OLCF reset my shell to a plain bash so that I can use modules
> cleanly. I added support for this in the configuration.
>
> I have been in touch with Tim today about doing a static build on the 
> Cray.
>
>
> On October 15, 2014 4:31:50 PM CDT, Michael Wilde <wilde at anl.gov> wrote:
>
>     Status as of Monday (but I know Justin did a few commits since...)
>
>     -------- Forwarded Message --------
>     Subject: 	Re: Latest on the Swift/T Titan problem?
>     Date: 	Mon, 13 Oct 2014 16:11:31 -0500
>     From: 	Justin M Wozniak <wozniak at mcs.anl.gov>
>     To: 	Michael Wilde <wilde at anl.gov>
>
>
>
>     Pure shell: works
>     Pure Tcl: works
>     ADLB batcher: works
>     <------->
>     Swift/T: fails
>
>     Something in Turbine breaks the OS.  I have been bisecting over a
>     stripped-down Turbine to identify exactly what this is.  If I strip down
>     the Turbine link step, I can get it to work.  So I think it is something
>     in our use of Cray modules.  (I try not to use modules/softenv, but it is
>     affecting the link.)
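>
>     A quick way to see what the module environment injects into the link
>     (a sketch; $TURBINE_HOME/lib/*.so is an assumption about where the
>     Turbine shared objects live) is to compare the loaded modules with
>     what the built library actually depends on:
>
>       module list 2>&1
>       for so in "$TURBINE_HOME"/lib/*.so; do
>           echo "== $so"; ldd "$so"
>       done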
>
>     This may fix the Swift/NAMD issue, which would be a complementary
>     positive outcome.
>
>     On 10/13/2014 12:10 PM, Michael Wilde wrote:
>     > One avenue I was wondering about for debugging: what PrgEnv is the
>     > Swift/T on Titan built for?
>     > I'm assuming gnu, but if so, how does that interact with the
>     > PrgEnv-intel needed for the DISCUS app?
>     >
>     > I would think that to handle such cases we need to run Turbine under
>     > the PrgEnv it needs, and then run any app( ) calls that need a
>     > different PrgEnv under a wrapper shell script that provides it, just
>     > for that process.
>     >
>     > Which may in turn raise the issue of whether "module load
>     > PrgEnv-intel", for example, is scalable to hundreds to tens of
>     > thousands of concurrent invocations.
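>     >
>     > A sketch of such a wrapper (assuming Turbine is built under
>     > PrgEnv-gnu; the script name is made up, and the module init path is
>     > the usual Cray location):
>     >
>     >   #!/bin/bash
>     >   # run-under-intel.sh -- hypothetical per-task wrapper
>     >   # Make the module command available in a non-login shell.
>     >   . /opt/modules/default/init/bash 2>/dev/null
>     >   # Swap programming environments for this process tree only.
>     >   module swap PrgEnv-gnu PrgEnv-intel
>     >   exec "$@"
>     >
>     > The app( ) call would then go through run-under-intel.sh discus ...
>     > rather than invoking discus directly.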
>     >
>     > - Mike
>     >
>     > On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>     >>
>     >> Yes, I am testing launching tasks with a plain shell script and a
>     >> plain Tcl script.  I will then move to the ADLB batcher program.
>     >>
>     >> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>     >>> Hi Justin,
>     >>>
>     >>> I did not get to spend any time on this over the weekend.
>     >>>
>     >>> What is your latest assessment of the issue?
>     >>>
>     >>> I did some runs Friday night that were disturbing: two runs of 256
>     >>> sleeps on 128 cores, each yielding a different number of output
>     >>> files (missing ~5 in one case and ~10 in the other).
>     >>>
>     >>> I was going to do a test of a simple MPI app forking tasks with
>     >>> system( ), to see whether the same failure pattern appears as the
>     >>> CPU count increases from 128 to 1024.
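>     >>>
>     >>> A shell-level stand-in for that test (not the MPI/system( ) version;
>     >>> ALPS_APP_PE is assumed to give the PE rank under aprun, and the
>     >>> output directory name is arbitrary):
>     >>>
>     >>>   mkdir -p outs
>     >>>   aprun -n 128 -N 16 bash -c 'sleep $((RANDOM % 10)); touch outs/out.$ALPS_APP_PE'
>     >>>   ls outs | wc -l    # expect 128 if nothing is lost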
>     >>>
>     ...
>
>     -- 
>     Justin M Wozniak
>
>
>     On 10/15/14 4:29 PM, Michael Wilde wrote:
>>     Justin, can you report on your latest findings on this?
>>
>>     (I will fwd Justin's latest notes from Monday...)
>>
>>     - Mike
>>
>>     On 10/15/14 4:27 PM, Tim Armstrong wrote:
>>>     Did you reach a resolution on this?  I'm back in town now and
>>>     could look at adding the ability to retrieve the exit code.
>>>
>>>     - Tim
>>>
>>>     On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>>>
>>>         Also: identical code seems to run fine at this setting:
>>>           PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl    # (4 nodes)
>>>         but fails at this setting:
>>>           PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl   # (8 nodes)
>>>
>>>         Changing the app that calls DISCUS directly back to a shell
>>>         wrapper around DISCUS shows that when the job fails, the app
>>>         (shell wrapper) is indeed executed and logs a message to the
>>>         app stdout file, which I can see. But I don't see a message
>>>         after DISCUS.
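>>>
>>>         The wrapper is roughly of this shape (an illustrative sketch,
>>>         not the exact script):
>>>
>>>           #!/bin/bash
>>>           # Markers before and after the DISCUS call.
>>>           echo "wrapper: starting: $*"
>>>           "$@"
>>>           echo "wrapper: finished, status $?"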
>>>
>>>         To test whether many concurrent DISCUS invocations (127)
>>>         were causing a problem fatal to the whole job, I added a
>>>         random sleep before DISCUS.  This had no effect on the
>>>         problem.
>>>
>>>         Next I eliminated the actual DISCUS call and just did a
>>>         random sleep of 0 to 9 seconds. Here I discovered that all
>>>         the jobs with sleep > 0 failed in the same manner as with
>>>         DISCUS, while the sleep-0 cases all printed all their echoes
>>>         and exited.
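>>>
>>>         (The sleep-only stand-in amounts to replacing the DISCUS command
>>>         with something like the following; illustrative, not the exact
>>>         line used:)
>>>
>>>           bash -c 'echo start; sleep $((RANDOM % 10)); echo done'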
>>>
>>>         So something more fundamental is going wrong, having
>>>         nothing to do with the DISCUS app. I'll try to narrow this
>>>         down, and see whether it also happens on non-Cray systems
>>>         like Midway.
>>>
>>>         - Mike
>>>
>>>
>>>         On 10/2/14 10:52 PM, Michael Wilde wrote:
>>>>         Thanks, Tim. But stderr and stdout were both redirected to
>>>>         files, and all of the .err and .out files are empty.
>>>>
>>>>         I'll double-check that these redirects are working, but in
>>>>         prior runs, I did indeed get app error and output in those
>>>>         files.
>>>>
>>>>         It looks more to me like the app is failing to launch.
>>>>
>>>>         - Mike
>>>>
>>>>         On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>>>>         The app stderr/stdout should go to Swift/T stderr/stdout
>>>>>         unless redirected.  The problem is most likely discus
>>>>>         returning a non-zero error code.
>>>>>
>>>>>         - Tim
>>>>>
>>>>>         On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>>>
>>>>>
>>>>>             I'm getting a failure from a Swift/T run on Titan that
>>>>>             I can't diagnose.
>>>>>
>>>>>             The script is running 16K DISCUS simulations, about 30
>>>>>             seconds each, 16 per node, on 8 nodes: 127 workers, 1
>>>>>             server. It seems to run well at smaller scale (e.g., 1K
>>>>>             simulations).
>>>>>
>>>>>             My PBS output file shows that some of the initial
>>>>>             round of simulations fail.  It seems that when this
>>>>>             happens, the Swift script exits within the first round
>>>>>             of simulations, and none of the rest even seem to start.
>>>>>
>>>>>             Some of the logs are below.  I'll continue to
>>>>>             experiment and add more logging to try to isolate the
>>>>>             cause.
>>>>>
>>>>>             I was initially suspicious of an OOM problem, but I
>>>>>             don't see any sign of that.
>>>>>
>>>>>             In the line below "Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>             ../Si.cll 0.004_0.264.inte 0.004_0.264.stru" - is
>>>>>             there a way to tell what Swift thinks the exit status
>>>>>             of these failing apps is?  What is causing the whole
>>>>>             run to abort?
>>>>>
>>>>>             It almost looks like one app, or a small number of
>>>>>             them, are generating a SIGABRT, but that's not clear,
>>>>>             and I have no stdout/err info from the app processes.
>>>>>
>>>>>             I'll look for some more detailed debug options, and/or
>>>>>             add some wrappers around the apps.
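>>>>>
>>>>>             One form such a wrapper could take (a sketch; the file
>>>>>             naming is arbitrary, and ALPS_APP_PE is assumed to hold
>>>>>             the PE rank under aprun) is to record each invocation's
>>>>>             exit status and stderr in its own file:
>>>>>
>>>>>               #!/bin/bash
>>>>>               # wrap-discus.sh -- hypothetical per-task wrapper
>>>>>               log="discus.${ALPS_APP_PE:-0}.$$"
>>>>>               "$@" > "$log.out" 2> "$log.err"
>>>>>               rc=$?
>>>>>               echo "exit=$rc cmd=$*" > "$log.rc"
>>>>>               exit $rc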
>>>>>
>>>>>             Thanks,
>>>>>
>>>>>             - Mike
>>>>>
>>>>>             ---
>>>>>
>>>>>             T$ cat output.txt.2115044.out
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>             ../Si.cll 0.004_0.264.inte 0.004_0.264.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.272 1 0.75
>>>>>             ../Si.cll 0.004_0.272.inte 0.004_0.272.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.26
>>>>>             0.052000000000000005 1 0.75 ../Si.cll 0.260_0.052.inte
>>>>>             0.260_0.052.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.084 1 0.75
>>>>>             ../Si.cll 0.004_0.084.inte 0.004_0.084.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.26 0.024 1 0.75
>>>>>             ../Si.cll 0.260_0.024.inte 0.260_0.024.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.056 1 0.75
>>>>>             ../Si.cll 0.004_0.056.inte 0.004_0.056.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004 0.364 1 0.75
>>>>>             ../Si.cll 0.004_0.364.inte 0.004_0.364.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.004
>>>>>             0.07200000000000001 1 0.75 ../Si.cll 0.004_0.072.inte
>>>>>             0.004_0.072.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>
>>>>>             Swift: external command failed:
>>>>>             /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>             --macro ../stacking-faults3.mac 0.26 0.044 1 0.75
>>>>>             ../Si.cll 0.260_0.044.inte 0.260_0.044.stru
>>>>>
>>>>>             Swift: killing MPI job...
>>>>>             ADLB_Abort(1)
>>>>>             MPI_Abort(1)
>>>>>             Application 7706378 exit codes: 134
>>>>>             Application 7706378 exit signals: Killed
>>>>>             Application 7706378 resources: utime ~5s, stime ~19s,
>>>>>             Rss ~11016, inblocks ~48986, outblocks ~205
>>>>>             T$
>>>>>
>>>>>             ---
>>>>>
>>>>>             T$ cat output.txt
>>>>>             bash_profile: loading modules
>>>>>             DB /ccs/home/wildemj/.modules: loading modules...
>>>>>             Turbine: turbine-aprun.sh
>>>>>             10/02/2014 09:15PM
>>>>>
>>>>>             TURBINE_HOME:
>>>>>             /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>>>>             SCRIPT:
>>>>>             /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>>>>             PROCS:        128
>>>>>             NODES:        8
>>>>>             PPN:          16
>>>>>             WALLTIME:     00:15:00
>>>>>
>>>>>             TURBINE_WORKERS: 127
>>>>>             ADLB_SERVERS:    1
>>>>>
>>>>>             TCLSH:
>>>>>             /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>>>
>>>>>             JOB OUTPUT:
>>>>>
>>>>>             Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 2
>>>>>             Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 4
>>>>>             Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 6
>>>>>             Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 3
>>>>>             Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 5
>>>>>             Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 1
>>>>>             Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0]
>>>>>             application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>             process 7
>>>>>             _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu
>>>>>             Oct  2 21:15:32 2014] PE RANK 5 exit signal Aborted
>>>>>             _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu
>>>>>             Oct  2 21:15:32 2014] PE RANK 3 exit signal Aborted
>>>>>             _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu
>>>>>             Oct  2 21:15:32 2014] PE RANK 4 exit signal Aborted
>>>>>             _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu
>>>>>             Oct  2 21:15:32 2014] PE RANK 2 exit signal Aborted
>>>>>             [NID 04675] 2014-10-02 21:15:32 Apid 7706378:
>>>>>             initiated application termination
>>>>>             _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu
>>>>>             Oct  2 21:15:32 2014] PE RANK 6 exit signal Aborted
>>>>>             T$
>>>>>             T$
>>>>>             T$
>>>>>             T$ cat turbine.log
>>>>>             JOB:               2115044
>>>>>             COMMAND:  stacking-faults3.tcl
>>>>>             HOSTNAME: ccs.ornl.gov
>>>>>             SUBMITTED:         10/02/2014 09:14PM
>>>>>             PROCS:             128
>>>>>             PPN:               16
>>>>>             NODES:             8
>>>>>             TURBINE_WORKERS:
>>>>>             ADLB_SERVERS:
>>>>>             WALLTIME:          00:15:00
>>>>>             ADLB_EXHAUST_TIME:
>>>>>             T$
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
>
> -- 
> Justin M Wozniak (via phone) 

-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago


