[ExM Users] Can not diagnose Swift/T script failure

Wed Oct 15 17:22:10 CDT 2014

Bisection showed that loading the Turbine .so (shared library) from plain Tcl breaks the OS.
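
For anyone who wants to repeat this kind of check, here is a rough sketch of a probe, not the exact test I ran. It generates a tiny Tcl script that loads a shared library and then tries to launch a child process. The function names and the library path are placeholders, and calling Tcl's load with only a file name assumes Tcl can guess the package prefix from that name.

```shell
# Print a small Tcl probe script to stdout.
# $1: path to the shared library to load (placeholder; supply your own).
make_load_probe() {
    cat <<EOF
puts "before load"
load {$1}
puts "after load"
puts [exec /bin/echo child-ok]
EOF
}

# Feed the probe to tclsh on stdin, if a tclsh is on PATH.
# Returns 2 when no tclsh is available so callers can tell "skipped"
# apart from "probe failed".
run_load_probe() {
    if ! command -v tclsh >/dev/null 2>&1; then
        echo "tclsh not found; skipping probe" >&2
        return 2
    fi
    make_load_probe "$1" | tclsh
}
```

If "after load" prints but "child-ok" does not, the load itself succeeded and the trouble is in launching processes afterwards.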

I had OLCF reset my shell to a blank bash so I can make pure use of modules.  I added support for this in the configuration.
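
A blank shell can also be approximated from an existing login without waiting on a reset. This is only a sketch under my own assumptions: the function name is made up, and passing through just HOME and a minimal PATH is a choice, not what the Turbine configuration does.

```shell
# Run a command string in a bash that inherits almost nothing from the
# caller: no profile, no rc file, and only HOME plus a minimal PATH.
# $1: command string to run in the clean shell.
run_in_blank_bash() {
    env -i HOME="$HOME" PATH="/usr/bin:/bin" \
        bash --noprofile --norc -c "$1"
}
```

Inside such a shell the module command is not defined until the site's module init script is sourced; where that script lives is site-specific.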

I have been in touch with Tim today about doing a static build on the Cray.
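
On Mike's question in the quoted thread about apps that need a different PrgEnv than Turbine: the wrapper idea could look roughly like the sketch below. The function name is hypothetical, it assumes the module command is already defined in the calling shell, and it does nothing about the scalability concern of many concurrent module invocations.

```shell
# Run one app invocation under a different programming environment.
# $1: PrgEnv module to swap out, $2: PrgEnv module to swap in,
# remaining arguments: the command to run.
# Everything happens in a subshell, so the caller's environment
# (the one Turbine runs under) is left untouched.
run_with_prgenv() {
    (
        from_env="$1"
        to_env="$2"
        shift 2
        module swap "$from_env" "$to_env" || exit 1
        exec "$@"
    )
}
```

A call would look like: run_with_prgenv PrgEnv-gnu PrgEnv-intel ./discus --macro ../stacking-faults3.mac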

On October 15, 2014 4:31:50 PM CDT, Michael Wilde <wilde at anl.gov> wrote:
>Status as of Monday (but I know Justin did a few commits since...)
>
>-------- Forwarded Message --------
>Subject: 	Re: Latest on the Swift/T Titan problem?
>Date: 	Mon, 13 Oct 2014 16:11:31 -0500
>From: 	Justin M Wozniak <wozniak at mcs.anl.gov>
>To: 	Michael Wilde <wilde at anl.gov>
>
>
>
>Pure shell: works
>Pure Tcl: works
>ADLB batcher: works
><------->
>Swift/T: fails
>
>Something in Turbine breaks the OS.  I have been bisecting over a
>stripped-down Turbine to identify exactly what this is.  If I strip down
>the Turbine link step, I can get it to work.  So I think it is something
>in our use of Cray modules.  (I try to not use modules/softenv but it is
>affecting the link.)
>
>This may fix the Swift/NAMD issue, which would be a complementary
>positive outcome.
>
>On 10/13/2014 12:10 PM, Michael Wilde wrote:
>> One avenue I was wondering about for debugging:  What PrgEnv is the
>> Swift/T on Titan built for?
>> I'm assuming gnu, but if so, how does that interact with the
>> PrgEnv-intel needed for the DISCUS app?
>>
>> I would think to handle such cases we need to run Turbine under the
>> PrgEnv it needs, and then run any app( ) calls that need a different
>> PrgEnv under a wrapper shell script that provides it, just for that
>> process.
>>
>> Which may in turn raise the issue of whether "module load
>> PrgEnv-intel" for example is scalable to hundreds to tens of
>> thousands of concurrent invocations.
>>
>> - Mike
>>
>> On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>>>
>>> Yes, I am testing launching tasks with a plain shell script and a
>>> plain Tcl script.  I will then move to the ADLB batcher program.
>>>
>>> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>>>> Hi Justin,
>>>>
>>>> I did not get to spend any time on this over the weekend.
>>>>
>>>> What is your latest assessment of the issue?
>>>>
>>>> I did some runs Fri night that were disturbing: two runs of 256
>>>> sleeps on 128 cores, each yielding a different number of output
>>>> files (but missing ~ 5 in one case and ~ 10 in the other).
>>>>
>>>> I was going to do a test of a simple MPI app forking tasks with
>>>> system( ), to see if the same failure pattern is seen or not, as
>>>> #CPUs increases from 128 to 1024.
>>>>
>...
>
>-- 
>Justin M Wozniak
>
>
>On 10/15/14 4:29 PM, Michael Wilde wrote:
>> Justin, can you report on your latest findings on this?
>>
>> (I will fwd Justin's latest notes from Monday...)
>>
>> - Mike
>>
>> On 10/15/14 4:27 PM, Tim Armstrong wrote:
>>> Did you reach a resolution on this?  I'm back in town now and could 
>>> look at adding the ability to retrieve the exit code.
>>>
>>> - Tim
>>>
>>> On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>>>
>>>     Also: identical code seems to run fine at this setting:
>>>       PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl  # (4 nodes)
>>>     but fails at this setting:
>>>       PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl # (8 nodes)
>>>
>>>     Changing the app that calls DISCUS directly back to a shell
>>>     wrapper around DISCUS shows that when the job fails the app
>>>     (shell wrapper) is indeed executed, and logs a message to the app
>>>     stdout file which I see. But I don't see a message after DISCUS.
>>>
>>>     To test if perhaps many concurrent DISCUS invocations (127) are
>>>     causing a problem which is fatal to the whole job, I added a
>>>     random sleep before DISCUS.  This had no effect on the problem.
>>>
>>>     Next I eliminated the actual DISCUS call, and just did a random
>>>     sleep from 0 to 9 seconds. Here I discovered that all the jobs
>>>     with sleep > 0 failed in the same manner as with DISCUS, while
>>>     the sleep 0 cases all printed all their echoes and exited.
>>>
>>>     So there is something more deeply wrong going on, having nothing
>>>     to do with the DISCUS app. I'll try to narrow this down, and see
>>>     if perhaps it's happening on other non-Cray systems like Midway.
>>>
>>>     - Mike
>>>
>>>
>>>     On 10/2/14 10:52 PM, Michael Wilde wrote:
>>>>     Thanks, Tim. But, stderr and stdout were both redirected to
>>>>     files, and all .err and .out files are empty.
>>>>
>>>>     I'll double-check that these redirects are working, but in prior
>>>>     runs, I did indeed get app error and output in those files.
>>>>
>>>>     It looks more to me like the app is failing to launch.
>>>>
>>>>     - Mike
>>>>
>>>>     On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>>>>     The app stderr/stdout should go to Swift/T stderr/stdout unless
>>>>>     redirected. The problem is most likely discus returning a
>>>>>     non-zero error code.
>>>>>
>>>>>     - Tim
>>>>>
>>>>>     On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>>>
>>>>>
>>>>>         I'm getting a failure from a Swift/T run on Titan that I
>>>>>         can't diagnose.
>>>>>
>>>>>         The script is running 16K DISCUS simulations, about 30 secs
>>>>>         each, 16 per node, on 8 nodes. 127 workers, 1 Server. It
>>>>>         seems to run well at smaller scale (e.g. 1K simulations).
>>>>>
>>>>>         My PBS output file shows that some of the initial round of
>>>>>         simulations fail.  It seems that when this happens, the
>>>>>         Swift script exits within the first round of simulations,
>>>>>         and none seem to even start.
>>>>>
>>>>>         Some of the logs are below.  I'll continue to experiment
>>>>>         and add more logging to try to isolate the cause.
>>>>>
>>>>>         I was initially suspicious of an OOM problem, but I don't
>>>>>         see any sign of that.
>>>>>
>>>>>         In the line below "Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>         ../Si.cll 0.004_0.264.inte 0.004_0.264.stru" - is there a
>>>>>         way to tell what Swift thinks the exit status of these
>>>>>         failing apps is?  What is causing the whole run to abort?
>>>>>
>>>>>         It almost looks like one app, or a small number of them,
>>>>>         are generating a SIGABRT, but that's not clear, and I have
>>>>>         no stdout/err info from the app processes.
>>>>>
>>>>>         I'll look for some more detailed debug options, and/or add
>>>>>         some wrappers around the apps.
>>>>>
>>>>>         Thanks,
>>>>>
>>>>>         - Mike
>>>>>
>>>>>         ---
>>>>>
>>>>>         T$ cat output.txt.2115044.out
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>         ../Si.cll 0.004_0.264.inte 0.004_0.264.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.272 1 0.75
>>>>>         ../Si.cll 0.004_0.272.inte 0.004_0.272.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1
>>>>>         0.75 ../Si.cll 0.260_0.052.inte 0.260_0.052.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.084 1 0.75
>>>>>         ../Si.cll 0.004_0.084.inte 0.004_0.084.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>>>>>         0.260_0.024.inte 0.260_0.024.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.056 1 0.75
>>>>>         ../Si.cll 0.004_0.056.inte 0.004_0.056.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.364 1 0.75
>>>>>         ../Si.cll 0.004_0.364.inte 0.004_0.364.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1
>>>>>         0.75 ../Si.cll 0.004_0.072.inte 0.004_0.072.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>
>>>>>         Swift: external command failed:
>>>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>         --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>>>>>         0.260_0.044.inte 0.260_0.044.stru
>>>>>
>>>>>         Swift: killing MPI job...
>>>>>         ADLB_Abort(1)
>>>>>         MPI_Abort(1)
>>>>>         Application 7706378 exit codes: 134
>>>>>         Application 7706378 exit signals: Killed
>>>>>         Application 7706378 resources: utime ~5s, stime ~19s, Rss
>>>>>         ~11016, inblocks ~48986, outblocks ~205
>>>>>         T$
>>>>>
>>>>>         ---
>>>>>
>>>>>         T$ cat output.txt
>>>>>         bash_profile: loading modules
>>>>>         DB /ccs/home/wildemj/.modules: loading modules...
>>>>>         Turbine: turbine-aprun.sh
>>>>>         10/02/2014 09:15PM
>>>>>
>>>>>         TURBINE_HOME:
>>>>>         /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>>>>         SCRIPT:
>>>>>         /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>>>>         PROCS:        128
>>>>>         NODES:        8
>>>>>         PPN:          16
>>>>>         WALLTIME:     00:15:00
>>>>>
>>>>>         TURBINE_WORKERS: 127
>>>>>         ADLB_SERVERS:    1
>>>>>
>>>>>         TCLSH:
>>>>>         /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>>>
>>>>>         JOB OUTPUT:
>>>>>
>>>>>         Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>>>>>         Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>>>>         Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>>>>>         Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>>>>>         Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>>>>         Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>>>         Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>>>>>         _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2 21:15:32 2014] PE RANK 5 exit signal Aborted
>>>>>         _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2 21:15:32 2014] PE RANK 3 exit signal Aborted
>>>>>         _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE RANK 4 exit signal Aborted
>>>>>         _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2 21:15:32 2014] PE RANK 2 exit signal Aborted
>>>>>         [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application termination
>>>>>         _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2 21:15:32 2014] PE RANK 6 exit signal Aborted
>>>>>         T$
>>>>>         T$
>>>>>         T$
>>>>>         T$ cat turbine.log
>>>>>         JOB:               2115044
>>>>>         COMMAND:           stacking-faults3.tcl
>>>>>         HOSTNAME:          ccs.ornl.gov
>>>>>         SUBMITTED:         10/02/2014 09:14PM
>>>>>         PROCS:             128
>>>>>         PPN:               16
>>>>>         NODES:             8
>>>>>         TURBINE_WORKERS:
>>>>>         ADLB_SERVERS:
>>>>>         WALLTIME:          00:15:00
>>>>>         ADLB_EXHAUST_TIME:
>>>>>         T$
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>         -- 
>>>>>         Michael Wilde
>>>>>         Mathematics and Computer Science     Computation Institute
>>>>>         Argonne National Laboratory    The University of Chicago
>>>>>
>>>>>         _______________________________________________
>>>>>         ExM-user mailing list
>>>>>         ExM-user at lists.mcs.anl.gov
>>>>>         https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>     -- 
>>>>     Michael Wilde
>>>>     Mathematics and Computer Science          Computation Institute
>>>>     Argonne National Laboratory               The University of Chicago
>>>
>>>     -- 
>>>     Michael Wilde
>>>     Mathematics and Computer Science          Computation Institute
>>>     Argonne National Laboratory               The University of Chicago
>>>
>>>
>>>
>>>
>>
>> -- 
>> Michael Wilde
>> Mathematics and Computer Science          Computation Institute
>> Argonne National Laboratory               The University of Chicago
>
>-- 
>Michael Wilde
>Mathematics and Computer Science          Computation Institute
>Argonne National Laboratory               The University of Chicago
>
>
>
>------------------------------------------------------------------------
>

-- 
Justin M Wozniak (via phone)