[ExM Users] Can not diagnose Swift/T script failure

Tim Armstrong tim.g.armstrong at gmail.com
Wed Oct 15 17:04:03 CDT 2014


In any case, I added some additional error info about the exit code, etc.

 Tim

On Wed, Oct 15, 2014 at 4:31 PM, Michael Wilde <wilde at anl.gov> wrote:

>  Status as of Monday (but I know Justin did a few commits since...)
>
> -------- Forwarded Message --------
> Subject: Re: Latest on the Swift/T Titan problem?
> Date: Mon, 13 Oct 2014 16:11:31 -0500
> From: Justin M Wozniak <wozniak at mcs.anl.gov>
> To: Michael Wilde <wilde at anl.gov>
>
> Pure shell: works
> Pure Tcl: works
> ADLB batcher: works
> <------->
> Swift/T: fails
>
> Something in Turbine breaks the OS.  I have been bisecting over a
> stripped-down Turbine to identify exactly what this is.  If I strip down
> the Turbine link step, I can get it to work.  So I think it is something
> in our use of Cray modules.  (I try not to use modules/softenv, but it is
> affecting the link.)
>
> This may fix the Swift/NAMD issue, which would be a complementary
> positive outcome.
>
> On 10/13/2014 12:10 PM, Michael Wilde wrote:
> > One avenue I was wondering about for debugging:  What PrgEnv is the
> > Swift/T on Titan built for?
> > I'm assuming gnu, but if so, how does that interact with the
> > PrgEnv-intel needed for the DISCUS app?
> >
> > I would think that to handle such cases we need to run Turbine under the
> > PrgEnv it needs, and then run any app( ) calls that need a different
> > PrgEnv under a wrapper shell script that provides it, just for that
> > process.
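> >
> > (A minimal sketch of such a wrapper, assuming the app is a DISCUS binary
> > built against PrgEnv-intel; the script name, module names, and binary
> > path are all illustrative:)
> >
> >   #!/bin/bash
> >   # run-discus-intel.sh: give just this one app process the Intel PrgEnv
> >   source $MODULESHOME/init/bash          # may be needed so 'module' works in a script
> >   module swap PrgEnv-gnu PrgEnv-intel    # swap environments for this shell only
> >   exec /path/to/discus "$@"              # illustrative path; exec replaces the shell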
> >
> > Which may in turn raise the issue of whether "module load
> > PrgEnv-intel", for example, is scalable to hundreds to tens of
> > thousands of concurrent invocations.
> >
> > - Mike
> >
> > On 10/13/14 11:38 AM, Justin M Wozniak wrote:
> >>
> >> Yes, I am testing launching tasks with a plain shell script and a
> >> plain Tcl script.  I will then move to the ADLB batcher program.
> >>
> >> On 10/13/2014 11:34 AM, Michael Wilde wrote:
> >>> Hi Justin,
> >>>
> >>> I did not get to spend any time on this over the weekend.
> >>>
> >>> What is your latest assessment of the issue?
> >>>
> >>> I did some runs Fri night that were disturbing: two runs of 256
> >>> sleeps on 128 cores, each yielding a different number of output
> >>> files (but missing ~ 5 in one case and ~ 10 in the other).
> >>>
> >>> I was going to do a test of a simple MPI app forking tasks with
> >>> system( ), to see whether the same failure pattern is seen as the
> >>> number of CPUs increases from 128 to 1024.
> >>>
> ...
>
> --
> Justin M Wozniak
>
>
>
>  On 10/15/14 4:29 PM, Michael Wilde wrote:
>
> Justin, can you report on your latest findings on this?
>
> (I will fwd Justin's latest notes from Monday...)
>
> - Mike
>
> On 10/15/14 4:27 PM, Tim Armstrong wrote:
>
> Did you reach a resolution on this?  I'm back in town now and could look
> at adding the ability to retrieve the exit code.
>
> - Tim
>
> On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>
>>  Also: identical code seems to run fine at this setting:
>>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl  # (4 nodes)
>> but fails at this setting:
>>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl # (8 nodes)
>>
>> Changing the app that calls DISCUS directly back to a shell wrapper
>> around DISCUS shows that when the job fails, the app (shell wrapper) is
>> indeed executed and logs a message to the app stdout file, which I see. But
>> I don't see a message after the DISCUS call.
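>>
>> (Roughly, the wrapper looks like the sketch below; the script name and
>> DISCUS path are illustrative:)
>>
>>   #!/bin/bash
>>   # discus-wrapper.sh: bracket the DISCUS call with log messages
>>   echo "wrapper: before discus $*"          # this shows up in the app stdout file
>>   /path/to/discus "$@"                      # the real DISCUS invocation goes here
>>   echo "wrapper: after discus, status $?"   # this is the line that never appears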
>>
>> To test if perhaps many concurrent DISCUS invocations (127) are causing a
>> problem which is fatal to the whole job, I added a random sleep before
>> DISCUS.  This had no effect on the problem.
>>
>> Next I eliminated the actual DISCUS call, and just did a random sleep
>> from 0 to 9 seconds. Here I discovered that all the jobs with sleep > 0
>> failed in the same manner as with DISCUS, while the sleep 0 cases all
>> printed all their echoes and exited.
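>>
>> (The sleep-only variant of the wrapper, roughly; the script name is
>> illustrative:)
>>
>>   #!/bin/bash
>>   # sleep-wrapper.sh: DISCUS replaced by a random 0-9 second sleep
>>   echo "wrapper: before sleep"
>>   sleep $(( RANDOM % 10 ))     # 0-9 seconds at random
>>   echo "wrapper: after sleep"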
>>
>> So there is something more deeply wrong going on, having nothing to do
>> with the DISCUS app. I'll try to narrow this down, and see if perhaps it's
>> happening on other non-Cray systems like Midway.
>>
>> - Mike
>>
>>
>> On 10/2/14 10:52 PM, Michael Wilde wrote:
>>
>> Thanks, Tim. But stderr and stdout were both redirected to files, and
>> all .err and .out files are empty.
>>
>> I'll double-check that these redirects are working, but in prior runs, I
>> did indeed get app error and output in those files.
>>
>> It looks more to me like the app is failing to launch.
>>
>> - Mike
>>
>> On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>
>>  The app stderr/stdout should go to Swift/T stderr/stdout unless
>> redirected.  The problem is most likely discus returning a non-zero error
>> code.
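>>
>>  (One quick way to check that, as a rough sketch: use a hypothetical
>> wrapper script as the app command and have it record the exit status
>> itself:)
>>
>>   #!/bin/bash
>>   # run-and-log.sh: run the real command and log its exit code
>>   "$@" > app.out 2> app.err            # capture the app's own stdout/stderr
>>   rc=$?
>>   echo "exit code: $rc" >> app.err     # make the failing code visible
>>   exit $rc                             # propagate the code back to Swift/T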
>>
>>  - Tim
>>
>> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>
>>>
>>> I'm getting a failure from a Swift/T run on Titan that I can't diagnose.
>>>
>>> The script is running 16K DISCUS simulations, about 30 seconds each, 16 per
>>> node, on 8 nodes: 127 workers, 1 server. It seems to run well at smaller
>>> scale (e.g. 1K simulations).
>>>
>>> My PBS output file shows that some of the initial round of simulations
>>> fail.  It seems that when this happens, the Swift script exits within the
>>> first round of simulations, and none of the rest seem to even start.
>>>
>>> Some of the logs are below.  I'll continue to experiment and add more
>>> logging to try to isolate the cause.
>>>
>>> I was initially suspicious of an OOM problem, but I don't see any sign of
>>> that.
>>>
>>> In the line below, "Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>>> 0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what Swift
>>> thinks the exit status of these failing apps is?  What is causing the
>>> whole run to abort?
>>>
>>> It almost looks like one app, or a small number of them, is generating
>>> a SIGABRT, but that's not clear, and I have no stdout/err info from the app
>>> processes.
>>>
>>> I'll look for some more detailed debug options, and/or add some wrappers
>>> around the apps.
>>>
>>> Thanks,
>>>
>>> - Mike
>>>
>>> ---
>>>
>>> T$ cat output.txt.2115044.out
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>>> 0.004_0.264.inte 0.004_0.264.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
>>> 0.004_0.272.inte 0.004_0.272.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75 ../Si.cll
>>> 0.260_0.052.inte 0.260_0.052.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
>>> 0.004_0.084.inte 0.004_0.084.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>>> 0.260_0.024.inte 0.260_0.024.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
>>> 0.004_0.056.inte 0.004_0.056.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
>>> 0.004_0.364.inte 0.004_0.364.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75 ../Si.cll
>>> 0.004_0.072.inte 0.004_0.072.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>> Swift: external command failed:
>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>> --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>>> 0.260_0.044.inte 0.260_0.044.stru
>>>
>>> Swift: killing MPI job...
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>> Application 7706378 exit codes: 134
>>> Application 7706378 exit signals: Killed
>>> Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016,
>>> inblocks ~48986, outblocks ~205
>>> T$
>>>
>>> ---
>>>
>>> T$ cat output.txt
>>> bash_profile: loading modules
>>> DB /ccs/home/wildemj/.modules: loading modules...
>>> Turbine: turbine-aprun.sh
>>> 10/02/2014 09:15PM
>>>
>>> TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>> SCRIPT:
>>> /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>> PROCS:        128
>>> NODES:        8
>>> PPN:          16
>>> WALLTIME:     00:15:00
>>>
>>> TURBINE_WORKERS: 127
>>> ADLB_SERVERS:    1
>>>
>>> TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>
>>> JOB OUTPUT:
>>>
>>> Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>>> Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>> Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>>> Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>>> Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>> Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>> Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called
>>> MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>>> _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2 21:15:32
>>> 2014] PE RANK 5 exit signal Aborted
>>> _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2 21:15:32
>>> 2014] PE RANK 3 exit signal Aborted
>>> _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2 21:15:32
>>> 2014] PE RANK 4 exit signal Aborted
>>> _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2 21:15:32
>>> 2014] PE RANK 2 exit signal Aborted
>>> [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application
>>> termination
>>> _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2 21:15:32
>>> 2014] PE RANK 6 exit signal Aborted
>>> T$
>>> T$
>>> T$
>>> T$ cat turbine.log
>>> JOB:               2115044
>>> COMMAND:           stacking-faults3.tcl
>>> HOSTNAME:          ccs.ornl.gov
>>> SUBMITTED:         10/02/2014 09:14PM
>>> PROCS:             128
>>> PPN:               16
>>> NODES:             8
>>> TURBINE_WORKERS:
>>> ADLB_SERVERS:
>>> WALLTIME:          00:15:00
>>> ADLB_EXHAUST_TIME:
>>> T$
>>>
>>>
>>>
>>>
>>> --
>>> Michael Wilde
>>> Mathematics and Computer Science          Computation Institute
>>> Argonne National Laboratory               The University of Chicago
>>>
>>> _______________________________________________
>>> ExM-user mailing list
>>> ExM-user at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>>>
>>
>>
>>
>>
>>
>> --
>> Michael Wilde
>> Mathematics and Computer Science          Computation Institute
>> Argonne National Laboratory               The University of Chicago
>>
>>
>>
>
> --
> Michael Wilde
> Mathematics and Computer Science          Computation Institute
> Argonne National Laboratory               The University of Chicago
>
>

