[ExM Users] Can not diagnose Swift/T script failure
Justin M Wozniak
wozniak at mcs.anl.gov
Thu Oct 16 13:44:14 CDT 2014
I'm working on it. I'm having trouble getting the modules-based cc
wrapper to compile simple things. I'm reading posts like this one from
people with similar issues:
http://public.kitware.com/pipermail/paraview/2013-June/028679.html
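
A minimal sanity check for the cc wrapper looks something like this (a
sketch; the module names are assumptions about the environment):

    module load PrgEnv-gnu    # or: module swap PrgEnv-<current> PrgEnv-gnu
    printf '#include <stdio.h>\nint main(void){puts("hello from cc");return 0;}\n' > hello.c
    cc -o hello hello.c && aprun -n 1 ./hello
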
On 10/16/2014 01:21 PM, Michael Wilde wrote:
> Tim, Justin, have you done the static build on Titan, and if so, has
> it resolved the problem?
>
> - Mike
>
> On 10/15/14 5:22 PM, Justin M Wozniak wrote:
>>
>> Bisection showed that loading the Turbine .so (shared library) from plain
>> Tcl breaks the OS.
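>>
>> Roughly the kind of test involved (a sketch; the library name, tclsh
>> path, and aprun arguments are placeholders, not our exact setup):
>>
>>     printf 'load ./libtclturbine.so\nputs "Turbine .so loaded OK"\n' > load-test.tcl
>>     aprun -n 2 /path/to/tclsh8.6 load-test.tcl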
>>
>> I had OLCF reset my shell to a blank bash so I can make pure use of
>> modules. I added support for this in the configuration.
>>
>> I have been in touch with Tim today about doing a static build on the
>> Cray.
>>
>>
>> On October 15, 2014 4:31:50 PM CDT, Michael Wilde <wilde at anl.gov> wrote:
>>
>> Status as of Monday (but I know Justin did a few commits since...)
>>
>> -------- Forwarded Message --------
>> Subject: Re: Latest on the Swift/T Titan problem?
>> Date: Mon, 13 Oct 2014 16:11:31 -0500
>> From: Justin M Wozniak <wozniak at mcs.anl.gov>
>> To: Michael Wilde <wilde at anl.gov>
>>
>>
>>
>> Pure shell: works
>> Pure Tcl: works
>> ADLB batcher: works
>> <------->
>> Swift/T: fails
>>
>> Something in Turbine breaks the OS. I have been bisecting over a
>> stripped-down Turbine to identify exactly what this is. If I strip down
>> the Turbine link step, I can get it to work. So I think it is something
>> in our use of Cray modules. (I try not to use modules/softenv, but it is
>> affecting the link.)
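>>
>> One way to compare the two builds (a sketch; the output names are
>> placeholders, and -craype-verbose is the CrayPE driver flag that prints
>> the full command the cc wrapper actually runs):
>>
>>     module list 2> modules-at-link.txt    # record which modules are loaded
>>     cc -craype-verbose -shared -o libtclturbine.so *.o 2>&1 | tee link-verbose.log
>>     ldd libtclturbine.so > deps.txt       # diff against the stripped-down build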
>>
>> This may fix the Swift/NAMD issue, which would be a complementary
>> positive outcome.
>>
>> On 10/13/2014 12:10 PM, Michael Wilde wrote:
>> > One avenue I was wondering about for debugging: What PrgEnv is the
>> > Swift/T on Titan built for?
>> > I'm assuming gnu, but if so, how does that interact with the
>> > PrgEnv-intel needed for the DISCUS app?
>> >
>> > I would think to handle such cases we need to run Turbine under the
>> > PrgEnv it needs, and then run any app() calls that need a different
>> > PrgEnv under a wrapper shell script that provides it, just for that
>> > process.
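>> >
>> > Something like this per-task wrapper is what I have in mind (a minimal
>> > sketch; the modules init path and module names are assumptions):
>> >
>> >     #!/bin/bash
>> >     # Give one app() invocation its own PrgEnv, without touching the
>> >     # environment Turbine itself was started under.
>> >     source /opt/modules/default/init/bash
>> >     module swap PrgEnv-gnu PrgEnv-intel
>> >     exec "$@"    # run the real app (e.g. discus) with its arguments
>> >
>> > The app() call would then invoke this wrapper rather than discus directly.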
>> >
>> > Which may in turn raise the issue of whether "module load
>> > PrgEnv-intel" for example is scalable to hundredes to tens of
>> > thousands of concurrent invocations.
>> >
>> > - Mike
>> >
>> > On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>> >>
>> >> Yes, I am testing launching tasks with a plain shell script and a
>> >> plain Tcl script. I will then move to the ADLB batcher program.
>> >>
>> >> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>> >>> Hi Justin,
>> >>>
>> >>> I did not get to spend any time on this over the weekend.
>> >>>
>> >>> What is your latest assessment of the issue?
>> >>>
>> >>> I did some runs Fri night that were disturbing: two runs of 256
>> >>> sleeps on 128 cores, each yielding a different number of output
>> >>> files (but missing ~ 5 in one case and ~ 10 in the other).
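>> >>>
>> >>> A quick way to see exactly which outputs are missing would be something
>> >>> like this (the run directory and file-name pattern are hypothetical):
>> >>>
>> >>>     # print the expected names that are absent from the run directory
>> >>>     comm -23 <(seq -f "out-%g.txt" 1 256 | sort) <(ls rundir | sort)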
>> >>>
>> >>> I was going to do a test of a simple MPI app forking tasks with
>> >>> system(), to see whether the same failure pattern appears as the CPU
>> >>> count increases from 128 to 1024.
>> >>>
>> ...
>>
>> --
>> Justin M Wozniak
>>
>>
>> On 10/15/14 4:29 PM, Michael Wilde wrote:
>>> Justin, can you report on your latest findings on this?
>>>
>>> (I will fwd Justin's latest notes from Monday...)
>>>
>>> - Mike
>>>
>>> On 10/15/14 4:27 PM, Tim Armstrong wrote:
>>>> Did you reach a resolution on this? I'm back in town now and
>>>> could look at adding the ability to retrieve the exit code.
>>>>
>>>> - Tim
>>>>
>>>> On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>>>>
>>>> Also: identical code seems to run fine at this setting:
>>>> PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl
>>>> # (4 nodes)
>>>> but fails at this setting:
>>>> PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl
>>>> # (8 nodes)
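>>>>
>>>> One way to pin down the threshold might be a quick sweep over the
>>>> intermediate sizes (a sketch, reusing the same settings file and $tcl
>>>> script as in the commands above):
>>>>
>>>>     for n in 64 80 96 112 128; do
>>>>         PPN=16 turbine-cray-run.zsh -s titan.settings -n $n $tcl
>>>>     done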
>>>>
>>>> Changing the app() call from invoking DISCUS directly back to a shell
>>>> wrapper around DISCUS shows that when the job fails, the
>>>> app (the shell wrapper) is indeed executed and logs a message
>>>> to the app stdout file, which I see. But I don't see a
>>>> message after DISCUS.
>>>>
>>>> To test if perhaps many concurrent DISCUS invocations (127)
>>>> are causing a problem which is fatal to the whole job, I
>>>> added a random sleep before DISCUS. This had no effect on
>>>> the problem.
>>>>
>>>> Next I eliminated the actual DISCUS call, and just did a
>>>> random sleep from 0 to 9 seconds. Here I discovered that
>>>> all the jobs with sleep > 0 failed in the same manner as
>>>> with DISCUS, while the sleep-0 cases all printed all their
>>>> echoes and exited.
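>>>>
>>>> The sleep-only app is roughly along these lines (a sketch; the echo
>>>> text and argument handling are illustrative, not the exact script):
>>>>
>>>>     #!/bin/bash
>>>>     # stand-in for the DISCUS call: random 0-9 second sleep between echoes
>>>>     echo "task $1 starting on $(hostname)"
>>>>     sleep $(( RANDOM % 10 ))
>>>>     echo "task $1 done"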
>>>>
>>>> So there is something more deeply wrong going on, having
>>>> nothing to do with the DISCUS app. I'll try to narrow this
>>>> down, and see if perhaps it's happening on other non-Cray
>>>> systems like Midway.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> On 10/2/14 10:52 PM, Michael Wilde wrote:
>>>>> Thanks, Tim. But stderr and stdout were both redirected
>>>>> to files, and all .err and .out files are empty.
>>>>>
>>>>> I'll double-check that these redirects are working, but in
>>>>> prior runs, I did indeed get app error and output in those
>>>>> files.
>>>>>
>>>>> It looks more to me like the app is failing to launch.
>>>>>
>>>>> - Mike
>>>>>
>>>>> On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>>>>> The app stderr/stdout should go to Swift/T stderr/stdout
>>>>>> unless redirected. The problem is most likely discus
>>>>>> returning a non-zero error code.
>>>>>>
>>>>>> - Tim
>>>>>>
>>>>>> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde
>>>>>> <wilde at anl.gov> wrote:
>>>>>>
>>>>>>
>>>>>> I'm getting a failure from a Swift/T run on Titan that
>>>>>> I can't diagnose.
>>>>>>
>>>>>> The script is running 16K DISCUS simulations, about
>>>>>> 30 secs each, 16 per node, on 8 nodes: 127 workers, 1
>>>>>> server. It seems to run well at smaller scale (e.g. 1K
>>>>>> simulations).
>>>>>>
>>>>>> My PBS output file shows that some of the initial
>>>>>> round of simulations fail. It seems that when this
>>>>>> happens, the Swift script exits within the first
>>>>>> round of simulations, and none of them seem to even start.
>>>>>>
>>>>>> Some of the logs are below. I'll continue to
>>>>>> experiment and add more logging to try to isolate the
>>>>>> cause.
>>>>>>
>>>>>> I was initially suspicious of an OOM problem, but I
>>>>>> don't see any sign of that.
>>>>>>
>>>>>> In the line below "Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>> ../Si.cll 0.004_0.264.inte 0.004_0.264.stru": is
>>>>>> there a way to tell what Swift thinks the exit status
>>>>>> of these failing apps is? And what is causing the whole
>>>>>> run to abort?
>>>>>>
>>>>>> It almost looks like one app, or a small number of
>>>>>> them, is generating a SIGABRT, but that's not clear,
>>>>>> and I have no stdout/err info from the app processes.
>>>>>>
>>>>>> I'll look for some more detailed debug options,
>>>>>> and/or add some wrappers around the apps.
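>>>>>>
>>>>>> The kind of wrapper I have in mind (a sketch; the log naming is
>>>>>> hypothetical, and ALPS_APP_PE is just the per-rank variable aprun
>>>>>> sets, used here to keep log names unique):
>>>>>>
>>>>>>     #!/bin/bash
>>>>>>     # Run the real command, capturing its output and exit status per task.
>>>>>>     log=wrapper.${ALPS_APP_PE:-na}.$$
>>>>>>     "$@" > "$log.out" 2> "$log.err"
>>>>>>     rc=$?
>>>>>>     echo "$(date) host=$(hostname) rc=$rc cmd=$*" >> "$log.status"
>>>>>>     exit $rc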
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> T$ cat output.txt.2115044.out
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>>>> ../Si.cll 0.004_0.264.inte 0.004_0.264.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.272 1 0.75
>>>>>> ../Si.cll 0.004_0.272.inte 0.004_0.272.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.26
>>>>>> 0.052000000000000005 1 0.75 ../Si.cll
>>>>>> 0.260_0.052.inte 0.260_0.052.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.084 1 0.75
>>>>>> ../Si.cll 0.004_0.084.inte 0.004_0.084.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.26 0.024 1 0.75
>>>>>> ../Si.cll 0.260_0.024.inte 0.260_0.024.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.056 1 0.75
>>>>>> ../Si.cll 0.004_0.056.inte 0.004_0.056.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004 0.364 1 0.75
>>>>>> ../Si.cll 0.004_0.364.inte 0.004_0.364.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.004
>>>>>> 0.07200000000000001 1 0.75 ../Si.cll 0.004_0.072.inte
>>>>>> 0.004_0.072.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>>
>>>>>> Swift: external command failed:
>>>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>>>> --macro ../stacking-faults3.mac 0.26 0.044 1 0.75
>>>>>> ../Si.cll 0.260_0.044.inte 0.260_0.044.stru
>>>>>>
>>>>>> Swift: killing MPI job...
>>>>>> ADLB_Abort(1)
>>>>>> MPI_Abort(1)
>>>>>> Application 7706378 exit codes: 134
>>>>>> Application 7706378 exit signals: Killed
>>>>>> Application 7706378 resources: utime ~5s, stime ~19s,
>>>>>> Rss ~11016, inblocks ~48986, outblocks ~205
>>>>>> T$
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> T$ cat output.txt
>>>>>> bash_profile: loading modules
>>>>>> DB /ccs/home/wildemj/.modules: loading modules...
>>>>>> Turbine: turbine-aprun.sh
>>>>>> 10/02/2014 09:15PM
>>>>>>
>>>>>> TURBINE_HOME:
>>>>>> /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>>>>> SCRIPT:
>>>>>> /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>>>>> PROCS: 128
>>>>>> NODES: 8
>>>>>> PPN: 16
>>>>>> WALLTIME: 00:15:00
>>>>>>
>>>>>> TURBINE_WORKERS: 127
>>>>>> ADLB_SERVERS: 1
>>>>>>
>>>>>> TCLSH:
>>>>>> /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>>>>
>>>>>> JOB OUTPUT:
>>>>>>
>>>>>> Rank 2 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n3]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 2
>>>>>> Rank 4 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n3]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 4
>>>>>> Rank 6 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n1]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 6
>>>>>> Rank 3 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n2]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 3
>>>>>> Rank 5 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n0]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 5
>>>>>> Rank 1 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n2]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 1
>>>>>> Rank 7 [Thu Oct 2 21:15:32 2014] [c14-0c1s0n0]
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) -
>>>>>> process 7
>>>>>> _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu
>>>>>> Oct 2 21:15:32 2014] PE RANK 5 exit signal Aborted
>>>>>> _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu
>>>>>> Oct 2 21:15:32 2014] PE RANK 3 exit signal Aborted
>>>>>> _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu
>>>>>> Oct 2 21:15:32 2014] PE RANK 4 exit signal Aborted
>>>>>> _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu
>>>>>> Oct 2 21:15:32 2014] PE RANK 2 exit signal Aborted
>>>>>> [NID 04675] 2014-10-02 21:15:32 Apid 7706378:
>>>>>> initiated application termination
>>>>>> _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu
>>>>>> Oct 2 21:15:32 2014] PE RANK 6 exit signal Aborted
>>>>>> T$
>>>>>> T$
>>>>>> T$
>>>>>> T$ cat turbine.log
>>>>>> JOB: 2115044
>>>>>> COMMAND: stacking-faults3.tcl
>>>>>> HOSTNAME: ccs.ornl.gov
>>>>>> SUBMITTED: 10/02/2014 09:14PM
>>>>>> PROCS: 128
>>>>>> PPN: 16
>>>>>> NODES: 8
>>>>>> TURBINE_WORKERS:
>>>>>> ADLB_SERVERS:
>>>>>> WALLTIME: 00:15:00
>>>>>> ADLB_EXHAUST_TIME:
>>>>>> T$
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael Wilde
>>>>>> Mathematics and Computer Science Computation Institute
>>>>>> Argonne National Laboratory The University of Chicago
>>>>>>
>>>>>
>>>>> --
>>>>> Michael Wilde
>>>>> Mathematics and Computer Science Computation Institute
>>>>> Argonne National Laboratory The University of Chicago
>>>>
>>>> --
>>>> Michael Wilde
>>>> Mathematics and Computer Science Computation Institute
>>>> Argonne National Laboratory The University of Chicago
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Michael Wilde
>>> Mathematics and Computer Science Computation Institute
>>> Argonne National Laboratory The University of Chicago
>>
>> --
>> Michael Wilde
>> Mathematics and Computer Science Computation Institute
>> Argonne National Laboratory The University of Chicago
>>
>>
>>
>> --
>> Justin M Wozniak (via phone)
>
> --
> Michael Wilde
> Mathematics and Computer Science Computation Institute
> Argonne National Laboratory The University of Chicago
--
Justin M Wozniak