[ExM Users] Cannot diagnose Swift/T script failure
Michael Wilde
wilde at anl.gov
Wed Oct 15 16:31:50 CDT 2014
Status as of Monday (but I know Justin did a few commits since...)
-------- Forwarded Message --------
Subject: Re: Latest on the Swift/T Titan problem?
Date: Mon, 13 Oct 2014 16:11:31 -0500
From: Justin M Wozniak <wozniak at mcs.anl.gov>
To: Michael Wilde <wilde at anl.gov>
Pure shell: works
Pure Tcl: works
ADLB batcher: works
<------->
Swift/T: fails
Something in Turbine breaks the OS. I have been bisecting over a
stripped-down Turbine to identify exactly what this is. If I strip down
the Turbine link step, I can get it to work. So I think it is something
in our use of Cray modules. (I try not to use modules/softenv, but they
are affecting the link.)
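One quick check of what the module environment contributes at load time
(a rough sketch; the library name and path below are placeholders, not
the actual Turbine install layout):

  # Resolve the Turbine Tcl extension's shared-library dependencies under
  # the default Cray module set, then under a stripped environment, and
  # compare what actually gets pulled in.
  ldd $TURBINE_HOME/lib/libtclturbine.so > deps-with-modules.txt
  module purge
  ldd $TURBINE_HOME/lib/libtclturbine.so > deps-without-modules.txt
  diff deps-with-modules.txt deps-without-modules.txt
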
This may fix the Swift/NAMD issue, which would be a complementary
positive outcome.
On 10/13/2014 12:10 PM, Michael Wilde wrote:
> One avenue I was wondering about for debugging: What PrgEnv is the
> Swift/T on Titan built for?
> I'm assuming gnu, but if so, how does that interact with the
> PrgEnv-intel needed for the DISCUS app?
>
> I would think that to handle such cases we need to run Turbine under
> the PrgEnv it needs, and then run any app() calls that need a different
> PrgEnv under a wrapper shell script that provides it, just for that
> process.
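> Roughly something like the following (just a sketch; the wrapper name
> is made up, and the module init path is a guess at the standard Cray one):
>
>   #!/bin/bash
>   # discus-env.sh (hypothetical): give this one app invocation the Intel
>   # environment without changing what Turbine itself was launched under.
>   source /opt/modules/default/init/bash   # make `module` usable in a script
>   module swap PrgEnv-gnu PrgEnv-intel
>   exec "$@"                               # run the real app command line
>
> The Swift app() command line would then become something like
> "discus-env.sh /path/to/discus ...", just for that call.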
>
> Which may in turn raise the issue of whether "module load
> PrgEnv-intel", for example, is scalable to hundreds or tens of
> thousands of concurrent invocations.
>
> - Mike
>
> On 10/13/14 11:38 AM, Justin M Wozniak wrote:
>>
>> Yes, I am testing launching tasks with a plain shell script and a
>> plain Tcl script. I will then move to the ADLB batcher program.
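>> (Roughly of this shape, matching the failing run's 128x16 geometry,
>> though not the exact test script:)
>>
>>   aprun -n 128 -N 16 /bin/bash -c \
>>     'sleep $(( RANDOM % 10 )); echo "done on $(hostname)"'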
>>
>> On 10/13/2014 11:34 AM, Michael Wilde wrote:
>>> Hi Justin,
>>>
>>> I did not get to spend any time on this over the weekend.
>>>
>>> What is your latest assessment of the issue?
>>>
>>> I did some runs Friday night that were disturbing: two runs of 256
>>> sleeps on 128 cores, each yielding a different number of output
>>> files (missing ~5 in one case and ~10 in the other).
>>>
>>> I was going to test a simple MPI app that forks tasks with
>>> system(), to see whether the same failure pattern appears as the
>>> number of CPUs increases from 128 to 1024.
>>>
...
--
Justin M Wozniak
On 10/15/14 4:29 PM, Michael Wilde wrote:
> Justin, can you report on your latest findings on this?
>
> (I will fwd Justin's latest notes from Monday...)
>
> - Mike
>
> On 10/15/14 4:27 PM, Tim Armstrong wrote:
>> Did you reach a resolution on this? I'm back in town now and could
>> look at adding the ability to retrieve the exit code.
>>
>> - Tim
>>
>> On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>>
>> Also: identical code seems to run fine at this setting:
>>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl    # (4 nodes)
>> but fails at this setting:
>>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl   # (8 nodes)
>>
>> Changing the app that calls DISCUS directly back to a shell
>> wrapper around DISCUS shows that, when the job fails, the app
>> (the shell wrapper) is indeed executed and logs a message to the
>> app stdout file, which I do see. But I don't see a message after DISCUS.
>>
>> To test whether many concurrent DISCUS invocations (127) were
>> causing a problem fatal to the whole job, I added a random sleep
>> before DISCUS. This had no effect on the problem.
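>> The wrapper is roughly this shape (simplified; the DISCUS path below
>> is a placeholder):
>>
>>   #!/bin/bash
>>   # App wrapper: log around the real invocation so the .out file shows
>>   # how far each task gets.
>>   echo "wrapper start on $(hostname) at $(date)"   # this line does appear
>>   sleep $(( RANDOM % 10 ))     # the random stagger added for this test
>>   /path/to/discus "$@"         # stand-in for the real DISCUS command line
>>   echo "after DISCUS"          # this line never shows up in the failing runs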
>>
>> Next I eliminated the actual DISCUS call and just did a random
>> sleep from 0 to 9 seconds. Here I discovered that all the jobs
>> with sleep > 0 failed in the same manner as with DISCUS, while
>> the sleep-0 cases all printed all their echoes and exited.
>>
>> So there is something more deeply wrong going on, having nothing
>> to do with the DISCUS app. I'll try to narrow this down and see
>> whether it's also happening on non-Cray systems like Midway.
>>
>> - Mike
>>
>>
>> On 10/2/14 10:52 PM, Michael Wilde wrote:
>>> Thanks, Tim. But stderr and stdout were both redirected to
>>> files, and all of the .err and .out files are empty.
>>>
>>> I'll double-check that these redirects are working, but in prior
>>> runs, I did indeed get app error and output in those files.
>>>
>>> It looks more to me like the app is failing to launch.
>>>
>>> - Mike
>>>
>>> On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>>> The app stderr/stdout should go to Swift/T stderr/stdout unless
>>>> redirected. The problem is most likely discus returning a
>>>> non-zero error code.
>>>>
>>>> - Tim
>>>>
>>>> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>>
>>>>
>>>> I'm getting a failure from a Swift/T run on Titan that I
>>>> can't diagnose.
>>>>
>>>> The script is running 16K DISCUS simulations, about 30 seconds
>>>> each, 16 per node, on 8 nodes: 127 workers, 1 server. It
>>>> seems to run well at smaller scale (e.g., 1K simulations).
>>>>
>>>> My PBS output file shows that some of the initial round of
>>>> simulations fail. It seems that when this happens, the
>>>> Swift script exits within that first round of simulations,
>>>> and none of the remaining ones even seem to start.
>>>>
>>>> Some of the logs are below. I'll continue to experiment
>>>> and add more logging to try to isolate the cause.
>>>>
>>>> I was initially suspicious of an OOM problem, but I don't
>>>> see any sign of that.
>>>>
>>>> In the line below, "Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>> ../Si.cll 0.004_0.264.inte 0.004_0.264.stru" - is there a
>>>> way to tell what Swift thinks the exit status of these
>>>> failing apps is? And what is causing the whole run to abort?
>>>>
>>>> It almost looks like one app, or a small number of them,
>>>> are generating a SIGABRT, but that's not clear, and I have
>>>> no stdout/err info from the app processes.
>>>>
>>>> I'll look for some more detailed debug options, and/or add
>>>> some wrappers around the apps.
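>>>> For example, a per-task wrapper that records the exit code to a
>>>> file, so it survives even if Swift aborts the whole job (a sketch;
>>>> the names and paths are placeholders):
>>>>
>>>>   #!/bin/bash
>>>>   # discus-logged.sh (hypothetical): run DISCUS, keep its output, and
>>>>   # record its exit status per invocation.
>>>>   tag=task.$(hostname).$$
>>>>   /path/to/discus "$@" > "$tag.out" 2> "$tag.err"
>>>>   rc=$?
>>>>   echo "$rc $*" >> "$tag.rc"   # exit code plus the arguments used
>>>>   exit $rc                     # hand the status back to Swift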
>>>>
>>>> Thanks,
>>>>
>>>> - Mike
>>>>
>>>> ---
>>>>
>>>> T$ cat output.txt.2115044.out
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75
>>>> ../Si.cll 0.004_0.264.inte 0.004_0.264.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.272 1 0.75
>>>> ../Si.cll 0.004_0.272.inte 0.004_0.272.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1
>>>> 0.75 ../Si.cll 0.260_0.052.inte 0.260_0.052.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.084 1 0.75
>>>> ../Si.cll 0.004_0.084.inte 0.004_0.084.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>>>> 0.260_0.024.inte 0.260_0.024.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.056 1 0.75
>>>> ../Si.cll 0.004_0.056.inte 0.004_0.056.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.364 1 0.75
>>>> ../Si.cll 0.004_0.364.inte 0.004_0.364.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1
>>>> 0.75 ../Si.cll 0.004_0.072.inte 0.004_0.072.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>>
>>>> Swift: external command failed:
>>>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>> --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>>>> 0.260_0.044.inte 0.260_0.044.stru
>>>>
>>>> Swift: killing MPI job...
>>>> ADLB_Abort(1)
>>>> MPI_Abort(1)
>>>> Application 7706378 exit codes: 134
>>>> Application 7706378 exit signals: Killed
>>>> Application 7706378 resources: utime ~5s, stime ~19s, Rss
>>>> ~11016, inblocks ~48986, outblocks ~205
>>>> T$
>>>>
>>>> ---
>>>>
>>>> T$ cat output.txt
>>>> bash_profile: loading modules
>>>> DB /ccs/home/wildemj/.modules: loading modules...
>>>> Turbine: turbine-aprun.sh
>>>> 10/02/2014 09:15PM
>>>>
>>>> TURBINE_HOME:
>>>> /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>>> SCRIPT:
>>>> /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>>> PROCS: 128
>>>> NODES: 8
>>>> PPN: 16
>>>> WALLTIME: 00:15:00
>>>>
>>>> TURBINE_WORKERS: 127
>>>> ADLB_SERVERS: 1
>>>>
>>>> TCLSH:
>>>> /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>>
>>>> JOB OUTPUT:
>>>>
>>>> Rank 2 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n3] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>>>> Rank 4 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n3] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>>> Rank 6 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n1] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>>>> Rank 3 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n2] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>>>> Rank 5 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n0] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>>> Rank 1 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n2] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>> Rank 7 [Thu Oct 2 21:15:32 2014] [c14-0c1s0n0] application
>>>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>>>> _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct
>>>> 2 21:15:32 2014] PE RANK 5 exit signal Aborted
>>>> _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct
>>>> 2 21:15:32 2014] PE RANK 3 exit signal Aborted
>>>> _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct
>>>> 2 21:15:32 2014] PE RANK 4 exit signal Aborted
>>>> _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct
>>>> 2 21:15:32 2014] PE RANK 2 exit signal Aborted
>>>> [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated
>>>> application termination
>>>> _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct
>>>> 2 21:15:32 2014] PE RANK 6 exit signal Aborted
>>>> T$
>>>> T$ cat turbine.log
>>>> JOB: 2115044
>>>> COMMAND: stacking-faults3.tcl
>>>> HOSTNAME: ccs.ornl.gov
>>>> SUBMITTED: 10/02/2014 09:14PM
>>>> PROCS: 128
>>>> PPN: 16
>>>> NODES: 8
>>>> TURBINE_WORKERS:
>>>> ADLB_SERVERS:
>>>> WALLTIME: 00:15:00
>>>> ADLB_EXHAUST_TIME:
>>>> T$
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>
--
Michael Wilde
Mathematics and Computer Science, Computation Institute
Argonne National Laboratory, The University of Chicago