[ExM Users] Can not diagnose Swift/T script failure

Michael Wilde wilde at anl.gov
Wed Oct 15 16:29:26 CDT 2014


Justin, can you report on your latest findings on this?

(I will fwd Justin's latest notes from Monday...)

- Mike

On 10/15/14 4:27 PM, Tim Armstrong wrote:
> Did you reach a resolution on this?  I'm back in town now and could 
> look at adding the ability to retrieve the exit code.
>
> - Tim
>
> On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:
>
>     Also: identical code seems to run fine at this setting:
>       PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl  # (4
>     nodes)
>     but fails at this setting:
>       PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl # (8
>     nodes)
>
>     Changing the app that calls DISCUS directly back to a shell
>     wrapper around DISCUS shows that when the job fails, the app
>     (shell wrapper) is indeed executed and logs a message to the app
>     stdout file, which I see. But I don't see a message after DISCUS.
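>
>     Roughly, the wrapper looks like this (a sketch; the discus path
>     and the log messages are illustrative, not the exact script):
>
>       #!/bin/bash
>       # Hypothetical wrapper: log before and after the DISCUS call
>       # and record its exit code, so a silent failure is visible.
>       echo "wrapper: starting discus $*"
>       /path/to/discus "$@"
>       rc=$?
>       echo "wrapper: discus exited with code $rc"
>       exit $rc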
>
>     To test whether many concurrent DISCUS invocations (127) were
>     causing a problem fatal to the whole job, I added a random sleep
>     before DISCUS (sketched below). This had no effect on the problem.
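>
>     The delay was just a line like this at the top of the wrapper
>     (a sketch):
>
>       # Hypothetical: stagger the 127 concurrent launches by 0-9 seconds
>       sleep $((RANDOM % 10))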
>
>     Next I eliminated the actual DISCUS call and just did a random
>     sleep of 0 to 9 seconds. Here I discovered that all the jobs
>     with sleep > 0 failed in the same manner as with DISCUS, while the
>     sleep 0 cases all printed all their echoes and exited.
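>
>     So the DISCUS-free test case is essentially just (a sketch of the
>     app body; the messages are illustrative):
>
>       #!/bin/bash
>       # Hypothetical minimal reproducer: no DISCUS at all
>       d=$((RANDOM % 10))
>       echo "wrapper: sleeping $d"
>       sleep $d
>       echo "wrapper: done"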
>
>     So something more deeply wrong is going on, and it has nothing
>     to do with the DISCUS app itself. I'll try to narrow this down,
>     and see if perhaps it's happening on other non-Cray systems like
>     Midway.
>
>     - Mike
>
>
>     On 10/2/14 10:52 PM, Michael Wilde wrote:
>>     Thanks, Tim. But stderr and stdout were both redirected to
>>     files, and all .err and .out files are empty.
>>
>>     I'll double-check that these redirects are working, but in prior
>>     runs, I did indeed get app error and output in those files.
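>>
>>     One quick way to double-check (a sketch; assumes the .out/.err
>>     files land under the run's output directory):
>>
>>       # Any non-empty app output files at all?
>>       find . \( -name '*.out' -o -name '*.err' \) -size +0 -print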
>>
>>     It looks more to me like the app is failing to launch.
>>
>>     - Mike
>>
>>     On 10/2/14 10:22 PM, Tim Armstrong wrote:
>>>     The app stderr/stdout should go to Swift/T stderr/stdout unless
>>>     redirected.  The problem is most likely discus returning a
>>>     non-zero error code.
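>>>
>>>     One way to confirm would be to rerun one failing case by hand
>>>     and check the code, e.g. (a sketch; the arguments are copied
>>>     from the log below):
>>>
>>>       /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus \
>>>           --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll \
>>>           0.004_0.264.inte 0.004_0.264.stru
>>>       echo "discus exit code: $?"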
>>>
>>>     - Tim
>>>
>>>     On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>>
>>>
>>>         I'm getting a failure from a Swift/T run on Titan that I can't
>>>         diagnose.
>>>
>>>         The script is running 16K DISCUS simulations, about 30 secs
>>>         each, 16 per node, on 8 nodes: 127 workers, 1 server. It
>>>         seems to run well at smaller scale (e.g. 1K simulations).
>>>
>>>         My PBS output file shows that some of the initial round of
>>>         simulations fail.  It seems that when this happens, the
>>>         Swift script exits within the first round of simulations,
>>>         and no further ones seem to even start.
>>>
>>>         Some of the logs are below.  I'll continue to experiment and
>>>         add more logging to try to isolate the cause.
>>>
>>>         I was initially suspicious of an OOM problem, but I don't see
>>>         any sign of that.
>>>
>>>         In the line below "Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>>>         0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell
>>>         what Swift thinks the exit status of these failing apps
>>>         is?  What is causing the whole run to abort?
>>>
>>>         It almost looks like one app, or a small number of them, is
>>>         generating a SIGABRT (the aprun exit code 134 below would be
>>>         consistent with that, since 134 = 128 + 6 = SIGABRT), but
>>>         that's not clear, and I have no stdout/err info from the app
>>>         processes.
>>>
>>>         I'll look for some more detailed debug options, and/or add
>>>         some wrappers around the apps.
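>>>
>>>         For example, a wrapper along these lines (a sketch; the
>>>         names are hypothetical) would at least record each app's
>>>         stdout, stderr, and exit status before Swift aborts the run:
>>>
>>>           #!/bin/bash
>>>           # Hypothetical per-task wrapper: capture output and the
>>>           # exit status of each discus invocation in its own file.
>>>           log="wrapper.$$.log"
>>>           "$@" > "$log" 2>&1
>>>           rc=$?
>>>           echo "exit status: $rc" >> "$log"
>>>           exit $rc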
>>>
>>>         Thanks,
>>>
>>>         - Mike
>>>
>>>         ---
>>>
>>>         T$ cat output.txt.2115044.out
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>>>         0.004_0.264.inte 0.004_0.264.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
>>>         0.004_0.272.inte 0.004_0.272.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1
>>>         0.75 ../Si.cll 0.260_0.052.inte 0.260_0.052.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
>>>         0.004_0.084.inte 0.004_0.084.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>>>         0.260_0.024.inte 0.260_0.024.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
>>>         0.004_0.056.inte 0.004_0.056.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
>>>         0.004_0.364.inte 0.004_0.364.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1
>>>         0.75 ../Si.cll 0.004_0.072.inte 0.004_0.072.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>
>>>         Swift: external command failed:
>>>         /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>>>         --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>>>         0.260_0.044.inte 0.260_0.044.stru
>>>
>>>         Swift: killing MPI job...
>>>         ADLB_Abort(1)
>>>         MPI_Abort(1)
>>>         Application 7706378 exit codes: 134
>>>         Application 7706378 exit signals: Killed
>>>         Application 7706378 resources: utime ~5s, stime ~19s, Rss
>>>         ~11016, inblocks ~48986, outblocks ~205
>>>         T$
>>>
>>>         ---
>>>
>>>         T$ cat output.txt
>>>         bash_profile: loading modules
>>>         DB /ccs/home/wildemj/.modules: loading modules...
>>>         Turbine: turbine-aprun.sh
>>>         10/02/2014 09:15PM
>>>
>>>         TURBINE_HOME:
>>>         /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>>>         SCRIPT:
>>>         /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>>>         PROCS:        128
>>>         NODES:        8
>>>         PPN:          16
>>>         WALLTIME:     00:15:00
>>>
>>>         TURBINE_WORKERS: 127
>>>         ADLB_SERVERS:    1
>>>
>>>         TCLSH:
>>>         /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>>
>>>         JOB OUTPUT:
>>>
>>>         Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>>>         Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>>>         Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>>>         Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>>>         Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>>>         Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>>         Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application
>>>         called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>>>         _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2
>>>         21:15:32 2014] PE RANK 5 exit signal Aborted
>>>         _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2
>>>         21:15:32 2014] PE RANK 3 exit signal Aborted
>>>         _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2
>>>         21:15:32 2014] PE RANK 4 exit signal Aborted
>>>         _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2
>>>         21:15:32 2014] PE RANK 2 exit signal Aborted
>>>         [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated
>>>         application termination
>>>         _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2
>>>         21:15:32 2014] PE RANK 6 exit signal Aborted
>>>         T$
>>>         T$ cat turbine.log
>>>         JOB:               2115044
>>>         COMMAND:           stacking-faults3.tcl
>>>         HOSTNAME:          ccs.ornl.gov
>>>         SUBMITTED:         10/02/2014 09:14PM
>>>         PROCS:             128
>>>         PPN:               16
>>>         NODES:             8
>>>         TURBINE_WORKERS:
>>>         ADLB_SERVERS:
>>>         WALLTIME:          00:15:00
>>>         ADLB_EXHAUST_TIME:
>>>         T$
>>>
>>>
>>>
>>>
>>
>
>
>

-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago
