[ExM Users] Can not diagnose Swift/T script failure

Michael Wilde wilde at anl.gov
Thu Oct 2 22:52:57 CDT 2014


Thanks, Tim. But stderr and stdout were both redirected to files, and 
all .err and .out files are empty.

I'll double-check that these redirects are working, but in prior runs, I 
did indeed get app error and output in those files.

It looks more to me like the app is failing to launch.
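One way to pin that down is the wrapper approach mentioned below: interpose a script between Swift/T and discus that records each invocation's exit status and stderr even when the run is aborted. A minimal sketch (the `run_logged` name, the log-file names, and the `WRAPPER_LOG_DIR` variable are illustrative, not part of Swift/T):

```shell
#!/bin/bash
# Wrap each app invocation so its exit status and stderr survive an abort.

run_logged() {
    local logdir="${WRAPPER_LOG_DIR:-.}"
    local tag="$$-$RANDOM"
    # Run the real command, capturing its streams per invocation
    "$@" > "$logdir/task-$tag.out" 2> "$logdir/task-$tag.err"
    local rc=$?
    # One-line summary that survives even if the .out/.err files are empty
    echo "rc=$rc cmd=$*" >> "$logdir/task-exit-codes.log"
    # A status above 128 means death by signal: 134 = 128 + 6 (SIGABRT),
    # which would match the "exit codes: 134" in the PBS output below.
    if [ "$rc" -gt 128 ]; then
        echo "signal=$((rc - 128)) cmd=$*" >> "$logdir/task-exit-codes.log"
    fi
    return $rc
}

# Example: a command that dies on SIGABRT, as the logs suggest discus did
run_logged bash -c 'kill -ABRT $$'
echo "wrapper saw exit status $?"
```

Pointing the Swift/T app declaration at the wrapper instead of the discus binary would then leave a per-task record in task-exit-codes.log to grep after the run.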

- Mike

On 10/2/14 10:22 PM, Tim Armstrong wrote:
> The app stderr/stdout should go to Swift/T stderr/stdout unless 
> redirected.  The problem is most likely discus returning a non-zero 
> error code.
>
> - Tim
>
> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>
>
>     I'm getting a failure from a Swift/T run on Titan that I can't diagnose.
>
>     The script is running 16K DISCUS simulations, about 30 secs each,
>     16 per node, on 8 nodes: 127 workers, 1 server. It seems to run
>     well at smaller scale (e.g., 1K simulations).
>
>     My PBS output file shows that some of the initial round of
>     simulations fail.  When this happens, the Swift script exits
>     during that first round, and the remaining simulations never
>     seem to start.
>
>     Some of the logs are below.  I'll continue to experiment and add
>     more logging to try to isolate the cause.
>
>     I was initially suspicious of an OOM problem, but I don't see any
>     sign of that.
>
>     In the line below "Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>     0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what
>     Swift thinks the exit status of these failing apps is?  What is
>     causing the whole run to abort?
>
>     It almost looks like one app, or a small number of them, are
>     generating a SIGABRT, but that's not clear, and I have no
>     stdout/err info from the app processes.
>
>     I'll look for some more detailed debug options, and/or add some
>     wrappers around the apps.
>
>     Thanks,
>
>     - Mike
>
>     ---
>
>     T$ cat output.txt.2115044.out
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>     0.004_0.264.inte 0.004_0.264.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
>     0.004_0.272.inte 0.004_0.272.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75
>     ../Si.cll 0.260_0.052.inte 0.260_0.052.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
>     0.004_0.084.inte 0.004_0.084.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>     0.260_0.024.inte 0.260_0.024.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
>     0.004_0.056.inte 0.004_0.056.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
>     0.004_0.364.inte 0.004_0.364.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75
>     ../Si.cll 0.004_0.072.inte 0.004_0.072.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>
>     Swift: external command failed:
>     /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>     --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>     0.260_0.044.inte 0.260_0.044.stru
>
>     Swift: killing MPI job...
>     ADLB_Abort(1)
>     MPI_Abort(1)
>     Application 7706378 exit codes: 134
>     Application 7706378 exit signals: Killed
>     Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016,
>     inblocks ~48986, outblocks ~205
>     T$
>
>     ---
>
>     T$ cat output.txt
>     bash_profile: loading modules
>     DB /ccs/home/wildemj/.modules: loading modules...
>     Turbine: turbine-aprun.sh
>     10/02/2014 09:15PM
>
>     TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>     SCRIPT:
>     /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>     PROCS:        128
>     NODES:        8
>     PPN:          16
>     WALLTIME:     00:15:00
>
>     TURBINE_WORKERS: 127
>     ADLB_SERVERS:    1
>
>     TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>
>     JOB OUTPUT:
>
>     Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>     Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>     Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>     Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>     Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>     Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>     Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called
>     MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>     _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2
>     21:15:32 2014] PE RANK 5 exit signal Aborted
>     _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2
>     21:15:32 2014] PE RANK 3 exit signal Aborted
>     _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2
>     21:15:32 2014] PE RANK 4 exit signal Aborted
>     _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2
>     21:15:32 2014] PE RANK 2 exit signal Aborted
>     [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated
>     application termination
>     _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2
>     21:15:32 2014] PE RANK 6 exit signal Aborted
>     T$
>     T$
>     T$
>     T$ cat turbine.log
>     JOB:               2115044
>     COMMAND:           stacking-faults3.tcl
>     HOSTNAME:          ccs.ornl.gov
>     SUBMITTED:         10/02/2014 09:14PM
>     PROCS:             128
>     PPN:               16
>     NODES:             8
>     TURBINE_WORKERS:
>     ADLB_SERVERS:
>     WALLTIME:          00:15:00
>     ADLB_EXHAUST_TIME:
>     T$
>
>
>
>
>     -- 
>     Michael Wilde
>     Mathematics and Computer Science          Computation Institute
>     Argonne National Laboratory               The University of Chicago
>
>     _______________________________________________
>     ExM-user mailing list
>     ExM-user at lists.mcs.anl.gov
>     https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>
>
>
>

-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago

