[ExM Users] Can not diagnose Swift/T script failure
Michael Wilde
wilde at anl.gov
Thu Oct 2 20:42:47 CDT 2014
I'm getting a failure from a Swift/T run on Titan that I can't diagnose.
The script runs 16K DISCUS simulations of about 30 seconds each, 16 per
node on 8 nodes (127 workers, 1 server). It seems to run well at smaller
scale (e.g. 1K simulations).
My PBS output file shows that some simulations in the initial round
fail. It seems that when this happens, the Swift script exits within
the first round of simulations, and no further simulations seem to even
start.
Some of the logs are below. I'll continue to experiment and add more
logging to try to isolate the cause.
I was initially suspicious of an OOM problem, but I don't see any sign of
that.
In the line below "Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what Swift
thinks the exit status of these failing apps is? And what is causing the
whole run to abort?
It almost looks like one app, or a small number of them, is generating
a SIGABRT, but that's not clear, and I have no stdout/stderr info from
the app processes.
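
For reference, the "exit codes: 134" line in the PBS output below is
consistent with SIGABRT under the common shell convention that a process
killed by signal N is reported as 128 + N (134 = 128 + 6). That value may
describe the Turbine ranks, which the job output shows aborting after
MPI_Abort, rather than DISCUS itself. A minimal Python sketch of the
decoding, assuming the reported value follows that convention (the helper
name describe_exit_status is hypothetical, not part of Swift/T or aprun):

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status in the range 0-255.

    Assumes the common convention that a value above 128 means
    "terminated by signal (status - 128)".
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number on the current platform.
            return f"exit code {status} (no known signal {signum})"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


# The value reported in the "exit codes" line of the PBS output.
print(describe_exit_status(134))
```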
I'll look for some more detailed debug options, and/or add some wrappers
around the apps.
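
A rough sketch of the kind of wrapper I have in mind, in Python. It
assumes Python 3 is available on the compute nodes; the names run_logged
and log_prefix are made up for illustration and are not part of Swift/T.
It saves the app's stdout and stderr to files, records the exit status,
and passes that status back to the caller:

```python
import subprocess
import sys


def run_logged(argv, log_prefix):
    """Run argv, save its stdout/stderr to files, return a shell-style status."""
    with open(log_prefix + ".out", "wb") as out, \
            open(log_prefix + ".err", "wb") as err:
        proc = subprocess.run(argv, stdout=out, stderr=err)
    rc = proc.returncode
    # subprocess reports death by signal N as a return code of -N;
    # map that to the shell convention 128 + N.
    status = 128 - rc if rc < 0 else rc
    with open(log_prefix + ".status", "w") as f:
        f.write(f"argv={argv!r} returncode={rc} status={status}\n")
    return status


# Command-line use: wrapper.py LOG_PREFIX COMMAND [ARGS...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(run_logged(sys.argv[2:], sys.argv[1]))
```

The Swift app definition would then invoke the wrapper with a per-task
log prefix (for example the 0.004_0.264 stem already used for the output
files) followed by the original discus command line, so each failing
task leaves behind its own .out, .err and .status files.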
Thanks,
- Mike
---
T$ cat output.txt.2115044.out
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
0.004_0.264.inte 0.004_0.264.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
0.004_0.272.inte 0.004_0.272.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75
../Si.cll 0.260_0.052.inte 0.260_0.052.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
0.004_0.084.inte 0.004_0.084.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
0.260_0.024.inte 0.260_0.024.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
0.004_0.056.inte 0.004_0.056.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
0.004_0.364.inte 0.004_0.364.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75
../Si.cll 0.004_0.072.inte 0.004_0.072.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Swift: external command failed:
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
--macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
0.260_0.044.inte 0.260_0.044.stru
Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Application 7706378 exit codes: 134
Application 7706378 exit signals: Killed
Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016,
inblocks ~48986, outblocks ~205
T$
---
T$ cat output.txt
bash_profile: loading modules
DB /ccs/home/wildemj/.modules: loading modules...
Turbine: turbine-aprun.sh
10/02/2014 09:15PM
TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
SCRIPT:
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
PROCS: 128
NODES: 8
PPN: 16
WALLTIME: 00:15:00
TURBINE_WORKERS: 127
ADLB_SERVERS: 1
TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
JOB OUTPUT:
Rank 2 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n3] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Rank 4 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n3] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 4
Rank 6 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n1] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 6
Rank 3 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n2] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Rank 5 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n0] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 5
Rank 1 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n2] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Rank 7 [Thu Oct 2 21:15:32 2014] [c14-0c1s0n0] application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 7
_pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct 2 21:15:32
2014] PE RANK 5 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct 2 21:15:32
2014] PE RANK 3 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct 2 21:15:32
2014] PE RANK 4 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct 2 21:15:32
2014] PE RANK 2 exit signal Aborted
[NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application
termination
_pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct 2 21:15:32
2014] PE RANK 6 exit signal Aborted
T$
T$
T$
T$ cat turbine.log
JOB: 2115044
COMMAND: stacking-faults3.tcl
HOSTNAME: ccs.ornl.gov
SUBMITTED: 10/02/2014 09:14PM
PROCS: 128
PPN: 16
NODES: 8
TURBINE_WORKERS:
ADLB_SERVERS:
WALLTIME: 00:15:00
ADLB_EXHAUST_TIME:
T$
--
Michael Wilde
Mathematics and Computer Science Computation Institute
Argonne National Laboratory The University of Chicago