[ExM Users] Can not diagnose Swift/T script failure

Michael Wilde wilde at anl.gov
Thu Oct 2 20:42:47 CDT 2014


I'm getting a failure from a Swift/T run on Titan that I can't diagnose.

The script is running 16K DISCUS simulations, about 30 secs each, 16 per 
node, on 8 nodes: 127 workers, 1 server. It seems to run well at smaller 
scale (e.g., 1K simulations).

My PBS output file shows that some of the initial round of simulations 
fail.  It seems that when this happens, the Swift script exits within 
the first round of simulations, and none of the later simulations seem 
to even start.

Some of the logs are below.  I'll continue to experiment and add more 
logging to try to isolate the cause.

I was initially suspicious of an OOM problem, but I don't see any sign of 
that.

In the line below "Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll 
0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what Swift 
thinks the exit status of each of these failing apps is?  What is causing 
the whole run to abort?

It almost looks like one app, or a small number of them, is generating 
a SIGABRT, but that's not clear, and I have no stdout/err info from the 
app processes.
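
For reference, the PBS output further down reports "exit codes: 134" 
for the aprun application. Under the common shell convention, a code 
above 128 means "128 + signal number", so 134 would correspond to 
signal 6 (SIGABRT on Linux). The sketch below decodes a code that way. 
The function name describe_exit_code is hypothetical, and the sketch 
assumes the reported number follows that convention; it says nothing 
about which process raised the signal.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit code (assumes the 128 + N convention)."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # 134 is the value shown in the "exit codes" line of the PBS output.
    print(describe_exit_code(134))
```

Note that a code of 134 is also what results when the MPI job itself is 
aborted, so it does not by itself show that a DISCUS process aborted.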

I'll look for some more detailed debug options, and/or add some wrappers 
around the apps.
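
One possible shape for such a wrapper is sketched below. It is only an 
illustration, not anything provided by Swift/T: the script name 
(wrapper.py), the function name run_logged, and the log-file naming are 
all assumptions. It runs the real command, saves its stdout and stderr 
to per-task files, records how the process ended, and then returns a 
shell-style exit code so the caller still sees the failure.

```python
import subprocess
import sys


def run_logged(cmd, log_prefix):
    """Run cmd, save its stdout/stderr to files, and record how it ended."""
    with open(log_prefix + ".out", "w") as out, \
         open(log_prefix + ".err", "w") as err:
        proc = subprocess.run(cmd, stdout=out, stderr=err)
    rc = proc.returncode
    # subprocess reports death by signal N as a negative return code -N.
    if rc < 0:
        status = f"killed by signal {-rc}"
    else:
        status = f"exit status {rc}"
    with open(log_prefix + ".status", "w") as f:
        f.write(status + "\n")
    # Map a signal death back to the shell convention of 128 + N.
    return 128 - rc if rc < 0 else rc


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: wrapper.py LOG_PREFIX COMMAND [ARGS...]
    sys.exit(run_logged(sys.argv[2:], sys.argv[1]))
```

The wrapper would be invoked in place of the discus binary, with a log 
prefix such as 0.004_0.264 followed by the original command line, so 
that each failing task leaves behind .out, .err, and .status files.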

Thanks,

- Mike

---

T$ cat output.txt.2115044.out

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll 
0.004_0.264.inte 0.004_0.264.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll 
0.004_0.272.inte 0.004_0.272.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75 
../Si.cll 0.260_0.052.inte 0.260_0.052.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll 
0.004_0.084.inte 0.004_0.084.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll 
0.260_0.024.inte 0.260_0.024.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll 
0.004_0.056.inte 0.004_0.056.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll 
0.004_0.364.inte 0.004_0.364.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75 
../Si.cll 0.004_0.072.inte 0.004_0.072.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)

Swift: external command failed: 
/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus 
--macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll 
0.260_0.044.inte 0.260_0.044.stru

Swift: killing MPI job...
ADLB_Abort(1)
MPI_Abort(1)
Application 7706378 exit codes: 134
Application 7706378 exit signals: Killed
Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016, 
inblocks ~48986, outblocks ~205
T$

---

T$ cat output.txt
bash_profile: loading modules
DB /ccs/home/wildemj/.modules: loading modules...
Turbine: turbine-aprun.sh
10/02/2014 09:15PM

TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
SCRIPT: 
/lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
PROCS:        128
NODES:        8
PPN:          16
WALLTIME:     00:15:00

TURBINE_WORKERS: 127
ADLB_SERVERS:    1

TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6

JOB OUTPUT:

Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 2
Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 4
Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 6
Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 3
Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 5
Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called 
MPI_Abort(MPI_COMM_WORLD, 1) - process 7
_pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2 21:15:32 
2014] PE RANK 5 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2 21:15:32 
2014] PE RANK 3 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2 21:15:32 
2014] PE RANK 4 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2 21:15:32 
2014] PE RANK 2 exit signal Aborted
[NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application 
termination
_pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2 21:15:32 
2014] PE RANK 6 exit signal Aborted
T$
T$
T$
T$ cat turbine.log
JOB:               2115044
COMMAND:           stacking-faults3.tcl
HOSTNAME:          ccs.ornl.gov
SUBMITTED:         10/02/2014 09:14PM
PROCS:             128
PPN:               16
NODES:             8
TURBINE_WORKERS:
ADLB_SERVERS:
WALLTIME:          00:15:00
ADLB_EXHAUST_TIME:
T$




-- 
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory               The University of Chicago


