[ExM Users] Cannot diagnose Swift/T script failure
Michael Wilde
wilde at anl.gov
Fri Oct 3 00:07:17 CDT 2014
Also: identical code seems to run fine at this setting:
PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl # (4 nodes)
but fails at this setting:
PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl # (8 nodes)
Changing the app from calling DISCUS directly back to a shell wrapper
around DISCUS shows that when the job fails, the app (the shell wrapper)
is indeed executed, and it logs a message to the app stdout file, which
I do see. But I never see the message logged after the DISCUS call.
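The wrapper is essentially the following (a sketch; the script name and
DISCUS path here are placeholders):

#!/bin/bash
# run-discus.sh: log before and after the DISCUS call so we can see
# how far each app invocation gets
echo "wrapper: starting discus with args: $*"
/path/to/discus "$@"
rc=$?
echo "wrapper: discus exited with status $rc"
exit $rc

The exit status is passed back, so Swift/T should still see the app's
real return code.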
To test whether many concurrent DISCUS invocations (127) were causing a
problem fatal to the whole job, I added a random sleep before the DISCUS
call to stagger the starts. This had no effect on the problem.
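The stagger was just a line like this at the top of the wrapper (sketch):

# spread the DISCUS start times over roughly 0-9 seconds
sleep $(( RANDOM % 10 ))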
Next I eliminated the actual DISCUS call entirely and just did a random
sleep of 0 to 9 seconds. Here I discovered that all the tasks with
sleep > 0 failed in the same manner as with DISCUS, while the sleep 0
cases all printed their echoes and exited cleanly.
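For reference, the stand-in for DISCUS was roughly this (a sketch; the
script name is a placeholder):

#!/bin/bash
# sleep-test.sh: stand-in for DISCUS -- echo, sleep a random 0-9 seconds,
# then echo again, so the only variable left is task duration
secs=$(( RANDOM % 10 ))
echo "wrapper: start, sleeping $secs s"
sleep $secs
echo "wrapper: done after $secs s"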
So something more deeply wrong is going on, and it has nothing to do with
the DISCUS app itself. I'll try to narrow this down and see whether it
also happens on non-Cray systems like Midway.
- Mike
On 10/2/14 10:52 PM, Michael Wilde wrote:
> Thanks, Tim. But stderr and stdout were both redirected to files, and
> all of the .err and .out files are empty.
>
> I'll double-check that these redirects are working, but in prior runs,
> I did indeed get app error and output in those files.
>
> It looks more to me like the app is failing to launch.
>
> - Mike
>
> On 10/2/14 10:22 PM, Tim Armstrong wrote:
>> The app stderr/stdout should go to Swift/T stderr/stdout unless
>> redirected. The problem is most likely discus returning a non-zero
>> error code.
>>
>> - Tim
>>
>> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>>
>>
>> I'm getting a failure from a Swift/T run on Titan that I can't
>> diagnose.
>>
>> The script is running 16K DISCUS simulations, about 30 secs each,
>> 16 per node, on 8 nodes: 127 workers, 1 server. It seems to run
>> well at smaller scale (e.g., 1K simulations).
>>
>> My PBS output file shows that some of the initial round of
>> simulations fail. It seems that when this happens, the Swift
>> script exits during that first round, and none of the rest even
>> seem to start.
>>
>> Some of the logs are below. I'll continue to experiment and add
>> more logging to try to isolate the cause.
>>
>> I was initially suspicious of an OOM problem, but I don't see any
>> sign of that.
>>
>> In the lines below like "Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>> 0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what
>> Swift thinks the exit status of these failing apps is? What is
>> causing the whole run to abort?
>>
>> It almost looks like one app, or a small number of them, are
>> generating a SIGABRT, but that's not clear, and I have no
>> stdout/stderr info from the app processes.
>>
>> I'll look for some more detailed debug options, and/or add some
>> wrappers around the apps.
>>
>> Thanks,
>>
>> - Mike
>>
>> ---
>>
>> T$ cat output.txt.2115044.out
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>> 0.004_0.264.inte 0.004_0.264.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
>> 0.004_0.272.inte 0.004_0.272.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75
>> ../Si.cll 0.260_0.052.inte 0.260_0.052.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
>> 0.004_0.084.inte 0.004_0.084.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>> 0.260_0.024.inte 0.260_0.024.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
>> 0.004_0.056.inte 0.004_0.056.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
>> 0.004_0.364.inte 0.004_0.364.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75
>> ../Si.cll 0.004_0.072.inte 0.004_0.072.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>> 0.260_0.044.inte 0.260_0.044.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>> Application 7706378 exit codes: 134
>> Application 7706378 exit signals: Killed
>> Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016,
>> inblocks ~48986, outblocks ~205
>> T$
>>
>> ---
>>
>> T$ cat output.txt
>> bash_profile: loading modules
>> DB /ccs/home/wildemj/.modules: loading modules...
>> Turbine: turbine-aprun.sh
>> 10/02/2014 09:15PM
>>
>> TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>> SCRIPT:
>> /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>> PROCS: 128
>> NODES: 8
>> PPN: 16
>> WALLTIME: 00:15:00
>>
>> TURBINE_WORKERS: 127
>> ADLB_SERVERS: 1
>>
>> TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>
>> JOB OUTPUT:
>>
>> Rank 2 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n3] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>> Rank 4 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n3] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>> Rank 6 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n1] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>> Rank 3 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n2] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>> Rank 5 [Thu Oct 2 21:15:32 2014] [c14-0c1s1n0] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>> Rank 1 [Thu Oct 2 21:15:32 2014] [c12-0c1s1n2] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> Rank 7 [Thu Oct 2 21:15:32 2014] [c14-0c1s0n0] application
>> called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>> _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct 2
>> 21:15:32 2014] PE RANK 5 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct 2
>> 21:15:32 2014] PE RANK 3 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct 2
>> 21:15:32 2014] PE RANK 4 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct 2
>> 21:15:32 2014] PE RANK 2 exit signal Aborted
>> [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated
>> application termination
>> _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct 2
>> 21:15:32 2014] PE RANK 6 exit signal Aborted
>> T$
>> T$
>> T$
>> T$ cat turbine.log
>> JOB: 2115044
>> COMMAND: stacking-faults3.tcl
>> HOSTNAME: ccs.ornl.gov
>> SUBMITTED: 10/02/2014 09:14PM
>> PROCS: 128
>> PPN: 16
>> NODES: 8
>> TURBINE_WORKERS:
>> ADLB_SERVERS:
>> WALLTIME: 00:15:00
>> ADLB_EXHAUST_TIME:
>> T$
>>
>>
>>
>>
>> --
>> Michael Wilde
>> Mathematics and Computer Science, Argonne National Laboratory
>> Computation Institute, The University of Chicago
>>
>> _______________________________________________
>> ExM-user mailing list
>> ExM-user at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>
> --
> Michael Wilde
> Mathematics and Computer Science, Argonne National Laboratory
> Computation Institute, The University of Chicago
--
Michael Wilde
Mathematics and Computer Science, Argonne National Laboratory
Computation Institute, The University of Chicago