[ExM Users] Can not diagnose Swift/T script failure

Tim Armstrong tim.g.armstrong at gmail.com
Wed Oct 15 16:27:26 CDT 2014


Did you reach a resolution on this?  I'm back in town now and could look at
adding the ability to retrieve the exit code.

- Tim

On Fri, Oct 3, 2014 at 12:07 AM, Michael Wilde <wilde at anl.gov> wrote:

>  Also: identical code seems to run fine at this setting:
>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 64 $tcl  # (4 nodes)
> but fails at this setting:
>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 $tcl # (8 nodes)
>
> Changing the app that calls DISCUS directly back to a shell wrapper around
> DISCUS shows that when the job fails, the app (shell wrapper) is indeed
> executed and logs a message to the app stdout file, which I see. But I don't
> see a message after DISCUS.
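>
> For reference, a minimal sketch of the kind of wrapper I mean (the DISCUS
> path is the real one from the run; the logging lines and variable names are
> illustrative, not the production script):
>
>   #!/bin/bash
>   # Hypothetical wrapper around DISCUS: log before and after the call and
>   # record the exit code, so a silent failure still leaves a trace in stdout.
>   DISCUS=/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>   echo "wrapper: starting discus $*"
>   "$DISCUS" "$@"
>   rc=$?
>   echo "wrapper: discus exited with code $rc"
>   exit $rc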
>
> To test whether many concurrent DISCUS invocations (127) were causing a
> problem fatal to the whole job, I added a random sleep before DISCUS. This
> had no effect on the problem.
>
> Next I eliminated the actual DISCUS call and just did a random sleep of
> 0 to 9 seconds. Here I discovered that all the jobs with sleep > 0 failed
> in the same manner as with DISCUS, while the sleep-0 cases all printed all
> their echoes and exited.
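>
> The sleep-only test was essentially just the following (a sketch, assuming
> the same wrapper structure as above; the sleep replaces the DISCUS call
> entirely):
>
>   #!/bin/bash
>   # Hypothetical reproducer: a random 0-9 second sleep instead of DISCUS,
>   # to check whether the failure depends on DISCUS at all.
>   n=$((RANDOM % 10))
>   echo "wrapper: sleeping $n seconds instead of running discus"
>   sleep $n
>   echo "wrapper: done after $n seconds"
>   exit 0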
>
> So there is something more deeply wrong going on that has nothing to do
> with the DISCUS app. I'll try to narrow this down, and see if perhaps it's
> happening on other, non-Cray systems like Midway.
>
> - Mike
>
>
> On 10/2/14 10:52 PM, Michael Wilde wrote:
>
> Thanks, Tim. But stderr and stdout were both redirected to files, and all
> the .err and .out files are empty.
>
> I'll double-check that these redirects are working, but in prior runs, I
> did indeed get app error and output in those files.
>
> It looks more to me like the app is failing to launch.
>
> - Mike
>
> On 10/2/14 10:22 PM, Tim Armstrong wrote:
>
>  The app stderr/stdout should go to Swift/T stderr/stdout unless
> redirected.  The problem is most likely discus returning a non-zero error
> code.
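>
> One way to confirm that without taking down the whole run (a sketch of a
> diagnostic wrapper, not anything Swift/T provides automatically) is to have
> the wrapper record the exit code and then return success, so a single
> failing task can't abort the MPI job:
>
>   #!/bin/bash
>   # Hypothetical diagnostic wrapper: log discus's exit code per invocation
>   # but exit 0, so one non-zero exit does not abort the whole MPI job.
>   DISCUS=/lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>   "$DISCUS" "$@"
>   rc=$?
>   echo "$(hostname) pid $$ args [$*] exit $rc" >> discus-exit-codes.log
>   exit 0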
>
>  - Tim
>
> On Thu, Oct 2, 2014 at 8:42 PM, Michael Wilde <wilde at anl.gov> wrote:
>
>>
>> I'm getting a failure from a Swift/T run on Titan that I can't diagnose.
>>
>> The script runs 16K DISCUS simulations, about 30 seconds each, 16 per
>> node, on 8 nodes: 127 workers, 1 server. It seems to run well at smaller
>> scale (e.g., 1K simulations).
>>
>> My PBS output file shows that some of the initial round of simulations
>> fail. It seems that when this happens, the Swift script exits within the
>> first round of simulations, and the remaining simulations never even start.
>>
>> Some of the logs are below.  I'll continue to experiment and add more
>> logging to try to isolate the cause.
>>
>> I was initially suspicious of an OOM problem, but I don't see any sign of
>> that.
>>
>> In the line below, "Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>> 0.004_0.264.inte 0.004_0.264.stru" - is there a way to tell what Swift
>> thinks the exit status of these failing apps is? What is causing the
>> whole run to abort?
>>
>> It almost looks like one app, or a small number of them, is generating a
>> SIGABRT, but that's not clear, and I have no stdout/err info from the app
>> processes.
>>
>> I'll look for some more detailed debug options, and/or add some wrappers
>> around the apps.
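>>
>> For the debug options, I'm planning to turn up Turbine/ADLB logging on the
>> next run. From memory (so treat these variable names as assumptions to be
>> checked against the Swift/T docs), something like:
>>
>>   # Assumed logging knobs; names from memory, to be verified first.
>>   export TURBINE_LOG=1    # verbose Turbine logging
>>   export ADLB_DEBUG=1     # ADLB debug output (if ADLB was built with it)
>>   PPN=16 turbine-cray-run.zsh -s titan.settings -n 128 stacking-faults3.tcl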
>>
>> Thanks,
>>
>> - Mike
>>
>> ---
>>
>> T$ cat output.txt.2115044.out
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.264 1 0.75 ../Si.cll
>> 0.004_0.264.inte 0.004_0.264.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.272 1 0.75 ../Si.cll
>> 0.004_0.272.inte 0.004_0.272.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.052000000000000005 1 0.75 ../Si.cll
>> 0.260_0.052.inte 0.260_0.052.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.084 1 0.75 ../Si.cll
>> 0.004_0.084.inte 0.004_0.084.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.024 1 0.75 ../Si.cll
>> 0.260_0.024.inte 0.260_0.024.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.056 1 0.75 ../Si.cll
>> 0.004_0.056.inte 0.004_0.056.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.364 1 0.75 ../Si.cll
>> 0.004_0.364.inte 0.004_0.364.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.004 0.07200000000000001 1 0.75 ../Si.cll
>> 0.004_0.072.inte 0.004_0.072.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>>
>> Swift: external command failed:
>> /lustre/atlas/proj-shared/mat049/scu/build/DiffuseCode/titan/discus/prog/discus
>> --macro ../stacking-faults3.mac 0.26 0.044 1 0.75 ../Si.cll
>> 0.260_0.044.inte 0.260_0.044.stru
>>
>> Swift: killing MPI job...
>> ADLB_Abort(1)
>> MPI_Abort(1)
>> Application 7706378 exit codes: 134
>> Application 7706378 exit signals: Killed
>> Application 7706378 resources: utime ~5s, stime ~19s, Rss ~11016,
>> inblocks ~48986, outblocks ~205
>> T$
>>
>> ---
>>
>> T$ cat output.txt
>> bash_profile: loading modules
>> DB /ccs/home/wildemj/.modules: loading modules...
>> Turbine: turbine-aprun.sh
>> 10/02/2014 09:15PM
>>
>> TURBINE_HOME: /lustre/atlas2/mat049/proj-shared/sfw/compute/turbine
>> SCRIPT:
>> /lustre/atlas2/mat049/proj-shared/wildemj/swift-discus/stacking-faults3.tcl
>> PROCS:        128
>> NODES:        8
>> PPN:          16
>> WALLTIME:     00:15:00
>>
>> TURBINE_WORKERS: 127
>> ADLB_SERVERS:    1
>>
>> TCLSH: /lustre/atlas2/mat049/proj-shared/sfw/tcl-8.6.2/bin/tclsh8.6
>>
>> JOB OUTPUT:
>>
>> Rank 2 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n3] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>> Rank 4 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n3] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>> Rank 6 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n1] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 6
>> Rank 3 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n2] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>> Rank 5 [Thu Oct  2 21:15:32 2014] [c14-0c1s1n0] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>> Rank 1 [Thu Oct  2 21:15:32 2014] [c12-0c1s1n2] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> Rank 7 [Thu Oct  2 21:15:32 2014] [c14-0c1s0n0] application called
>> MPI_Abort(MPI_COMM_WORLD, 1) - process 7
>> _pmiu_daemon(SIGCHLD): [NID 06114] [c14-0c1s1n0] [Thu Oct  2 21:15:32
>> 2014] PE RANK 5 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 06076] [c14-0c1s1n2] [Thu Oct  2 21:15:32
>> 2014] PE RANK 3 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 06077] [c14-0c1s1n3] [Thu Oct  2 21:15:32
>> 2014] PE RANK 4 exit signal Aborted
>> _pmiu_daemon(SIGCHLD): [NID 04675] [c12-0c1s1n3] [Thu Oct  2 21:15:32
>> 2014] PE RANK 2 exit signal Aborted
>> [NID 04675] 2014-10-02 21:15:32 Apid 7706378: initiated application
>> termination
>> _pmiu_daemon(SIGCHLD): [NID 06115] [c14-0c1s1n1] [Thu Oct  2 21:15:32
>> 2014] PE RANK 6 exit signal Aborted
>> T$
>> T$
>> T$
>> T$ cat turbine.log
>> JOB:               2115044
>> COMMAND:           stacking-faults3.tcl
>> HOSTNAME:          ccs.ornl.gov
>> SUBMITTED:         10/02/2014 09:14PM
>> PROCS:             128
>> PPN:               16
>> NODES:             8
>> TURBINE_WORKERS:
>> ADLB_SERVERS:
>> WALLTIME:          00:15:00
>> ADLB_EXHAUST_TIME:
>> T$
>>
>>
>>
>>
>> --
>> Michael Wilde
>> Mathematics and Computer Science          Computation Institute
>> Argonne National Laboratory               The University of Chicago
>>
>> _______________________________________________
>> ExM-user mailing list
>> ExM-user at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>>
>
>
>
> _______________________________________________
> ExM-user mailing list
> ExM-user at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>
>
> --
> Michael Wilde
> Mathematics and Computer Science          Computation Institute
> Argonne National Laboratory               The University of Chicago
>
>
>
>
> _______________________________________________
> ExM-user mailing list
> ExM-user at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>
>