[Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Cormac Garvey cormac.t.garvey at gmail.com
Thu Sep 24 16:15:51 CDT 2020


Another process shows this backtrace:

(gdb) where
#0  0x00002ae77d3b4fe8 in mv2_shm_bcast () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ae77d33a586 in MPIR_Allreduce_impl () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#6  0x00002ae77d6c5368 in MPIR_Comm_create_intra () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#7  0x00002ae77d6c56fc in PMPI_Comm_create () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#8  0x00002ae77d6d096d in create_allgather_comm () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#9  0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#10 0x00002ae77d33adcf in PMPI_Allreduce () from
/opt/mvapich2-2.3.3/lib/libmpi.so.12
#11 0x00002ae77cd8e91b in darshan_core_shutdown () at
../darshan-runtime/lib/darshan-core.c:641
#12 0x00002ae77cd89d73 in MPI_Finalize () at
../darshan-runtime/lib/darshan-core-init-finalize.c:85
#13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized
out>) at ior.c:156
#14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6
#15 0x0000000000402fee in _start ()
(gdb)
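
For context on frames #11-#12 above: Darshan's preloaded library
exports its own MPI_Finalize, runs its shutdown aggregation, and only
then calls the real finalize, which is why the hang shows up inside an
MPI_Allreduce. Below is a minimal sketch of that interposition
pattern, not Darshan's actual code (profiler_shutdown is a
hypothetical stand-in for darshan_core_shutdown; the PMPI_* calls are
the standard MPI profiling interface):

    #include <mpi.h>

    /* hypothetical stand-in for darshan_core_shutdown() */
    static void profiler_shutdown(void)
    {
        int local = 0, global = 0;
        /* collective aggregation across all ranks -- the frames above
         * show this process stuck inside this kind of Allreduce */
        PMPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD);
    }

    /* the LD_PRELOADed library's MPI_Finalize shadows the one in
     * libmpi.so, so it runs first on every rank */
    int MPI_Finalize(void)
    {
        profiler_shutdown();     /* must complete on all ranks ... */
        return PMPI_Finalize();  /* ... before the real finalize runs */
    }

Since every rank must complete the shutdown reduction before reaching
PMPI_Finalize, this process is still inside the Allreduce while the
one in the earlier backtrace (quoted below) has already reached the
barrier inside PMPI_Finalize.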


Thanks,
Cormac.

On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <cormac.t.garvey at gmail.com>
wrote:

> Thanks Rob and Shane.
>
> Here is a gdb backtrace from one of the hung ior processes (run with
> the Darshan LD_PRELOAD).
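>
> (For reference, the job is launched with the Darshan library
> preloaded, along the lines of the following; the install path and
> rank count here are illustrative:
>
>   LD_PRELOAD=/opt/darshan/lib/libdarshan.so mpirun -np 16 ./ior
>
> The hang only appears when LD_PRELOAD is set.)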
>
> (gdb) where
> #0  0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #1  0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #2  0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #3  0x00002ba2e38d8ffb in MPIR_Barrier_impl () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #4  0x00002ba2e3c8900b in PMPI_Finalize () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #5  0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized
> out>) at ior.c:156
> #6  0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6
> #7  0x0000000000402fee in _start ()
> (gdb)
>
> Thanks for your support,
> Cormac.
>
> On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <robl at mcs.anl.gov>
> wrote:
>
>> On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:
>> > The IOR job hangs at the end if I export the Darshan LD_PRELOAD
>> > (it runs correctly if I remove the LD_PRELOAD).
>>
>> If at all possible, can you attach a debugger to some of the hung IOR
>> processes and give us a backtrace?  It will be really helpful to know
>> what operation these processes are stuck in.
>>
>> I don't know your exact environment, but attaching to processes
>> probably looks like this:
>>
>> - ask PBS where your jobs are running
>> - ssh to a client node
>> - get the pid of the ior process on that node
>> - "gdb -ex where -ex quit -p 1234" (or whatever that process id is)
>>
>> ==rob
>>
>> >
>> > When I kill the job, I get the following information in the PBS
>> > stderr file.
>> >
>> > "darshan_library_warning: unable to write header to file
>> > /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-
>> > 2174583250033636718.darshan_partial."
>> >
>> > The IOR benchmark completes (i.e. I can see the I/O stats), but it
>> > does not appear to exit correctly (it remains in a hung state until
>> > I kill it).
>> >
>> >
>> > Running a similar job with IOR+MPICH or IOR+OpenMPI works fine with
>> > Darshan.
>> >
>> > Any ideas what I am missing?
>> >
>> > Thanks for your support.
>> >
>> > Regards,
>> > Cormac.
>> > _______________________________________________
>> > Darshan-users mailing list
>> > Darshan-users at lists.mcs.anl.gov
>> > https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>>
>