[Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Snyder, Shane ssnyder at mcs.anl.gov
Mon Sep 28 12:18:09 CDT 2020


Hi Cormac,

Thanks for the additional details! I was able to reproduce this myself using the same mvapich2/darshan versions you provided.

I also tried mvapich2 2.3.4 and hit a similar problem, though the stack traces were slightly different.

FWIW, I did not see these errors when testing 3.1.x versions of Darshan (e.g., 3.1.8), so you could consider rolling back to an older version if possible. That said, the 3.1.x series is missing a couple of bug fixes that went into newer versions, which might cause you compilation issues, but it may still be worth trying -- I had to hack around those issues myself just to run this test.
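If you do try 3.1.x, the runtime build for LD_PRELOAD use is roughly the following sketch -- the prefix, log directory, and MPI compiler path below are placeholders for your site, not anything specific to your setup:

    # placeholder paths -- adjust prefix, log directory, and MPI compiler for your site
    cd darshan-3.1.8/darshan-runtime
    ./configure --prefix=$HOME/darshan-3.1.8 \
                --with-mem-align=8 \
                --with-log-path=$HOME/darshan_logs \
                --with-jobid-env=PBS_JOBID \
                CC=/opt/mvapich2-2.3.3/bin/mpicc
    make && make install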

At any rate, I just wanted to provide a quick update that we are looking into the problem and will hopefully have a better understanding of what's going wrong soon.

--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Cormac Garvey <cormac.t.garvey at gmail.com>
Sent: Thursday, September 24, 2020 4:15 PM
To: Latham, Robert J. <robl at mcs.anl.gov>
Cc: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Another process has this traceback.

(gdb) where
#0  0x00002ae77d3b4fe8 in mv2_shm_bcast () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ae77d33a586 in MPIR_Allreduce_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#6  0x00002ae77d6c5368 in MPIR_Comm_create_intra () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#7  0x00002ae77d6c56fc in PMPI_Comm_create () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#8  0x00002ae77d6d096d in create_allgather_comm () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#9  0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#10 0x00002ae77d33adcf in PMPI_Allreduce () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#11 0x00002ae77cd8e91b in darshan_core_shutdown () at ../darshan-runtime/lib/darshan-core.c:641
#12 0x00002ae77cd89d73 in MPI_Finalize () at ../darshan-runtime/lib/darshan-core-init-finalize.c:85
#13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156
#14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6
#15 0x0000000000402fee in _start ()
(gdb)


Thanks,
Cormac.

On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <cormac.t.garvey at gmail.com> wrote:
Thanks Rob and Shane.

Here is the gdb traceback attached to one of the hung ior processes (using Darshan LD_PRELOAD).
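(For anyone following along, enabling Darshan this way looks roughly like the sketch below; the install path, rank count, and IOR options are illustrative placeholders, not the exact command used here.)

    # illustrative only -- substitute your darshan install path, rank count, and IOR options
    export LD_PRELOAD=/path/to/darshan/lib/libdarshan.so
    mpiexec -np 16 -genvall ./ior -a POSIX -b 1g -t 1m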

(gdb) where
#0  0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ba2e38d8ffb in MPIR_Barrier_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ba2e3c8900b in PMPI_Finalize () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156
#6  0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000402fee in _start ()
(gdb)

Thanks for your support,
Cormac.

On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <robl at mcs.anl.gov> wrote:
On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:
> The IOR job hangs at the end if I export the darshan LD_PRELOAD (it runs
> correctly if I remove the LD_PRELOAD)

If at all possible, can you attach a debugger to some of the hung IOR
jobs and give us a backtrace?  It will be really helpful to know what
operation these processes are stuck in.

I don't know your exact environment, but attaching to processes probably
looks like this (a rough sketch follows the steps below):

- ask PBS where your jobs are running
- ssh to a client node
- get the pid of the ior process on that node
- "gdb -ex where -ex quit -p 1234" (or whatever that process id is)

==rob

>
> When I kill the job, I get the following information in the PBS stderr
> file.
>
> "darshan_library_warning: unable to write header to file
> /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-
> 2174583250033636718.darshan_partial."
>
> The IOR benchmark completes (i.e., I can see the I/O stats), but does not
> appear to exit correctly (it remains in a hung state until I kill it).
>
>
> Running a similar job with IOR+mpich or IOR+OpenMPI works fine with
> darshan.
>
> Any ideas what I am missing?
>
> Thanks for your support.
>
> Regards,
> Cormac.
> _______________________________________________
> Darshan-users mailing list
> Darshan-users at lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users