[Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Snyder, Shane ssnyder at mcs.anl.gov
Tue Oct 27 12:01:43 CDT 2020


Hi Cormac,

Apologies for taking a while to circle back to this, but I did want to have a closer look before proceeding with our next release.

I was able to find a bug in Darshan's shutdown code that appears to trigger the deadlock you are seeing. The strange thing is that, in theory, I should have been able to reproduce this with MPICH, which we do the majority of our testing with, but for some reason MPICH is able to proceed despite the faulty logic. Very strange.
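
For reference, the failure mode matches the backtraces further down in this thread: one rank is stuck in an MPI_Allreduce issued from darshan_core_shutdown() while another rank is already in the barrier inside MPI_Finalize. As a rough illustration of that general pattern (a toy sketch, not the actual Darshan code or the fix -- the real bug is subtler, which is presumably why MPICH manages to make progress), mismatched collectives at shutdown deadlock like so:

/* Toy illustration only: intentionally deadlocks when run with 2+ ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    long local = 1, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 enters a collective the other ranks never join,
         * analogous to the PMPI_Allreduce frame in the first trace. */
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        printf("sum = %ld\n", sum);
    }

    /* The remaining ranks go straight to the barrier inside MPI_Finalize,
     * analogous to the MPIR_Barrier frames in the second trace, so
     * neither side can make progress. */
    MPI_Finalize();
    return 0;
}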

At any rate, I've committed a fix to our master branch, which you are welcome to try out. This will be included in our next Darshan release, which we are hoping to wrap up over the next couple weeks.

Thanks again for the report,
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Snyder, Shane <ssnyder at mcs.anl.gov>
Sent: Monday, September 28, 2020 12:18 PM
To: Cormac Garvey <cormac.t.garvey at gmail.com>; Latham, Robert J. <robl at mcs.anl.gov>
Cc: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Hi Cormac,

Thanks for the additional details! I was able to reproduce this myself using the same mvapich2/darshan versions you provided.

I also tried mvapich2 2.3.4 and hit a similar problem, though the stack traces were slightly different.

FWIW, I did not see these errors when testing 3.1.x versions (e.g., 3.1.8) of Darshan, so you could consider rolling back to an older version if possible. That said, newer versions include a couple of bug fixes that are missing from the 3.1.x series, so you may run into compilation issues with 3.1.x -- I had to hack around those myself just to test this -- but it may still be worth trying.

At any rate, I just wanted to provide a quick update that we are looking into the problem; hopefully we will have a better understanding of what's going wrong soon.

--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Cormac Garvey <cormac.t.garvey at gmail.com>
Sent: Thursday, September 24, 2020 4:15 PM
To: Latham, Robert J. <robl at mcs.anl.gov>
Cc: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Another process has this backtrace.

(gdb) where
#0  0x00002ae77d3b4fe8 in mv2_shm_bcast () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ae77d33a586 in MPIR_Allreduce_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#6  0x00002ae77d6c5368 in MPIR_Comm_create_intra () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#7  0x00002ae77d6c56fc in PMPI_Comm_create () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#8  0x00002ae77d6d096d in create_allgather_comm () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#9  0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#10 0x00002ae77d33adcf in PMPI_Allreduce () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#11 0x00002ae77cd8e91b in darshan_core_shutdown () at ../darshan-runtime/lib/darshan-core.c:641
#12 0x00002ae77cd89d73 in MPI_Finalize () at ../darshan-runtime/lib/darshan-core-init-finalize.c:85
#13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156
#14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6
#15 0x0000000000402fee in _start ()
(gdb)


Thanks,
Cormac.

On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <cormac.t.garvey at gmail.com> wrote:
Thanks Rob and Shane.

Here is the gdb traceback attached to one of the hung ior processes (using Darshan LD_PRELOAD).

(gdb) where
#0  0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ba2e38d8ffb in MPIR_Barrier_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ba2e3c8900b in PMPI_Finalize () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156
#6  0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000402fee in _start ()
(gdb)

Thanks for your support,
Cormac.

On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <robl at mcs.anl.gov> wrote:
On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:
> The IOR job hangs at the end if I export the darshan LD_PRELOAD (Runs
> correctly if I remove the LD_PRELOAD)

If at all possible, can you attach a debugger to some of the hung IOR
jobs and give us a backtrace?  It will be really helpful to know what
operation these processes are stuck in.

I don't know your exact environment, but attaching to processes probably
looks like:

- ask PBS where your jobs are running
- ssh to a client node
- get the pid of the ior process on that node
- "gdb -ex where -ex quit -p 1234" (or whatever that process id is)

==rob

>
> When I kill the job, I get the following information in the PBS stderr
> file.
>
> "darshan_library_warning: unable to write header to file
> /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-
> 2174583250033636718.darshan_partial."
>
> The IOR benchmark completes (i.e., I can see the I/O stats), but it does
> not appear to exit correctly (it remains in a hung state until I kill it).
>
>
> Running a similar job with IOR+mpich or IOR+OpenMPI works fine with
> darshan.
>
> Any ideas what I am missing?
>
> Thanks for your support.
>
> Regards,
> Cormac.