[Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)
Cormac Garvey
cormac.t.garvey at gmail.com
Tue Oct 27 14:29:45 CDT 2020
Thanks Shane for working through this issue and getting back to me, it's
much appreciated.
Regards,
Cormac.
On Tue, Oct 27, 2020 at 12:01 PM Snyder, Shane <ssnyder at mcs.anl.gov> wrote:
> Hi Cormac,
>
> Apologies for taking a while to circle back to this, but I wanted to take
> a closer look before proceeding with our next release.
>
> I was able to find a bug in Darshan's shutdown code that appears to
> trigger this deadlock you are seeing. The strange thing is that,
> theoretically, I should have been able to reproduce this using MPICH, which
> we do the majority of our testing with. But, for some reason MPICH is able
> to proceed despite the faulty logic. Very strange.
>
> At any rate, I've committed a fix to our master branch, which you are
> welcome to try out. This will be included in our next Darshan release,
> which we are hoping to wrap up over the next couple weeks.
>
> Thanks again for the report,
> --Shane
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of Snyder, Shane <ssnyder at mcs.anl.gov>
> *Sent:* Monday, September 28, 2020 12:18 PM
> *To:* Cormac Garvey <cormac.t.garvey at gmail.com>; Latham, Robert J. <
> robl at mcs.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3
> (IOR)
>
> Hi Cormac,
>
> Thanks for the additional details! I was able to reproduce this myself
> using the same mvapich2/darshan versions you provided.
>
> I also tried mvapich2 2.3.4 and hit a similar problem, though the stack
> traces were slightly different.
>
> FWIW, I did not see these errors when testing 3.1.x versions (e.g., 3.1.8)
> of Darshan, so you could consider rolling back to an older version if
> possible. That said, 3.1.x versions lack a couple of bug fixes from newer
> versions, so you might hit compilation issues, but it may be worth
> trying -- I had to hack around these myself just to test this.
>
> At any rate, I just wanted to provide a quick update: we are looking
> into the problem and hope to have a better understanding of what's
> going wrong soon.
>
> --Shane
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of Cormac Garvey <cormac.t.garvey at gmail.com>
> *Sent:* Thursday, September 24, 2020 4:15 PM
> *To:* Latham, Robert J. <robl at mcs.anl.gov>
> *Cc:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3
> (IOR)
>
> Another process has this traceback.
>
> (gdb) where
> #0 0x00002ae77d3b4fe8 in mv2_shm_bcast () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #1 0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #2 0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #3 0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #4 0x00002ae77d33a586 in MPIR_Allreduce_impl () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #5 0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #6 0x00002ae77d6c5368 in MPIR_Comm_create_intra () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #7 0x00002ae77d6c56fc in PMPI_Comm_create () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #8 0x00002ae77d6d096d in create_allgather_comm () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #9 0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #10 0x00002ae77d33adcf in PMPI_Allreduce () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #11 0x00002ae77cd8e91b in darshan_core_shutdown () at
> ../darshan-runtime/lib/darshan-core.c:641
> #12 0x00002ae77cd89d73 in MPI_Finalize () at
> ../darshan-runtime/lib/darshan-core-init-finalize.c:85
> #13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized
> out>) at ior.c:156
> #14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6
> #15 0x0000000000402fee in _start ()
> (gdb)
>
>
> Thanks,
> Cormac.
>
> On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <cormac.t.garvey at gmail.com>
> wrote:
>
> Thanks Rob and Shane.
>
> Here is the gdb traceback attached to one of the hung ior processes (using
> Darshan LD_PRELOAD).
>
> (gdb) where
> #0 0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #1 0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #2 0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #3 0x00002ba2e38d8ffb in MPIR_Barrier_impl () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #4 0x00002ba2e3c8900b in PMPI_Finalize () from
> /opt/mvapich2-2.3.3/lib/libmpi.so.12
> #5 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized
> out>) at ior.c:156
> #6 0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6
> #7 0x0000000000402fee in _start ()
> (gdb)
>
> Thanks for your support,
> Cormac.
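
One way to compare the two backtraces in this thread is to reduce each to the outermost PMPI_ call and the function that made it. The sketch below is only a reading aid, not part of Darshan or MVAPICH2: the helper names (frame_functions, outermost_mpi_call) are hypothetical, and the sample strings are abbreviated copies of the frames quoted in this thread.

```python
import re

# One gdb frame header, e.g. "#4 0x00002ba2e3c8900b in PMPI_Finalize () from".
# Leading mail quote markers ("> ") are tolerated.
FRAME_RE = re.compile(
    r"^[>\s]*#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\("
)


def frame_functions(trace_text):
    """Return function names from a gdb `where` listing, innermost first."""
    names = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(2))
    return names


def outermost_mpi_call(trace_text, prefix="PMPI_"):
    """Return (mpi_call, caller) for the outermost frame starting with prefix.

    Returns None if no frame matches; caller is None if the match is the
    last frame in the listing.
    """
    names = frame_functions(trace_text)
    for index in range(len(names) - 1, -1, -1):
        if names[index].startswith(prefix):
            caller = names[index + 1] if index + 1 < len(names) else None
            return names[index], caller
    return None


# Abbreviated from the first traceback sent (Sep 24, 4:12 PM).
TRACE_A = """\
#3 0x00002ba2e38d8ffb in MPIR_Barrier_impl () from
#4 0x00002ba2e3c8900b in PMPI_Finalize () from
#5 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized
"""

# Abbreviated from the second traceback sent (Sep 24, 4:15 PM).
TRACE_B = """\
#7 0x00002ae77d6c56fc in PMPI_Comm_create () from
#10 0x00002ae77d33adcf in PMPI_Allreduce () from
#11 0x00002ae77cd8e91b in darshan_core_shutdown () at
#12 0x00002ae77cd89d73 in MPI_Finalize () at
"""

print(outermost_mpi_call(TRACE_A))  # ('PMPI_Finalize', 'ior_main')
print(outermost_mpi_call(TRACE_B))  # ('PMPI_Allreduce', 'darshan_core_shutdown')
```

Read this way, one process is already waiting in the barrier inside PMPI_Finalize while another is still inside an Allreduce issued from darshan_core_shutdown. That is only what the two traces show; it does not by itself identify the bug in the shutdown code.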
>
> On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <robl at mcs.anl.gov>
> wrote:
>
> On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:
> > The IOR job hangs at the end if I export the darshan LD_PRELOAD (Runs
> > correctly if I remove the LD_PRELOAD)
>
> If at all possible, can you attach a debugger to some of the hung IOR
> jobs and give us a backtrace? It will be really helpful to know what
> operation these processes are stuck in.
>
> I don't know your exact environment but attaching to processes probably
> looks like
>
> - ask PBS where your jobs are running
> - ssh to a client node
> - get the pid of the ior process on that node
> - "gdb -ex where -ex quit -p 1234" (or whatever that process id is)
>
> ==rob
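
The last two steps above could be scripted roughly as follows, once logged in to a compute node. This is a minimal sketch under assumptions not stated in the thread: the process is named exactly "ior", pgrep and gdb are installed on the node, and the user is allowed to attach to the processes. The helper names are hypothetical.

```python
import subprocess


def gdb_where_command(pid):
    """Build the one-shot gdb invocation quoted above for a single PID."""
    return ["gdb", "-ex", "where", "-ex", "quit", "-p", str(int(pid))]


def find_pids(process_name="ior"):
    """Return PIDs whose process name matches exactly (pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        capture_output=True, text=True, check=False,
    )
    return [int(token) for token in result.stdout.split()]


def collect_backtraces(process_name="ior"):
    """Attach gdb to each matching process in turn; return {pid: output}."""
    traces = {}
    for pid in find_pids(process_name):
        proc = subprocess.run(
            gdb_where_command(pid),
            stdin=subprocess.DEVNULL,  # keep gdb from waiting on a quit prompt
            capture_output=True, text=True, check=False,
        )
        traces[pid] = proc.stdout + proc.stderr
    return traces
```

Calling collect_backtraces() on each node and printing the returned dictionary gives one "where" listing per hung process, which makes it easier to spot ranks that are stuck in different calls.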
>
> >
> > When I kill the job, I get the following information in the PBS stderr
> > file.
> >
> > "darshan_library_warning: unable to write header to file
> > /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-
> > 2174583250033636718.darshan_partial."
> >
> > The IOR benchmark completes (i.e., I can see the I/O stats), but does not
> > appear to exit correctly (remains in a hung state until I kill it)
> >
> >
> > Running a similar job with IOR+mpich or IOR+OpenMPI works fine with
> > darshan.
> >
> > Any ideas what I am missing?
> >
> > Thanks for your support.
> >
> > Regards,
> > Cormac.
> > _______________________________________________
> > Darshan-users mailing list
> > Darshan-users at lists.mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>
>