<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Cormac,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks for the additional details! I was able to reproduce this myself using the same mvapich2/darshan versions you provided.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I also tried mvapich2 2.3.4 and hit a similar problem, though the stack traces were slightly different.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
FWIW, I did not see these errors when testing 3.1.x versions of Darshan (e.g., 3.1.8), so you could consider rolling back to an older version if possible. Be aware that 3.1.x lacks a couple of bug fixes that are in newer versions, so you may run into compilation issues;
I had to hack around these myself just to test this. It may still be worth trying.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
At any rate, I just wanted to provide a quick update: we are looking into the problem, and hopefully we will have a better understanding of what's going wrong soon.
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--Shane<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Darshan-users <darshan-users-bounces@lists.mcs.anl.gov> on behalf of Cormac Garvey <cormac.t.garvey@gmail.com><br>
<b>Sent:</b> Thursday, September 24, 2020 4:15 PM<br>
<b>To:</b> Latham, Robert J. <robl@mcs.anl.gov><br>
<b>Cc:</b> darshan-users@lists.mcs.anl.gov <darshan-users@lists.mcs.anl.gov><br>
<b>Subject:</b> Re: [Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)</font>
<div> </div>
</div>
<div>
<div dir="ltr">Another process has this traceback.
<div><br>
</div>
<div>(gdb) where<br>
#0 0x00002ae77d3b4fe8 in mv2_shm_bcast () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#1 0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#2 0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#3 0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#4 0x00002ae77d33a586 in MPIR_Allreduce_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#5 0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#6 0x00002ae77d6c5368 in MPIR_Comm_create_intra () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#7 0x00002ae77d6c56fc in PMPI_Comm_create () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#8 0x00002ae77d6d096d in create_allgather_comm () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#9 0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#10 0x00002ae77d33adcf in PMPI_Allreduce () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#11 0x00002ae77cd8e91b in darshan_core_shutdown () at ../darshan-runtime/lib/darshan-core.c:641<br>
#12 0x00002ae77cd89d73 in MPI_Finalize () at ../darshan-runtime/lib/darshan-core-init-finalize.c:85<br>
#13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156<br>
#14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6<br>
#15 0x0000000000402fee in _start ()<br>
(gdb)<br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Cormac.</div>
</div>
<br>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <<a href="mailto:cormac.t.garvey@gmail.com">cormac.t.garvey@gmail.com</a>> wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px 0.8ex; border-left:1px solid rgb(204,204,204); padding-left:1ex">
<div dir="ltr">Thanks Rob and Shane.
<div><br>
</div>
<div>Here is the gdb traceback from one of the hung ior processes (using Darshan LD_PRELOAD).</div>
<div><br>
</div>
<div>(gdb) where<br>
#0 0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#1 0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#2 0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#3 0x00002ba2e38d8ffb in MPIR_Barrier_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#4 0x00002ba2e3c8900b in PMPI_Finalize () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>
#5 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156<br>
#6 0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6<br>
#7 0x0000000000402fee in _start ()<br>
(gdb)<br>
</div>
<div><br>
</div>
<div>Thanks for your support,</div>
<div>Cormac.</div>
</div>
<br>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <<a href="mailto:robl@mcs.anl.gov" target="_blank">robl@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px 0.8ex; border-left:1px solid rgb(204,204,204); padding-left:1ex">
On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:<br>
> The IOR job hangs at the end if I export the darshan LD_PRELOAD (Runs<br>
> correctly if I remove the LD_PRELOAD)<br>
<br>
If at all possible, can you attach a debugger to some of the hung IOR<br>
jobs and give us a backtrace? It will be really helpful to know what<br>
operation these processes are stuck in.<br>
<br>
I don't know your exact environment but attaching to processes probably<br>
looks like<br>
<br>
- ask PBS where your jobs are running<br>
- ssh to a client node<br>
- get the pid of the ior process on that node<br>
- "gdb -ex where -ex quit -p 1234" (or whatever that process id is)<br>
<br>
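The last two steps could be scripted roughly as below. This is only a sketch: it assumes the benchmark binary is named exactly "ior" and that pgrep and gdb are installed on the node, and backtrace_pid is a hypothetical helper name.<br>
<pre>
```shell
# Hypothetical helper: print a backtrace for one process id.
# -batch makes gdb exit once the -ex commands have run.
backtrace_pid() {
    gdb -batch -ex where -p "$1"
}

# Assumption: the benchmark binary is named exactly "ior".
# pgrep -x matches the process name exactly; with no match the loop body never runs.
for pid in $(pgrep -x ior); do
    echo "=== backtrace for pid $pid ==="
    backtrace_pid "$pid"
done
```
</pre>
Run it on each node that hosts ior ranks; attaching generally requires being the process owner or root.<br>
<br>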
==rob<br>
<br>
> <br>
> When I kill the job, I get the following information in PBS stderr<br>
> file.<br>
> <br>
> "darshan_library_warning: unable to write header to file<br>
> /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-<br>
> 2174583250033636718.darshan_partial."<br>
> <br>
> The IOR benchmark completes (i.e., I can see the I/O stats), but does not<br>
> appear to exit correctly (remains in a hung state until I kill it)<br>
> <br>
> <br>
> Running a similar job with IOR+mpich or IOR+OpenMPI works fine with<br>
> darshan.<br>
> <br>
> Any ideas what I am missing?<br>
> <br>
> Thanks for your support.<br>
> <br>
> Regards,<br>
> Cormac.<br>
> _______________________________________________<br>
> Darshan-users mailing list<br>
> <a href="mailto:Darshan-users@lists.mcs.anl.gov" target="_blank">Darshan-users@lists.mcs.anl.gov</a><br>
> <a href="https://lists.mcs.anl.gov/mailman/listinfo/darshan-users" rel="noreferrer" target="_blank">
https://lists.mcs.anl.gov/mailman/listinfo/darshan-users</a><br>
</blockquote>
</div>
</blockquote>
</div>
</div>
</body>
</html>