<div dir="ltr">Another process has this traceback.<div><br></div><div>(gdb) where<br>#0 0x00002ae77d3b4fe8 in mv2_shm_bcast () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#1 0x00002ae77d3985fe in MPIR_Shmem_Bcast_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#2 0x00002ae77d38f24a in MPIR_Allreduce_two_level_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#3 0x00002ae77d391d01 in MPIR_Allreduce_index_tuned_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#4 0x00002ae77d33a586 in MPIR_Allreduce_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#5 0x00002ae77d72f42f in MPIR_Get_contextid_sparse_group () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#6 0x00002ae77d6c5368 in MPIR_Comm_create_intra () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#7 0x00002ae77d6c56fc in PMPI_Comm_create () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#8 0x00002ae77d6d096d in create_allgather_comm () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#9 0x00002ae77d3b221c in mv2_increment_allgather_coll_counter () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#10 0x00002ae77d33adcf in PMPI_Allreduce () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#11 0x00002ae77cd8e91b in darshan_core_shutdown () at ../darshan-runtime/lib/darshan-core.c:641<br>#12 0x00002ae77cd89d73 in MPI_Finalize () at ../darshan-runtime/lib/darshan-core-init-finalize.c:85<br>#13 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156<br>#14 0x00002ae77db9a505 in __libc_start_main () from /lib64/libc.so.6<br>#15 0x0000000000402fee in _start ()<br>(gdb)<br></div><div><br></div><div><br></div><div>Thanks,</div><div>Cormac.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Sep 24, 2020 at 4:12 PM Cormac Garvey <<a href="mailto:cormac.t.garvey@gmail.com">cormac.t.garvey@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thanks Rob and Shane.<div><br></div><div>Here is the gdb traceback attached to one of the hung ior processes (using Darshan LD_PRELOAD).</div><div><br></div><div>(gdb) where<br>#0 0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#1 0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#2 0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#3 0x00002ba2e38d8ffb in MPIR_Barrier_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#4 0x00002ba2e3c8900b in PMPI_Finalize () from /opt/mvapich2-2.3.3/lib/libmpi.so.12<br>#5 0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156<br>#6 0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6<br>#7 0x0000000000402fee in _start ()<br>(gdb)<br></div><div><br></div><div>Thanks for your support,</div><div>Cormac.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <<a href="mailto:robl@mcs.anl.gov" target="_blank">robl@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:<br>
> The IOR job hangs at the end if I export the darshan LD_PRELOAD (runs
> correctly if I remove the LD_PRELOAD).
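
For reference, a Darshan preload launch under MVAPICH2 typically looks
something like the lines below; the library path, host file, and rank
count are placeholders, not taken from this report:

    # Illustrative only: mpirun_rsh passes environment variables on the
    # command line, so libdarshan.so gets preloaded into every IOR rank.
    mpirun_rsh -np 16 -hostfile ./hosts \
        LD_PRELOAD=/opt/darshan/lib/libdarshan.so ./ior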

If at all possible, can you attach a debugger to some of the hung IOR
processes and give us a backtrace? It will be really helpful to know what
operation these processes are stuck in.

I don't know your exact environment, but attaching to the processes
probably looks like this:

- ask PBS where your jobs are running
- ssh to a client node
- get the pid of the ior process on that node
- "gdb -ex where -ex quit -p 1234" (or whatever that process id is)

==rob

> 
> When I kill the job, I get the following information in the PBS stderr
> file:
> 
> "darshan_library_warning: unable to write header to file
> /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-
> 2174583250033636718.darshan_partial."
> 
> The IOR benchmark completes (i.e. I can see the I/O stats), but it does
> not appear to exit correctly (it remains in a hung state until I kill it).
> 
> Running a similar job with IOR+mpich or IOR+OpenMPI works fine with
> darshan.
> 
> Any ideas what I am missing?
> 
> Thanks for your support.
> 
> Regards,
> Cormac.
> _______________________________________________
> Darshan-users mailing list
> Darshan-users@lists.mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users