[Darshan-users] Darshan v3.2.1 hangs with mvapich2 2.3.3 (IOR)

Cormac Garvey cormac.t.garvey at gmail.com
Thu Sep 24 16:12:16 CDT 2020


Thanks Rob and Shane.

Here is the gdb backtrace from attaching to one of the hung ior processes
(run with the Darshan LD_PRELOAD set).

(gdb) where
#0  0x00002ba2e394fc8e in MPIDI_CH3I_SHMEM_COLL_Barrier_gather () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#1  0x00002ba2e393567b in MPIR_Barrier_intra_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#2  0x00002ba2e39358c9 in MPIR_Barrier_MV2 () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#3  0x00002ba2e38d8ffb in MPIR_Barrier_impl () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#4  0x00002ba2e3c8900b in PMPI_Finalize () from /opt/mvapich2-2.3.3/lib/libmpi.so.12
#5  0x000000000040cb96 in ior_main (argc=<optimized out>, argv=<optimized out>) at ior.c:156
#6  0x00002ba2e4138505 in __libc_start_main () from /lib64/libc.so.6
#7  0x0000000000402fee in _start ()
(gdb)
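
For anyone who wants to collect the same kind of backtrace from several
processes at once, here is a minimal sketch. It assumes gdb is installed on
the compute node; the function name dump_backtraces and the GDB override
variable are made up for illustration, and it uses gdb's -batch mode rather
than the "-ex quit" form quoted further down.

```shell
# Print a backtrace for each PID given as an argument.
# GDB may be overridden (e.g. GDB=echo) for a dry run that only
# shows the arguments that would be passed to gdb.
dump_backtraces() {
    for pid in "$@"; do
        echo "=== pid $pid ==="
        # -batch: run the -ex commands, then exit (detaching from the process)
        "${GDB:-gdb}" -batch -ex where -p "$pid"
    done
}

# Usage on a compute node (assumes the processes are named exactly "ior"):
#   dump_backtraces $(pgrep -x ior)
```

If every rank shows the same frames, that suggests they are all waiting in
the same barrier; if one rank differs, that is probably the one to look at.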

Thanks for your support,
Cormac.

On Tue, Sep 22, 2020 at 2:35 PM Latham, Robert J. <robl at mcs.anl.gov> wrote:

> On Fri, 2020-09-18 at 13:31 -0500, Cormac Garvey wrote:
> > The IOR job hangs at the end if I export the Darshan LD_PRELOAD (it runs
> > correctly if I remove the LD_PRELOAD).
>
> If at all possible, can you attach a debugger to some of the hung IOR
> jobs and give us a backtrace?  It will be really helpful to know what
> operation these processes are stuck in.
>
> I don't know your exact environment but attaching to processes probably
> looks like
>
> - ask PBS where your jobs are running
> - ssh to a client node
> - get the pid of the ior process on that node
> - "gdb -ex where -ex quit -p 1234" (or whatever that process id is)
>
> ==rob
>
> >
> > When I kill the job, I get the following message in the PBS stderr
> > file.
> >
> > "darshan_library_warning: unable to write header to file
> > /share/home/hpcuser/darshan_logs/hpcuser_ior_id26_9-18-65156-2174583250033636718.darshan_partial."
> >
> > The IOR benchmark completes (i.e., I can see the I/O stats), but it does
> > not appear to exit correctly (it remains hung until I kill it).
> >
> >
> > Running a similar job with IOR+mpich or IOR+OpenMPI works fine with
> > darshan.
> >
> > Any ideas what I am missing?
> >
> > Thanks for your support.
> >
> > Regards,
> > Cormac.
>