[mpich-discuss] -mpe=mpitrace not producing any output, and Valgrind outputs on r3717

Dave Goodell goodell at mcs.anl.gov
Mon Jan 5 13:22:08 CST 2009


Hi François,

Can you boil this down to a small test program that we can use to  
reproduce this?  It's difficult to figure out what's happening without  
a test program.  A small test program should also help you rule out  
user error such as mismatched send/recv calls.

As far as the valgrind output is concerned, make sure that you are
configuring mpich2 with --enable-g=meminit,dbg (or some superset,
including --enable-g=all).  Adding "mem" and "handle" to the
--enable-g list can also help find certain problems.

The "Valgrind Integration" section of this page explains how to
specify valgrind include paths:
http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation

Thanks,
-Dave

On Jan 5, 2009, at 4:24 AM, François PELLEGRINI wrote:

>
> Hello and happy new year to all,
>
> Several topics are addressed in this e-mail.
>
> First, I have trouble using the "mpitrace" feature of MPE.
> I compile all of my source code with "mpicc -mpe=mpitrace",
> and link the objects with "-ltmpe", but absolutely no output
> is produced when I run the compiled program. This happens with
> both the official 1.0.8 and r3717 packages. What did I do
> wrong?
>
>
> When using the r3717 package compiled with Valgrind support
> and running my program on a 32-bit Linux system, I get messages
> such as these when freeing intermediate communicators:
>
> ==26273== Invalid read of size 4
> ==26273==    at 0x8100C25: MPIR_CommL_forget (dbginit.c:317)
> ==26273==    by 0x80C0FD2: MPIR_Comm_release (commutil.c:1073)
> ==26273==    by 0x80B96F0: PMPI_Comm_free (comm_free.c:117)
> ==26273==    by 0x8087673: MPI_Comm_free (trace_mpi_core.c:590)
> [...]
> ==26273==  Address 0x6c16910 is 1,160 bytes inside a block of size 67,740 alloc'd
> ==26273==    at 0x4022ADE: malloc (vg_replace_malloc.c:207)
> ==26273==    by 0x8101A80: MPIU_trmalloc (trmem.c:235)
> ==26273==    by 0x8101F98: MPIU_trcalloc (trmem.c:734)
> ==26273==    by 0x8102583: MPIU_Handle_obj_alloc_unsafe (handlemem.c:194)
> ==26273==    by 0x80C1EEC: MPIR_Comm_create (commutil.c:100)
> ==26273==    by 0x80C244E: MPIR_Comm_commit (commutil.c:300)
> ==26273==    by 0x80BC794: PMPI_Comm_split (comm_split.c:384)
> ==26273==    by 0x808725D: MPI_Comm_split (trace_mpi_core.c:718)
> [...]
>
> ==26271== Invalid read of size 4
> ==26271==    at 0x8100C25: MPIR_CommL_forget (dbginit.c:317)
> ==26271==    by 0x80C0FD2: MPIR_Comm_release (commutil.c:1073)
> ==26271==    by 0x80C0E9C: MPIR_Comm_release (commutil.c:1044)
> ==26271==    by 0x80B96F0: PMPI_Comm_free (comm_free.c:117)
> ==26271==    by 0x8087673: MPI_Comm_free (trace_mpi_core.c:590)
> [...]
> ==26271==  Address 0x6bcb128 is 632 bytes inside a block of size 67,740 alloc'd
> ==26271==    at 0x4022ADE: malloc (vg_replace_malloc.c:207)
> ==26271==    by 0x8101A80: MPIU_trmalloc (trmem.c:235)
> ==26271==    by 0x8101F98: MPIU_trcalloc (trmem.c:734)
> ==26271==    by 0x8102583: MPIU_Handle_obj_alloc_unsafe (handlemem.c:194)
> ==26271==    by 0x80C1EEC: MPIR_Comm_create (commutil.c:100)
> ==26271==    by 0x80C33DA: MPIR_Comm_copy (commutil.c:898)
> ==26271==    by 0x80B916D: PMPI_Comm_dup (comm_dup.c:148)
> ==26271==    by 0x808774B: MPI_Comm_dup (trace_mpi_core.c:570)
> [...]
>
> Also, I get many such messages at completion time:
>
> ==25572== Invalid read of size 4
> ==25572==    at 0x8100FF1: MPIU_trdump (trmem.c:581)
> ==25572==    by 0x80DBA9A: PMPI_Finalize (finalize.c:275)
> ==25572==    by 0x8085CEA: MPI_Finalize (trace_mpi_core.c:1265)
> ==25572==    by 0x804A900: main (dgmap.c:380)
> ==25572==  Address 0x6bea818 is 112 bytes inside a block of size 180 alloc'd
> ==25572==    at 0x4022ADE: malloc (vg_replace_malloc.c:207)
> ==25572==    by 0x8101A80: MPIU_trmalloc (trmem.c:235)
> ==25572==    by 0x8100D2B: MPIR_Sendq_remember (dbginit.c:244)
> ==25572==    by 0x80E313F: PMPI_Isend (isend.c:128)
> ==25572==    by 0x8084A8E: MPI_Isend (trace_mpi_core.c:1770)
> [...]
>
> ==25920== Invalid read of size 1
> ==25920==    at 0x410E430: _IO_default_xsputn (in /lib/i686/libc-2.8.so)
> ==25920==    by 0x40E6037: vfprintf (in /lib/i686/libc-2.8.so)
> ==25920==    by 0x40E702F: (within /lib/i686/libc-2.8.so)
> ==25920==    by 0x40E2795: vfprintf (in /lib/i686/libc-2.8.so)
> ==25920==    by 0x40EC27E: fprintf (in /lib/i686/libc-2.8.so)
> ==25920==    by 0x8101025: MPIU_trdump (trmem.c:578)
> ==25920==    by 0x80DBA9A: PMPI_Finalize (finalize.c:275)
> ==25920==    by 0x8085CEA: MPI_Finalize (trace_mpi_core.c:1265)
> ==25920==    by 0x804A900: main (dgmap.c:380)
> ==25920==  Address 0x6b50620 is 64 bytes inside a block of size 180 alloc'd
> ==25920==    at 0x4022ADE: malloc (vg_replace_malloc.c:207)
> ==25920==    by 0x8101A80: MPIU_trmalloc (trmem.c:235)
> ==25920==    by 0x8100D2B: MPIR_Sendq_remember (dbginit.c:244)
> ==25920==    by 0x80E313F: PMPI_Isend (isend.c:128)
> ==25920==    by 0x8084A8E: MPI_Isend (trace_mpi_core.c:1770)
> [...]
>
> ==26271== Invalid read of size 4
> ==26271==    at 0x8100C25: MPIR_CommL_forget (dbginit.c:317)
> ==26271==    by 0x80C0FD2: MPIR_Comm_release (commutil.c:1073)
> ==26271==    by 0x80C0E9C: MPIR_Comm_release (commutil.c:1044)
> ==26271==    by 0x8133689: MPID_Finalize (mpid_finalize.c:103)
> ==26271==    by 0x80DB950: PMPI_Finalize (finalize.c:205)
> ==26271==    by 0x8085CEA: MPI_Finalize (trace_mpi_core.c:1265)
> ==26271==    by 0x804A900: main (dgmap.c:380)
> ==26271==  Address 0x81b87f8 is not stack'd, malloc'd or (recently) free'd
>
> ==26273== Invalid read of size 1
> ==26273==    at 0x410E43C: _IO_default_xsputn (in /lib/i686/libc-2.8.so)
> ==26273==    by 0x40E6037: vfprintf (in /lib/i686/libc-2.8.so)
> ==26273==    by 0x40E702F: (within /lib/i686/libc-2.8.so)
> ==26273==    by 0x40E2795: vfprintf (in /lib/i686/libc-2.8.so)
> ==26273==    by 0x40EC27E: fprintf (in /lib/i686/libc-2.8.so)
> ==26273==    by 0x8101025: MPIU_trdump (trmem.c:578)
> ==26273==    by 0x80DBA9A: PMPI_Finalize (finalize.c:275)
> ==26273==    by 0x8085CEA: MPI_Finalize (trace_mpi_core.c:1265)
> ==26273==    by 0x804A900: main (dgmap.c:380)
> ==26273==  Address 0x45d6dda is 66 bytes inside a block of size 180 alloc'd
> ==26273==    at 0x4022ADE: malloc (vg_replace_malloc.c:207)
> ==26273==    by 0x8101A80: MPIU_trmalloc (trmem.c:235)
> ==26273==    by 0x8100D2B: MPIR_Sendq_remember (dbginit.c:244)
> ==26273==    by 0x80E313F: PMPI_Isend (isend.c:128)
> ==26273==    by 0x8084A8E: MPI_Isend (trace_mpi_core.c:1770)
> [...]
>
> I believe that all of my communications are matched, but
> these error messages puzzle me. By the way, is there a simple way
> to have MPICH display the list of unmatched communications when it
> releases a communicator?
>
>
> Finally, still in r3717, there seem to be many spurious
> Valgrind messages (false positives), such as:
>
> ==25570== Conditional jump or move depends on uninitialised value(s)
> ==25570==    at 0x8134E41: MPID_Irecv (mpid_irecv.c:83)
> ==25570==    by 0x80B054E: MPIC_Sendrecv (helper_fns.c:117)
> ==25570==    by 0x8091244: MPIR_Barrier (barrier.c:75)
> ==25570==    by 0x80912D7: MPIR_Barrier_or_coll_fn (barrier.c:242)
> ==25570==    by 0x8091D70: PMPI_Barrier (barrier.c:419)
> ==25570==    by 0x8088614: MPI_Barrier (trace_mpi_core.c:182)
> ==25570==    by 0x8059304: dgraphLoad (dgraph_io_load.c:88)
> ==25570==    by 0x804D830: SCOTCH_dgraphLoad (library_dgraph_io_load.c:100)
> ==25570==    by 0x804A455: main (dgmap.c:285)
>
> ==26271== Conditional jump or move depends on uninitialised value(s)
> ==26271==    at 0x8134E41: MPID_Irecv (mpid_irecv.c:83)
> ==26271==    by 0x80B054E: MPIC_Sendrecv (helper_fns.c:117)
> ==26271==    by 0x80AA36D: MPIR_Allgatherv (allgatherv.c:212)
> ==26271==    by 0x80ABB08: PMPI_Allgatherv (allgatherv.c:1001)
> ==26271==    by 0x8088A19: MPI_Allgatherv (trace_mpi_core.c:80)
> [...]
>
> More intriguing is this one:
>
> ==26271== Conditional jump or move depends on uninitialised value(s)
> ==26271==    at 0x8134E41: MPID_Irecv (mpid_irecv.c:83)
> ==26271==    by 0x80E17E4: PMPI_Irecv (irecv.c:125)
> ==26271==    by 0x8084CEE: MPI_Irecv (trace_mpi_core.c:1713)
> [...]
>
> Well, that's all for today.   :-)
>
> Thanks,
>
>
> 					f.p.



