[mpich-discuss] Questions about using valgrind

Fri Feb 18 09:57:10 CST 2011

On Feb 17, 2011, at 2:46 PM CST, Saurabh T wrote:

> I was able to build mpich2 with valgrind using --enable-g=dbg,meminit as suggested on:
> http://wiki.mcs.anl.gov/mpich2/index.php/Support_for_Debugging_Memory_Allocation.
> I have a few questions about this:
> 
> 1. How to actually run valgrind? Is it
> valgrind mpiexec exec
> or
> mpiexec valgrind exec?

You want the latter (mpiexec valgrind exec) unless you want to debug mpiexec (hydra) itself.  See Eugene's suggestion if you want the log for each process written to its own file instead of stderr.
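
For illustration, here is a minimal sketch of the second form, with one log file per process. It is not necessarily what Eugene suggested; it only assumes Valgrind's --log-file option, which expands %p to the process ID. The function name, the process count, and the program path are placeholders.

```shell
# Sketch: start every MPI rank under Valgrind, one log file per process.
# Usage: run_ranks_under_valgrind <nprocs> <program> [args...]
run_ranks_under_valgrind() {
    np=$1
    shift
    # Valgrind expands %p in --log-file to the PID of the process it runs.
    mpiexec -n "$np" valgrind --log-file=valgrind.%p.log "$@"
}
```

Calling "run_ranks_under_valgrind 4 ./my_app" should leave one valgrind.<pid>.log per rank in the working directory. If ranks on different nodes share that directory, PIDs could in principle collide, so treat the naming scheme as a starting point.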

> 2. Assuming it is the latter, I noticed that it doesn't report errors from mpich2 itself. It does if mpich2 wasn't built with the --enable-g option. Is this the only benefit enable-g provides? In other words, apart from hiding MPICH2 errors, what are the pros of using enable-g vs using valgrind on the default MPICH2 installation? As you can imagine, there are excellent upsides to maintaining just one MPICH2 installation, so if hiding MPICH2 errors is the only upside of separate installs, I'd look at writing a valgrind suppression file. (Please note that the goal is for valgrind to report errors in the application rather than in MPICH2 itself).

The "dbg" flag causes MPICH2 to be built with "-g" passed to the compiler to add debugging symbols to the resulting binary MPI libraries.  This makes Valgrind stack traces more useful.

The "meminit" flag primarily zeroes out all communication buffers (and similar) that will be passed to a system call in order to avoid warnings from Valgrind about "uninitialized data passed to readv(...)", etc.  These are warnings that are useless to both MPI users and MPICH2 developers and mainly come from compiler-generated padding bytes in structures that are sent over a TCP socket.

There is some small performance cost associated with this option, which has not been experimentally quantified.  However, my suspicion is that most applications running on MPICH2 with ch3:nemesis:tcp (the default) won't see any noticeable impact.  The only way to be sure, however, is to run your own applications and compare the performance yourself.

As for using a suppression file, that's a legitimate option, but not one that I prefer.  The suppression file approach requires that we identify every place that could elicit a warning, rather than merely identifying the sources of uninitialized data.  I find the latter easier.  Also, because of Valgrind's implementation details, it's usually better for it to have an accurate picture of the state of memory than to suppress warnings that stem from an inaccurate one.  A large number of suppressed warnings can still incur overhead inside Valgrind, at least as of the last time that I looked into it.
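
If you do go the suppression-file route, the rough workflow looks like the sketch below. It relies only on Valgrind's --gen-suppressions and --suppressions options; the function names and the suppression file name are made up for the example, and the entries Valgrind prints still have to be copied into the file and reviewed by hand.

```shell
# Step 1 (sketch): have Valgrind print a suppression entry after each error.
# Usage: print_suppressions <nprocs> <program> [args...]
print_suppressions() {
    np=$1
    shift
    mpiexec -n "$np" valgrind --gen-suppressions=all "$@"
}

# Step 2 (sketch): rerun with the collected entries loaded from a file.
# Usage: run_with_suppressions <supp-file> <nprocs> <program> [args...]
run_with_suppressions() {
    supp=$1
    np=$2
    shift 2
    mpiexec -n "$np" valgrind --suppressions="$supp" "$@"
}
```

For example, "print_suppressions 2 ./my_app" followed by "run_with_suppressions mpich2.supp 2 ./my_app", where mpich2.supp is a hypothetical file holding the entries you decided to keep.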

-Dave