[mpich-discuss] MPI_Comm_dup problem

Dave Goodell goodell at mcs.anl.gov
Wed Apr 14 15:28:08 CDT 2010


On Apr 14, 2010, at 3:09 PM, Ingo Bojak wrote:

> this is actually a problem I'm having with MVAPICH2 1.2, but I guess  
> it would be OK to ask about that here?

In some cases, yes, but in general you should send MVAPICH/MVAPICH2  
questions to their mailing list: http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss/

> I'm running a routine from a parallel library (not my code, no  
> access to the source), which in its call requires a pointer to a  
> communicator. Everything works fine for many calls, when suddenly  
> the program crashes with
>
> Fatal error in MPI_Comm_dup:
> Other MPI error, error stack:
> MPI_Comm_dup(216)..: MPI_Comm_dup(comm=0x84000005,  
> new_comm=0x7fffffffdfb0) failed
> MPIR_Comm_copy(655): Too many communicators
>
> which makes me wonder if the library is missing an MPI_Comm_free call  
> somewhere? If so, I don't see how I can fix that from the outside.
>
> If someone has a suggestion for a workaround, that would be highly  
> appreciated.

A missing MPI_Comm_free sounds like a likely explanation.  You could  
trace the calls with MPE or some similar profiling tool to figure out  
the MPI calls that the library is making.  But at the end of the day,  
if you can't change the library's behavior, you won't be able to fix  
the problem externally without some heroic efforts.

Perhaps you are using the library incorrectly?  If the library is  
going to the trouble of dup'ing the communicator (good practice for a  
parallel library), then its authors probably have some plan for  
freeing the communicator as well.

I would suggest getting in contact with whoever provided the parallel  
library.

-Dave


