[mpich-discuss] MPI_Comm_dup problem
Dave Goodell
goodell at mcs.anl.gov
Wed Apr 14 15:28:08 CDT 2010
On Apr 14, 2010, at 3:09 PM, Ingo Bojak wrote:
> this is actually a problem I'm having with MVAPICH2 1.2, but I guess
> it would be OK to ask about that here?
In some cases, yes, but in general you should send MVAPICH/MVAPICH2
questions to their mailing list: http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss/
> I'm running a routine from a parallel library (not my code, no
> access to the source), which in its call requires a pointer to a
> communicator. Everything works fine for many calls, when suddenly
> the program crashes with
>
> Fatal error in MPI_Comm_dup:
> Other MPI error, error stack:
> MPI_Comm_dup(216)..: MPI_Comm_dup(comm=0x84000005,
> new_comm=0x7fffffffdfb0) failed
> MPIR_Comm_copy(655): Too many communicators
>
> which makes me wonder if the library is missing an MPI_Comm_free call
> somewhere? If so, I don't see how I can fix that from the outside.
>
> If someone has a suggestion for a workaround, that would be highly
> appreciated.
A missing MPI_Comm_free sounds like a likely explanation. You could
trace the calls with MPE or some similar profiling tool to figure out
the MPI calls that the library is making. But at the end of the day,
if you can't change the library's behavior, you won't be able to fix
the problem externally without some heroic efforts.
Perhaps you are using the library incorrectly? If the library is
going to the trouble of dup'ing the communicator (good practice for a
parallel library), then they probably have some plan for freeing the
communicator as well.
I would suggest getting in contact with whoever provided the parallel
library.
-Dave