[mpich-discuss] MPICH2 + Boost.MPI Collective Problems

Stephan Hackstedt stephan.hackstedt at googlemail.com
Fri Aug 20 12:03:12 CDT 2010


Hi,

I used MPICH2-1.2.1p1 and MPICH2-1.3b1, both built from source.

Stephan


2010/8/20 Dave Goodell <goodell at mcs.anl.gov>

> Which exact version of MPICH2 are you using?  The binary package for
> Ubuntu?
>
> If plain C/C++ MPI programs are executing OK on your MPICH2 installation,
> then it sounds like a problem with your Boost.MPI.  You should contact the
> Boost.MPI folks for support.
>
> FWIW, the sort of error that you are seeing is usually a symptom of some
> other problem.  For example, another process in the job might be
> segfaulting, but the communication error in a surviving process propagates
> up to the process manager faster than the segfault error on the failing
> process.  Turning on core file creation ("ulimit -c unlimited", plus Google
> around if that doesn't work) sometimes helps for debugging this sort of
> thing.
>
> -Dave
>
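Dave's suggestion to turn on core files is normally done with `ulimit -c unlimited` in the shell that launches the job. A rough in-program counterpart, sketched below under the assumption of a POSIX system, raises the process's own core-file soft limit with `getrlimit`/`setrlimit(RLIMIT_CORE, ...)`. The helper name `raise_core_limit` is hypothetical, and an unprivileged process can only raise the soft limit up to the existing hard limit.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Hypothetical helper (not part of MPICH2 or Boost.MPI): raise the soft
// core-file size limit to the current hard limit so that a crashing rank
// can leave a core file behind. Returns true on success.
bool raise_core_limit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("getrlimit(RLIMIT_CORE)");
        return false;
    }
    // An unprivileged process may raise its soft limit only up to the hard limit.
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("setrlimit(RLIMIT_CORE)");
        return false;
    }
    return true;
}
```

Calling it at the top of `main()` on every rank means a rank that segfaults can leave a core file on its own node; where that file ends up depends on the node's core-dump settings.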
> On Aug 20, 2010, at 9:53 AM CDT, Stephan Hackstedt wrote:
>
> > Hi there,
> >
> > I have a big problem running MPICH2 programs that use the Boost.MPI
> > library. When I try to run programs on more than one node, collective
> > operations like communicator::barrier, broadcast, or even the environment
> > destructor (because it calls MPI_Finalize, which is collective) cause the
> > program to crash. Maybe it's a problem with Boost and the communication
> > system; I use ch3:nemesis.
> > My errors look like this:
> >
> > [1]terminate called after throwing an instance of
> > 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
> >
> > [1]  what():  MPI_Barrier: Other MPI error, error stack:
> > [1]PMPI_Barrier(362).................: MPI_Barrier(MPI_COMM_WORLD) failed
> > [1]MPIR_Barrier_impl(255)............:
> > [1]MPIR_Barrier_intra(79)............:
> > [1]MPIC_Sendrecv(186)................:
> > [1]MPIC_Wait(534)....................:
> > [1]MPIDI_CH3I_Progress(184)..........:
> > [1]MPID_nem_mpich2_blocking_recv(895):
> > [1]MPID_nem_tcp_connpoll(1746).......: Communication error with rank 0:
> >
> > I also tested this with the simple broadcast example from the Boost.MPI
> > tutorial: same errors.
> > But when I use the plain MPI equivalents without the Boost.MPI library,
> > such as MPI_Barrier, the program runs fine. I am using MPICH2 on Ubuntu
> > 10.04 x86 platforms.
> > Has anyone had problems like this, or does anyone know a fix?
> >
> > Regards,
> >
> > Stephan
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
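The `terminate called after throwing an instance of ...` line means the `boost::mpi::exception` thrown by the failed collective was never caught, so the C++ runtime aborted the process. Since `boost::mpi::exception` derives from `std::exception`, a small guard like the sketch below can catch it and report which rank and which operation failed, which makes the output from different ranks easier to line up. The helper `run_guarded` and its parameters are hypothetical; in a real Boost.MPI program the rank would come from `world.rank()` and the callable would wrap a call such as `world.barrier()`.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical helper: run one collective operation and report a failure
// instead of letting the exception escape main() and trigger std::terminate.
// Returns 0 on success and 1 if the callable threw a std::exception.
int run_guarded(int rank, const std::string& op_name,
                const std::function<void()>& op,
                std::ostream& out = std::cerr) {
    try {
        op();
        return 0;
    } catch (const std::exception& e) {
        // boost::mpi::exception derives from std::exception, so it is caught
        // here; its what() text carries the MPICH error stack.
        out << "[rank " << rank << "] " << op_name << " failed: " << e.what() << '\n';
        return 1;
    }
}
```

For example, `run_guarded(world.rank(), "barrier", [&] { world.barrier(); })`. This does not fix the underlying communication error; as Dave notes, the real fault may be on another rank. On a nonzero result, calling `boost::mpi::environment::abort(1)` (a wrapper around MPI_Abort, assuming your Boost version provides it) avoids then running the collective MPI_Finalize in the environment destructor.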