[mpich-discuss] Fatal error in PMPI_Bcast:

Fujun Liu liufujun07 at gmail.com
Fri May 27 11:24:38 CDT 2011


I use two hosts: one is "query", the other is "trigger".

(1) About the firewall

netlab at query:~$ sudo ufw status
Status: inactive

netlab at trigger:~$ sudo ufw status
Status: inactive

Both firewalls are turned off.

(2) About DNS

For query, /etc/hosts is as follows:

127.0.0.1       localhost
#127.0.1.1      query

xxx.xxx.xxx.42  trigger
xxx.xxx.xxx.43  query

For trigger, /etc/hosts is as follows:
127.0.0.1       localhost
#127.0.1.1      trigger

xxx.xxx.xxx.42  trigger
xxx.xxx.xxx.43  query

In fact, the two files are identical.
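
To double-check that each host resolves the other's name to the address above (and not to 127.0.1.1), I plan to run something like the following on both machines. This is just my own sanity check, using the hostnames from /etc/hosts:

netlab at query:~$ hostname                 # should print "query", matching the machinefile entry
netlab at query:~$ getent hosts trigger     # should print xxx.xxx.xxx.42  trigger
netlab at query:~$ getent hosts query       # should print xxx.xxx.xxx.43  query, not 127.0.1.1
netlab at query:~$ ping -c 2 trigger        # basic reachability check

netlab at trigger:~$ hostname
netlab at trigger:~$ getent hosts query
netlab at trigger:~$ ping -c 2 query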

(3) Version of MPICH2

mpich2-1.3.2p1, downloaded from
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
As you can see, it is listed there as the stable release.

(4) About configure

I did nothing special here; I just used the -prefix option. Do I need any
other configure options?
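
For reference, I will also confirm what each host's mpiexec points to and how the library was actually configured. As far as I remember, MPICH2 installs an mpich2version utility that echoes back the configure line (if my install does not have it, please ignore this part):

netlab at query:~$ which mpiexec        # make sure both hosts pick up the same MPICH2 install
netlab at query:~$ mpich2version        # should print the version, device, and configure options

netlab at trigger:~$ which mpiexec
netlab at trigger:~$ mpich2version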

Right now helloworld works fine on the two hosts, and cpi works fine on a
single host. The problem is probably that the two hosts can't communicate
with each other. Any suggestions?
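
One more check I can think of (my own idea, not something from the MPICH2 docs): as far as I understand, the nemesis TCP channel opens connections between the hosts on dynamically chosen ports, so I will test a plain TCP connection with netcat, assuming nc is installed. The port 12345 below is arbitrary:

netlab at trigger:~$ nc -l 12345                    # listen on trigger (some netcat variants need "nc -l -p 12345")

netlab at query:~$ echo hello | nc trigger 12345    # run on query; "hello" should appear on the trigger side

If that fails in one or both directions, the problem is in the network setup rather than in MPICH2 itself.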

Best Wishes,

On Fri, May 27, 2011 at 11:55 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:

> The problem looks like a networking issue, either a firewall or DNS (bad
> /etc/hosts file?) issue.  Are the firewalls disabled on these machines?  How
> are the hostnames configured?
>
> What version of MPICH2 is this?  What configure options did you use when
> you built MPICH2?
>
> -Dave
>
> On May 27, 2011, at 10:49 AM CDT, Fujun Liu wrote:
>
> > The cpi example also does not work. There is no error message; it just runs forever:
> >
> > xxxx at query:~/MPI$ mpiexec -n 2 -f machinefile /home/netlab/MPI/mpich2-build/examples/cpi
> > Process 1 of 2 is on query
> > Process 0 of 2 is on trigger
> >
> > I think my two hosts are still trying to communicate with each other. Any suggestions?
> >
> > Best wishes,
> >
> >
> > On Fri, May 27, 2011 at 9:42 AM, Dave Goodell <goodell at mcs.anl.gov>
> wrote:
> > Does the "examples/cpi" program from the MPICH2 build directory work correctly for you when you run it on multiple nodes?
> >
> > -Dave
> >
> > On May 26, 2011, at 5:49 PM CDT, Fujun Liu wrote:
> >
> > > Hi everyone,
> > >
> > > When I try one example from
> > > http://beige.ucs.indiana.edu/I590/node62.html, I get the error message
> > > shown below. In the MPI cluster there are two hosts. If I run the two
> > > processes on just one host, everything works fine. But if I run two
> > > processes across the two-host cluster, the following error occurs. I
> > > think the two hosts just can't send/receive messages to each other, but
> > > I don't know how to resolve this.
> > >
> > > Thanks in advance!
> > >
> > > xxxx at query:~/MPI$ mpiexec -n 2 -f machinefile ./GreetMaster
> > > Fatal error in PMPI_Bcast: Other MPI error, error stack:
> > > PMPI_Bcast(1430).......................: MPI_Bcast(buf=0x7fff13114cb0, count=8192, MPI_CHAR, root=0, MPI_COMM_WORLD) failed
> > > MPIR_Bcast_impl(1273)..................:
> > > MPIR_Bcast_intra(1107).................:
> > > MPIR_Bcast_binomial(143)...............:
> > > MPIC_Recv(110).........................:
> > > MPIC_Wait(540).........................:
> > > MPIDI_CH3I_Progress(353)...............:
> > > MPID_nem_mpich2_blocking_recv(905).....:
> > > MPID_nem_tcp_connpoll(1823)............:
> > > state_commrdy_handler(1665)............:
> > > MPID_nem_tcp_recv_handler(1559)........:
> > > MPID_nem_handle_pkt(587)...............:
> > > MPIDI_CH3_PktHandler_EagerSend(632)....: failure occurred while posting a receive for message data (MPIDI_CH3_PKT_EAGER_SEND)
> > > MPIDI_CH3U_Receive_data_unexpected(251): Out of memory (unable to allocate -1216907051 bytes)
> > > [mpiexec at query] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
> > > APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)



-- 
Fujun Liu
Department of Computer Science, University of Kentucky, 2010.08-
fujun.liu at uky.edu, (859)229-3659