[mpich-discuss] Fatal error when testing MPICH2 installation

Yaoyu HU huyaoyu1986 at gmail.com
Thu Mar 29 02:47:56 CDT 2012


Hello everyone,

I want to demonstrate MPICH2 on Linux to my friends in their lab. Since
none of their machines has Linux installed, I decided to run two Linux
guests in VirtualBox on two Windows 7 64-bit machines. The guest OS is
Ubuntu 10.04 LTS 64-bit. The two Linux virtual machines are called
'vnode01' and 'vnode02'. I downloaded the latest MPICH2 source and
compiled it (passing --prefix and --enable-shared to configure). After
setting up ssh and NFS, I tried the example program cpi.

Now cpi runs perfectly on vnode01 (the master node) alone or on vnode02
alone, but fails when I try to run it across both nodes. The terminal
output is shown below (all commands were issued from the terminal on
vnode01):

The 'machinefile' I used is:
vnode01:4
vnode02:2

When running cpi on only one node, I kept that node's line in the
machinefile and deleted the other.
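
For reference, a minimal Python 3 sketch of how a machinefile in this
host:count format would map ranks to hosts. It assumes the launcher
fills each host's slots in the listed order and wraps around when more
processes are requested than there are slots; parse_machinefile and
assign_ranks are hypothetical helper names, not part of MPICH2.

```python
def parse_machinefile(text):
    """Parse 'host:count' lines into (host, slots) pairs; count defaults to 1."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, count = line.partition(":")
        entries.append((host, int(count) if count else 1))
    return entries


def assign_ranks(entries, nprocs):
    """Assumed policy: fill each host's slots in order, wrapping if needed."""
    slots = [host for host, count in entries for _ in range(count)]
    return [slots[rank % len(slots)] for rank in range(nprocs)]


machinefile = "vnode01:4\nvnode02:2\n"
placement = assign_ranks(parse_machinefile(machinefile), 6)
for rank, host in enumerate(placement):
    print(f"Process {rank} of {len(placement)} is on {host}")
```

Under that assumption, ranks 0 to 3 land on vnode01 and ranks 4 and 5 on
vnode02, which agrees with the placement reported in the two-node run.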
==================== cpi runs on vnode01, begins =================
huyaoyu@vnode01:/mirror/mpich2-test$ mpiexec -n 4 -f machinefile ./cpi
Process 0 of 4 is on vnode01
Process 3 of 4 is on vnode01
Process 2 of 4 is on vnode01
Process 1 of 4 is on vnode01
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.001418
==================== cpi runs on vnode01, ends ====================


==================== cpi runs on vnode02, begins ==================
huyaoyu@vnode01:/mirror/mpich2-test$ mpiexec -n 2 -f machinefile ./cpi
Process 1 of 2 is on vnode02
Process 0 of 2 is on vnode02
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000496
==================== cpi runs on vnode02, ends ====================


================ cpi runs on vnode01 and vnode02, begins ===========
huyaoyu@vnode01:/mirror/mpich2-test$ mpiexec -n 6 -f machinefile ./cpi
Process 2 of 6 is on vnode01
Process 0 of 6 is on vnode01
Process 3 of 6 is on vnode01
Process 1 of 6 is on vnode01
Fatal error in PMPI_Reduce: Other MPI error, error stack:
PMPI_Reduce(1270)...............: MPI_Reduce(sbuf=0x7fff8304c028, rbuf=0x7fff8304c020, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1087)..........: 
MPIR_Reduce_intra(848)..........: 
MPIR_Reduce_impl(1087)..........: 
MPIR_Reduce_intra(895)..........: 
MPIR_Reduce_binomial(206).......: Failure during collective
MPIR_Reduce_intra(828)..........: 
MPIR_Reduce_impl(1087)..........: 
MPIR_Reduce_intra(895)..........: 
MPIR_Reduce_binomial(144).......: 
MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 4
Process 4 of 6 is on vnode02
Process 5 of 6 is on vnode02

[mpiexec@vnode01] control_cb (./pm/pmiserv/pmiserv_cb.c:321): assert (!closed) failed
[mpiexec@vnode01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@vnode01] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@vnode01] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
================ cpi runs on vnode01 and vnode02, ends ===============
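
To make clear what the failing MPI_Reduce call is combining, here is a
rough sketch in plain Python 3 (no MPI) of the kind of computation cpi
performs: each rank integrates 4/(1+x^2) over its share of the
intervals with the midpoint rule, and the partial results are summed at
rank 0. The interval count n = 10000 is an assumption, chosen because it
gives an error of about 8.3e-10, consistent with the single-node output
above; partial_pi is a hypothetical name.

```python
import math


def partial_pi(rank, nprocs, n):
    """Midpoint-rule partial sum of 4/(1+x^2) on [0, 1] for one rank.

    Rank r handles intervals r+1, r+1+nprocs, r+1+2*nprocs, ...
    """
    h = 1.0 / n
    total = 0.0
    for i in range(rank + 1, n + 1, nprocs):
        x = h * (i - 0.5)  # midpoint of interval i
        total += 4.0 / (1.0 + x * x)
    return h * total


n = 10000    # assumed number of intervals
nprocs = 6   # same process count as the failing run
partials = [partial_pi(rank, nprocs, n) for rank in range(nprocs)]

# Stand-in for MPI_Reduce(..., count=1, MPI_DOUBLE, MPI_SUM, root=0, ...):
# one double per rank, summed at the root.
pi_estimate = sum(partials)
error = abs(pi_estimate - math.pi)
print(f"pi is approximately {pi_estimate:.16f}, Error is {error:.16f}")
```

In the real program the sum crosses the network, so this step is where
ranks on vnode01 first need a result from ranks 4 and 5 on vnode02.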


What's wrong with the MPICH2 installation? Does it have anything to do
with running VirtualBox on Windows 7? Or have I missed something
important, like the firewall?
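
One way to test the firewall idea would be to check whether an
arbitrary TCP port on one node can be reached from the other. Below is
a minimal Python 3 sketch, assuming the inter-node MPI traffic uses TCP
ports other than the ssh port (an assumption about this setup, not
something the output confirms). The names can_connect and listen_once
and the port number 50000 are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def listen_once(port, host=""):
    """Accept a single TCP connection on the given port and return the peer address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, peer = srv.accept()
        conn.close()
        return peer
```

The intended use is to call listen_once(50000) on vnode02 and then
can_connect("vnode02", 50000) on vnode01, and to repeat the check in the
other direction. A False result while ssh still works would point
toward a firewall or VirtualBox network setting rather than MPICH2
itself.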

I Googled the problem and found one or two messages on the mailing
list, but I still could not solve it.

Thanks!

Yaoyu HU







