[mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

Pavan Balaji balaji at mcs.anl.gov
Sun Jun 26 20:12:26 CDT 2011


Hi,

I just tried an equivalent setup on Ubuntu with mpich2-1.4 and it seems 
to work well for me. However, it is still possible that there is a 
networking setup issue on your machines. Can you make sure that every 
machine in the system can ping every other machine? Checking only from 
the head node to the other nodes is not enough; the reverse direction 
is required as well.
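
If it helps, here is a minimal Python sketch of that all-pairs check. It 
assumes passwordless ssh between the machines and a Linux ping that 
accepts "-c 1". The helper names ping_commands and check_all are mine, 
for illustration only; the host names are the two from your output.

```python
import itertools
import subprocess

# Host names taken from the output in the original report; replace with your own.
HOSTS = ["cblxhome", "cblxnbmd2"]


def ping_commands(hosts):
    """Build one 'ssh SRC ping -c 1 DST' command per ordered pair with SRC != DST."""
    return [
        ["ssh", src, "ping", "-c", "1", dst]
        for src, dst in itertools.permutations(hosts, 2)
    ]


def check_all(hosts):
    """Run every command and return the (src, dst) pairs whose ping failed."""
    failed = []
    for cmd in ping_commands(hosts):
        # Assumption: ssh works without a password, so this does not block on a prompt.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failed.append((cmd[1], cmd[-1]))
    return failed
```

Calling check_all(HOSTS) from any machine that can ssh to all hosts 
returns the (source, destination) pairs that failed; an empty list means 
every machine reached every other one.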

  -- Pavan

On 06/26/2011 02:50 PM, cornelis.broeders at web.de wrote:
> Hello mpich community,
> Having worked with various open source parallelization tools (pvm,
> mpich1-2) on UNIX-like systems (SCO, AIX, Linux since kernel 0.99) and
> successfully coupled different computers, a few days ago I started
> working with mpich2-1.4 on Linux (Mandriva, Debian) to install the
> well-known code mcnpx on small clusters (at work and at home).
> Parallel calculation on different CPUs of one computer works fine, but
> coupling two machines has failed so far.
> After quite a lot of effort searching the internet for hints, the current
> situation is this: on my home cluster, with Mandriva 2010.1 on a 64-bit
> dual-core desktop and Mandriva 2010.2 on a 32-bit dual-core notebook,
> the basic test program "cpi" runs on both systems with a hosts file in
> which only the local computer is defined. Adding the second machine on
> either system results in very similar error messages. Here is the 32-bit
> notebook case:
> [inr487@cblxnbmd2 mpich2-1.4]$ mpiexec -bootstrap ssh examples/cpi
> Process 2 of 4 is on cblxnbmd2
> Process 3 of 4 is on cblxnbmd2
> Process 1 of 4 is on cblxhome
> Process 0 of 4 is on cblxhome
> Fatal error in PMPI_Reduce: Other MPI error, error stack:
> PMPI_Reduce(1270)...............: MPI_Reduce(sbuf=0x7fff36062ab8,
> rbuf=0x7fff36062ab0, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1087)..........:
> MPIR_Reduce_intra(848)..........:
> MPIR_Reduce_impl(1087)..........:
> MPIR_Reduce_intra(895)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> MPIR_Reduce_intra(828)..........:
> MPIR_Reduce_impl(1087)..........:
> MPIR_Reduce_intra(895)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 2
>
> The -bootstrap option on the command line is the latest of several
> suggestions I have tried.
> I tried various ssh configurations using the ~/.ssh/config mechanism; they
> work without a password from the command line on both systems.
> I would very much appreciate it if somebody could give me a hint on how to
> couple two computers on a private network 192.168.2.xxx using the current
> version, mpich2-1.4.
> Thank you very much in advance for any tip,
> C. Broeders
>
> --
> C.H.M. Broeders, http://www.cornelis-broeders.eu

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
