[mpich-discuss] fail to run hello world program with MPICH2-1.3a2 on multiple nodes

Manhui Wang wangm9 at cardiff.ac.uk
Fri Jul 9 10:38:24 CDT 2010


Thanks for your hint.

I asked the system administrator to change the file /etc/hosts. Now it
works fine on multiple nodes.

Before, we had something like:

127.0.0.1       localhost
127.0.0.2       node6-b


Now it has been changed to:

127.0.0.1       localhost
127.0.0.2       localhost

In the first case, node6-b resolved to a loopback address (127.0.0.2), so
the connections seem to have always looped back inside the node.
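
For anyone hitting the same error, here is a minimal sketch of a check for
this kind of entry. It is only an illustration and not part of MPICH2 or
hydra: the function name loopback_hostnames and the rule that names
containing "localhost" or "loopback" are harmless are assumptions made for
this example. It scans the text of a hosts file and reports any other host
name mapped to a loopback address (127.x.x.x or ::1), as node6-b was in the
first case.

```python
import ipaddress

# Assumption for this sketch: names containing these words are expected
# to map to a loopback address and are therefore not reported.
EXPECTED_LOOPBACK_WORDS = ("localhost", "loopback")


def loopback_hostnames(hosts_text):
    """Return {hostname: address} for names mapped to a loopback address."""
    suspicious = {}
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            # Not a plain IP address; skip the line.
            continue
        if not ip.is_loopback:
            continue
        for name in names:
            if not any(word in name for word in EXPECTED_LOOPBACK_WORDS):
                suspicious[name] = address
    return suspicious


# The two /etc/hosts variants quoted above.
BEFORE = """\
127.0.0.1       localhost
127.0.0.2       node6-b
"""
AFTER = """\
127.0.0.1       localhost
127.0.0.2       localhost
"""

if __name__ == "__main__":
    print(loopback_hostnames(BEFORE))  # {'node6-b': '127.0.0.2'}
    print(loopback_hostnames(AFTER))   # {}
```

On a real node, the same function could be fed the contents of /etc/hosts;
an empty result would suggest that no cluster host name is tied to a
loopback address.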


Thanks,
Manhui

Pavan Balaji wrote:
> 
> This looks like a connection problem between the two nodes. Is there a
> firewall on either of the nodes? If yes, can you disable it?
> 
>  -- Pavan
> 
> On 07/09/2010 06:26 AM, Manhui Wang wrote:
>> Hello,
>>
>> I have a problem running MPI jobs on multiple nodes with the newly
>> released MPICH2-1.3a2, in which hydra is the default process manager.
>>
>> I just tested the simplest hello world program. It works fine on any
>> single node, but fails on multiple nodes.
>>
>> node6-b:~/testprogram> cat hosts
>> node6-b
>> node6-b
>> node7-b
>> node7-b
>>
>> node6-b:~/testprogram> mpiexec -f hosts -n 4 ./hello
>> node6-b: hello world,length=7,my rank=0
>> node6-b: hello world,length=7,my rank=1
>> node7-b: hello world,length=7,my rank=3
>> node7-b: hello world,length=7,my rank=2
>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>> PMPI_Barrier(476).................: MPI_Barrier(MPI_COMM_WORLD) failed
>> MPIR_Barrier(82)..................:
>> MPIC_Sendrecv(161)................:
>> MPIC_Wait(519)....................:
>> MPIDI_CH3I_Progress(165)..........:
>> MPID_nem_mpich2_blocking_recv(880):
>> MPID_nem_tcp_connpoll(1714).......: Communication error
>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>> PMPI_Barrier(476).................: MPI_Barrier(MPI_COMM_WORLD) failed
>> MPIR_Barrier(82)..................:
>> MPIC_Sendrecv(161)................:
>> MPIC_Wait(519)....................:
>> MPIDI_CH3I_Progress(165)..........:
>> MPID_nem_mpich2_blocking_recv(895):
>> MPID_nem_tcp_connpoll(1714).......: Communication error
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>>
>> I built the MPICH2-1.3a2 library with Intel 11.1/069 compilers on a
>> 64-bit AMD machine:
>>
>> nice -n +18 ./configure --with-device=ch3:nemesis \
>>     --prefix=/mympich2-install FC=ifort --enable-f90 F90=ifort \
>>     --enable-f77 F77=ifort --enable-cc CC=icc --enable-cxx CXX=icc \
>>     2>&1 | tee configure.log
>>
>> nice -n +18 make 2>&1 | tee make.log
>>
>> nice -n +18 make install 2>&1 | tee install.log
>>
>>
>> Could you please point out what the problem is? I have attached the
>> source code.
>>
>> Thanks
>> Manhui
>>
>>
> 

-- 
-----------
Manhui  Wang
School of Chemistry, Cardiff University,
Main Building, Park Place,
Cardiff CF10 3AT, UK
Telephone: +44 (0)29208 76637

