[mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

Broeders, Cornelis cornelis.broeders at kit.edu
Tue Jun 28 13:43:44 CDT 2011


Hi,
I continued testing after sending the mail below and made further observations. From the very beginning the network was the main candidate for causing the problems, and I did various trials in that direction. By testing this afternoon with "hostname" instead of "./examples/cpi" as the task, and by adding the 32-bit MANDRIVA2010.1 notebook to the "cluster", I found that name resolution on all nodes, in all directions, is a very important issue. Communication between the 3 cluster machines now seems to be OK; I have all nodes in the /etc/hosts file of every machine involved. At the end a new problem appeared: on the 32-bit notebook the cpi example does not work properly. Here I am not sure about the cooperation of 32-bit and 64-bit systems. In the course of the evening I will do the tests with mixed 32-bit and 64-bit MANDRIVA machines at home.
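For reference, the kind of /etc/hosts entries meant here would look like the following sketch; the node names are taken from later in this thread, and the 192.168.2.xxx addresses are examples only:

```
# /etc/hosts (sketch; addresses are examples on the private 192.168.2.xxx net)
127.0.0.1     localhost
192.168.2.10  cblxhome
192.168.2.11  cblxnbmd2
```

The important point is that every node resolves every other node (and itself) to its real network address, not to 127.0.0.1.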
I have some experience with former mpich applications and cannot remember encountering such severe name resolution problems. Is this a "feature" of the HYDRA manager?
Best greetings,
CB

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov on behalf of Pavan Balaji
Sent: Tue 6/28/2011 7:54 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
 
Hi,

I'm running out of ideas on what might be wrong here. But it almost 
certainly looks like a network setup issue. Can you run mpiexec with the 
-verbose option and send us the output? I don't really expect to find 
much, but it might be worth looking into.

  -- Pavan

On 06/28/2011 06:37 AM, cornelis.broeders at web.de wrote:
> Hi,
> unfortunately my problem does not seem to be of general interest. Nevertheless, I continue reporting my findings.
> Today I continued working on the problem in the 2-machine Debian Lenny cluster environment. After several trials with different ssh configurations in ~/.ssh/config and in the machine file "hosts", I observed the following:
> - specifying "-n 4" and using 1 machine in the hosts file always works without errors
> - adding the second machine to hosts always crashes on 1 machine and sometimes on the other one
> - on the crashing machine, changing "-n 4" to "-n 10" after a crash suddenly produced correct output
> - the error with "-n 4" is currently not reproducible
> This behaviour is strange, and I will try similar tests this evening at home with my MANDRIVA cluster.
> Any hints concerning specific testing?
> CB
>
> --
> C.H.M. Broeders,       http://www.cornelis-broeders.eu
>
> -----Original Message-----
> From: cornelis.broeders at web.de
> Sent: Jun 27, 2011 11:22:05 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>
>> Hello,
>> in view of my interest in finding a solution to my problem, I am forwarding today's findings to the full group, hoping to get the problem solved.
>> Best greetings,
>> CB
>>
>> --
>> C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>
>> -----Original Message-----
>> From: cornelis.broeders at web.de
>> Sent: Jun 27, 2011 10:47:17 PM
>> To: "Pavan Balaji"<balaji at mcs.anl.gov>
>> Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>
>> Hi, one last time today,
>> as this problem is quite a strange one, I have spent the past 2 hours testing on my small MANDRIVA-based cluster at home.
>> After a few cleanups and restarts I now have the feeling that the MANDRIVA default sshd settings are not sufficient for the mpich2-1.4 mpiexec.hydra application. After creating my own sshd_config file, probably from Debian resources (copy attached), on both MANDRIVA machines, and considering the 32/64-bit issue via lib/lib64, I obtain the following results:
>> - hosts handled by HYDRA_HOST_FILE=/home/opt/mpich2-1.4/hosts
>> - hosts containing either 32bit or 64 bit or both machine names
>> - mpiexec -n 4 examples/cpi
>> gives
>> - proper results for BOTH single-machine applications
>> - the usual crash if both machines are requested
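As a sketch of the host-file setup described above (the node names are taken from this thread, and the file location is an example; the thread itself uses /home/opt/mpich2-1.4/hosts):

```shell
# Create a Hydra host file and point mpiexec at it via the
# HYDRA_HOST_FILE environment variable (sketch; names are examples).
cat > hosts <<'EOF'
cblxhome
cblxnbmd2
EOF
export HYDRA_HOST_FILE=$PWD/hosts
# mpiexec -n 4 examples/cpi   # would now use the hosts listed above
```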
>> Any idea how to proceed?
>> Best greetings,
>> CB
>>
>> --
>> C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>
>> -----Original Message-----
>> From: cornelis.broeders at web.de
>> Sent: Jun 27, 2011 6:51:37 PM
>> To: "Pavan Balaji"<balaji at mcs.anl.gov>
>> Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>
>>> Hi again,
>>> today I did some further analysis of this issue and made a strange observation.
>>> In another environment at work (FZK), I made a new, consistent install on 2 Debian Lenny machines. Installation was done in the same directories (/home/opt) and with the same configuration parameters. "box1" is reachable from outside via port 24; "box2" is only reachable on the FZK intranet.
>>> Running the test from "box2" to "box1", the cpi test works. However, the opposite direction, "box1" to "box2", fails with messages similar to those at home. At FZK only entries in ~/.ssh/config are made for proper ssh communication. ssh itself works in both directions without problems.
>>> Do you have any idea why "box2" to "box1" works while the opposite direction fails?
>>> A second question is whether there is an easy way to test another (older) mpiexec mechanism with mpich2-1.4?
>>> CB
>>>
>>> --
>>> C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>>
>>> -----Original Message-----
>>> From: cornelis.broeders at web.de
>>> Sent: Jun 27, 2011 11:16:46 AM
>>> To: "Pavan Balaji"<balaji at mcs.anl.gov>
>>> Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>>
>>>> Hi,
>>>> thank you very much for your fast reply.
>>>> As I indicated in my inquiry, I checked nearly everything I know of concerning the connection between the machines. I changed several things, but there is a stable passwordless ssh connection between both machines in both directions. Out of curiosity, I rechecked the ping as you suggested; it works fine on both machines.
>>>> What ssh setup do you use on your Ubuntu system?
>>>> ~/.ssh/config, /etc/ssh/sshd_config, ports, other changes to the standard settings?
>>>> Many thanks in advance,
>>>> C. Broeders
>>>> --
>>>> C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>>>
>>>> -----Original Message-----
>>>> From: "Pavan Balaji"<balaji at mcs.anl.gov>
>>>> Sent: Jun 27, 2011 3:12:26 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>>>
>>>>> Hi,
>>>>>
>>>>> I just tried an equivalent setup on Ubuntu with mpich2-1.4 and it seems
>>>>> to work well for me. However, it is still possible that there is a
>>>>> networking setup issue on your machines. Can you make sure you can ping
>>>>> from each machine in the system to every other machine? (Not just from
>>>>> the head node to the other nodes; the reverse is required as well.)
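The all-pairs check suggested here can be scripted. The sketch below tests name resolution for each entry in a node list using getent rather than ping (name lookup, rather than raw reachability, is what turned out to matter in this thread); the list contents are example names and you would run this on every machine, not only the head node:

```shell
# Sketch: check that every node name in a machine file resolves.
# Uses getent instead of ping; run it on each machine in turn.
printf '%s\n' localhost cblxhome cblxnbmd2 > nodelist   # example names
while read -r h; do
  if getent hosts "$h" > /dev/null; then
    echo "$h: resolves"
  else
    echo "$h: does NOT resolve -- add it to /etc/hosts or DNS"
  fi
done < nodelist
```

Any name that fails here is a candidate for the crashes described above.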
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> On 06/26/2011 02:50 PM, cornelis.broeders at web.de wrote:
>>>>>> Hello mpich community,
>>>>>> having survived the various open-source parallelization aids (pvm,
>>>>>> mpich1-2) on UNIX-like OSes (SCO, AIX, LINUX since kernel 0.99) with
>>>>>> successful couplings of different computers, a few days ago I started
>>>>>> working with mpich2-1.4 on LINUX (MANDRIVA, DEBIAN) to install the
>>>>>> well-known code mcnpx on small clusters (at work and at home).
>>>>>> Parallel calculation on different CPUs of one computer works fine, but
>>>>>> coupling two machines has failed so far.
>>>>>> After quite a lot of effort to find hints on the internet, the current
>>>>>> situation is that on my home cluster, with MANDRIVA2010.1 on a 64-bit
>>>>>> dual-core desktop system and MANDRIVA2010.2 on a 32-bit dual-core
>>>>>> notebook system, the basic test program "cpi" runs on both systems
>>>>>> using a hosts file with only the local computer defined. Trying on
>>>>>> either system to add the second one results in very similar error
>>>>>> messages. Here is the 32-bit notebook case:
>>>>>> [inr487 at cblxnbmd2 mpich2-1.4]$ mpiexec -bootstrap ssh examples/cpi
>>>>>> Process 2 of 4 is on cblxnbmd2
>>>>>> Process 3 of 4 is on cblxnbmd2
>>>>>> Process 1 of 4 is on cblxhome
>>>>>> Process 0 of 4 is on cblxhome
>>>>>> Fatal error in PMPI_Reduce: Other MPI error, error stack:
>>>>>> PMPI_Reduce(1270)...............: MPI_Reduce(sbuf=0x7fff36062ab8,
>>>>>> rbuf=0x7fff36062ab0, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>>>>> MPI_COMM_WORLD) failed
>>>>>> MPIR_Reduce_impl(1087)..........:
>>>>>> MPIR_Reduce_intra(848)..........:
>>>>>> MPIR_Reduce_impl(1087)..........:
>>>>>> MPIR_Reduce_intra(895)..........:
>>>>>> MPIR_Reduce_binomial(206).......: Failure during collective
>>>>>> MPIR_Reduce_intra(828)..........:
>>>>>> MPIR_Reduce_impl(1087)..........:
>>>>>> MPIR_Reduce_intra(895)..........:
>>>>>> MPIR_Reduce_binomial(144).......:
>>>>>> MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 2
>>>>>>
>>>>>> The bootstrap part of the command line is the last of several
>>>>>> suggested proposals I tried.
>>>>>> I tried various ssh configurations, working fine without a password on
>>>>>> the command line of both systems, using the ~/.ssh/config mechanism.
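A minimal sketch of the ~/.ssh/config mechanism mentioned here; the host names come from the error output above, while the user name and key file are assumptions for illustration:

```
# ~/.ssh/config (sketch; user and key file are example values)
Host cblxhome cblxnbmd2
    User inr487
    IdentityFile ~/.ssh/id_rsa
```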
>>>>>> I would greatly appreciate it if somebody could give me a hint on how
>>>>>> to couple two computers on a private net 192.168.2.xxx using the
>>>>>> current new version, mpich2-1.4.
>>>>>> Thank you very much in advance for any tip,
>>>>>> C. Broeders
>>>>>>
>>>>>> --
>>>>>> C.H.M. Broeders, http://www.cornelis-broeders.eu
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> mpich-discuss mailing list
>>>>>> mpich-discuss at mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


