[mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

cornelis.broeders at web.de cornelis.broeders at web.de
Mon Jun 27 16:22:05 CDT 2011


Hello,
since I am keen to find a solution to my problem, I am forwarding today's findings to the full list.
I hope the problem can be solved.
Best regards,
CB

--
C.H.M. Broeders,       http://www.cornelis-broeders.eu

-----Original Message-----
From: cornelis.broeders at web.de
Sent: Jun 27, 2011 10:47:17 PM
To: "Pavan Balaji" <balaji at mcs.anl.gov>
Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

Hi, one last time today,
as this problem is quite a strange one, I have spent the past 2 hours testing on my small MANDRIVA-based cluster at home.
After a few cleanups and restarts I now have the feeling that the MANDRIVA default sshd settings are not sufficient for the mpich2-1.4 mpiexec.hydra application. I created my own sshd_config file, probably from Debian resources (copy attached), on both MANDRIVA machines, allowing for the 32/64-bit difference (lib vs. lib64). With that I obtain the following results:
- the hosts file is set via HYDRA_HOST_FILE=/home/opt/mpich2-1.4/hosts
- the hosts file contains the 32-bit machine name, the 64-bit machine name, or both
- the command is: mpiexec -n 4 examples/cpi
This gives
- correct results when either machine is used on its own
- the usual crash when both machines are requested
Any idea how to proceed?
Best regards,
CB

--
C.H.M. Broeders,       http://www.cornelis-broeders.eu

-----Original Message-----
From: cornelis.broeders at web.de
Sent: Jun 27, 2011 6:51:37 PM
To: "Pavan Balaji" <balaji at mcs.anl.gov>
Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

>Hi again,
>today I did some further analysis of this issue and made a strange observation.
>In another environment at work (FZK), I did a fresh, consistent install on 2 Debian Lenny machines. Both installations use the same directory, /home/opt, and the same configuration parameters. "box1" is reachable from outside via port 24; "box2" is reachable only on the FZK intranet.
>Running the test from "box2" to "box1", the cpi test works. However, the opposite direction, "box1" to "box2", fails with messages similar to those at home. At FZK, only entries in ~/.ssh/config were made to set up ssh communication. ssh itself works in both directions without problems.
>Do you have any idea why "box2" to "box1" works?
>A second question is whether there is an easy way to use another (older) mpiexec mechanism with mpich2-1.4 for testing?
>CB
>
>--
>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>
>-----Original Message-----
>From: cornelis.broeders at web.de
>Sent: Jun 27, 2011 11:16:46 AM
>To: "Pavan Balaji" <balaji at mcs.anl.gov>
>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>
>>Hi,
>>thank you very much for your fast reply.
>>As I indicated in my inquiry, I checked nearly everything I could think of concerning the connection between the machines. I changed several things, but there is a stable passwordless ssh connection between both machines in both directions. Out of curiosity, I just rechecked ping as you suggested. It works fine on both machines.
>>What ssh setup do you use on your Ubuntu system?
>>~/.ssh/config, /etc/ssh/ssh_config, ports, other changes to the standard settings?
>>Many thanks in advance,
>>C. Broeders
>>--
>>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>
>>-----Original Message-----
>>From: "Pavan Balaji" <balaji at mcs.anl.gov>
>>Sent: Jun 27, 2011 3:12:26 AM
>>To: mpich-discuss at mcs.anl.gov
>>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>
>>>Hi,
>>>
>>>I just tried an equivalent setup on Ubuntu with mpich2-1.4 and it seems
>>>to work well for me. However, it is still possible that there is a
>>>networking setup issue on your machines. Can you make sure you can ping
>>>from each machine in the system to every other machine? (Not just from
>>>the head node to the other nodes; the reverse is required as well.)
>>>
>>> -- Pavan
>>>
>>>On 06/26/2011 02:50 PM, cornelis.broeders at web.de wrote:
>>>> Hello mpich community,
>>>> having survived the various open source parallelization aids (pvm,
>>>> mpich1-2) on UNIX-like OSes (SCO, AIX, LINUX since kernel 0.99) with
>>>> successful couplings of different computers, a few days ago I started
>>>> working with mpich2-1.4 on LINUX (MANDRIVA, DEBIAN) to install the
>>>> well-known code mcnpx on small clusters (at work and at home).
>>>> Parallel calculation on different CPUs of one computer works fine, but
>>>> coupling two machines has failed so far.
>>>> After quite a lot of effort searching the internet for hints, the
>>>> current situation is this: on my home cluster, with MANDRIVA2010.1 on a
>>>> 64-bit dual-core desktop and MANDRIVA2010.2 on a 32-bit dual-core
>>>> notebook, the basic test program "cpi" runs on both systems when the
>>>> hosts file lists only the local computer. Adding the second machine on
>>>> either system results in very similar error messages. Here is the
>>>> 32-bit notebook case:
>>>> [inr487 at cblxnbmd2 mpich2-1.4]$ mpiexec -bootstrap ssh examples/cpi
>>>> Process 2 of 4 is on cblxnbmd2
>>>> Process 3 of 4 is on cblxnbmd2
>>>> Process 1 of 4 is on cblxhome
>>>> Process 0 of 4 is on cblxhome
>>>> Fatal error in PMPI_Reduce: Other MPI error, error stack:
>>>> PMPI_Reduce(1270)...............: MPI_Reduce(sbuf=0x7fff36062ab8,
>>>> rbuf=0x7fff36062ab0, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>>> MPI_COMM_WORLD) failed
>>>> MPIR_Reduce_impl(1087)..........:
>>>> MPIR_Reduce_intra(848)..........:
>>>> MPIR_Reduce_impl(1087)..........:
>>>> MPIR_Reduce_intra(895)..........:
>>>> MPIR_Reduce_binomial(206).......: Failure during collective
>>>> MPIR_Reduce_intra(828)..........:
>>>> MPIR_Reduce_impl(1087)..........:
>>>> MPIR_Reduce_intra(895)..........:
>>>> MPIR_Reduce_binomial(144).......:
>>>> MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 2
>>>>
>>>> The bootstrap option on the command line is the latest of several
>>>> suggestions I have tried.
>>>> I tried various ssh configurations, all working fine without a password
>>>> on the command line of both systems, using the ~/.ssh/config mechanism.
>>>> I would very much appreciate it if somebody could give me a hint on how
>>>> to couple two computers on a private net 192.168.2.xxx using the
>>>> current version of mpich2-1.4.
>>>> Thank you very much in advance for any tips,
>>>> C. Broeders
>>>>
>>>> --
>>>> C.H.M. Broeders, http://www.cornelis-broeders.eu
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>>--
>>>Pavan Balaji
>>>http://www.mcs.anl.gov/~balaji
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sshd_config_used
Type: application/octet-stream
Size: 237 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110627/5c30e2d8/attachment.obj>
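
Pavan's suggestion earlier in the thread, to check reachability from each machine to every other machine and not only from the head node, could be scripted along these lines. This is a minimal sketch and not part of the original exchange: the function names are hypothetical, and it assumes passwordless ssh to every host (as described in the thread) and a Linux-style `ping -c` on each machine.

```python
import itertools
import subprocess


def ordered_pairs(hosts):
    """Return every (source, destination) pair, covering both directions."""
    return list(itertools.permutations(hosts, 2))


def ping_command(src, dst, count=1):
    """Build the argv that logs in to src via ssh and pings dst from there."""
    return ["ssh", src, "ping", "-c", str(count), dst]


def check_all(hosts, runner=subprocess.run):
    """Run one ping per ordered pair; return the pairs whose ping failed.

    `runner` is injectable so the logic can be exercised without a network.
    """
    failed = []
    for src, dst in ordered_pairs(hosts):
        result = runner(ping_command(src, dst), capture_output=True)
        if result.returncode != 0:
            failed.append((src, dst))
    return failed
```

For the home cluster in this thread one would call `check_all(["cblxhome", "cblxnbmd2"])`; an empty list means each machine could ping the other. Note that ping only exercises ICMP. The MPI processes talk to each other over TCP, so a firewall could still block them even when this check passes.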

