[mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

cornelis.broeders at web.de cornelis.broeders at web.de
Tue Jun 28 06:37:59 CDT 2011


Hi,
unfortunately, my problem does not seem to be of general interest. Nevertheless, I will continue reporting my findings.
Today I continued working on the problem in the 2-machine Debian Lenny cluster environment. After several trials with different ssh configurations in ~/.ssh/config and in the machine file hosts, I observed the following:
- specifying "-n 4" and using 1 machine in the hosts file always works without errors
- adding the second machine to hosts always crashes on one machine and sometimes on the other
- on the crashing machine, changing "-n 4" to "-n 10" after a crash suddenly produced correct output
- the error with "-n 4" is currently not reproducible
This behaviour is strange, and I will try similar tests this evening at home with my MANDRIVA cluster.
Any hints concerning specific testing?
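One way to make such tests systematic is sketched below, assuming Python 3 is available on the machine that launches the runs. It only prints command lines: one ssh check per ordered pair of hosts (so both directions are covered) and one mpiexec call per combination of hosts file and process count. The hosts file names hosts.local and hosts.both are hypothetical placeholders, box1 and box2 stand in for the real machine names, and the -f option is used here to select the hosts file instead of HYDRA_HOST_FILE.

```python
from itertools import permutations, product


def ssh_pair_checks(hosts):
    """Return one ssh command per ordered host pair (a -> b and b -> a).

    BatchMode=yes makes ssh fail instead of prompting for a password,
    so a broken passwordless setup shows up as an error.
    """
    return [
        f"ssh -o BatchMode=yes {src} ssh -o BatchMode=yes {dst} hostname"
        for src, dst in permutations(hosts, 2)
    ]


def mpiexec_matrix(hostfiles, counts, program="examples/cpi"):
    """Return one mpiexec command per (hosts file, process count) pair."""
    return [
        f"mpiexec -f {hostfile} -n {n} {program}"
        for hostfile, n in product(hostfiles, counts)
    ]


if __name__ == "__main__":
    # Placeholder names (assumptions, not taken from a real setup).
    hosts = ["box1", "box2"]
    hostfiles = ["hosts.local", "hosts.both"]
    counts = [4, 10]

    for command in ssh_pair_checks(hosts) + mpiexec_matrix(hostfiles, counts):
        print(command)
```

Running the printed commands in order, and noting which ones fail, should show whether the failure depends on the direction of the connection or on the process count.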
CB

--
C.H.M. Broeders,       http://www.cornelis-broeders.eu

-----Original Message-----
From: cornelis.broeders at web.de
Sent: Jun 27, 2011 11:22:05 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] FW: Re: mpich2-1.4 mpiexec(.hydra) problem connecting to second computer

>Hello,
>since I am keen to find a solution to my problem, I am forwarding today's findings to the whole group.
>I hope the problem can be solved.
>Best regards,
>CB
>
>--
>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>
>-----Original Message-----
>From: cornelis.broeders at web.de
>Sent: Jun 27, 2011 10:47:17 PM
>To: "Pavan Balaji" <balaji at mcs.anl.gov>
>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>
>Hi, one last time today,
>as this problem is quite a strange one, I have spent the past 2 hours testing on my small MANDRIVA-based cluster at home.
>After a few cleanups and restarts, I now have the feeling that the MANDRIVA default sshd settings are not sufficient for the mpich2-1.4 mpiexec.hydra application. After creating my own sshd_config file, probably from Debian resources (copy attached), on both MANDRIVA machines, and accounting for the 32/64-bit issue via lib/lib64, I obtain the following results:
>- hosts file selected via HYDRA_HOST_FILE=/home/opt/mpich2-1.4/hosts
>- hosts containing either the 32-bit machine name, the 64-bit machine name, or both
>- mpiexec -n 4 examples/cpi
>gives
>- proper results for BOTH single-machine runs
>- the usual crash if both machines are requested
>Any idea how to proceed?
>Best regards,
>CB
>
>--
>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>
>-----Original Message-----
>From: cornelis.broeders at web.de
>Sent: Jun 27, 2011 6:51:37 PM
>To: "Pavan Balaji" <balaji at mcs.anl.gov>
>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>
>>Hi again,
>>today I did some further analysis concerning this issue and I made a strange observation.
>>In another environment at work (FZK), I made a new, consistent install on 2 Debian Lenny machines. Installation was done in the same directory /home/opt and with the same configuration parameters. "box1" is reachable from outside via port 24; "box2" is reachable only on the FZK intranet.
>>Running the cpi test from "box2" to "box1" works. However, the opposite direction, "box1" to "box2", fails with messages similar to those at home. At FZK, only entries in ~/.ssh/config were made for proper ssh communication. ssh itself works in both directions without problems.
>>Do you have any idea why "box2" to "box1" works?
>>A second question is whether there is an easy way to test another (older) mpiexec mechanism with mpich2-1.4?
>>CB
>>
>>--
>>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>
>>-----Original Message-----
>>From: cornelis.broeders at web.de
>>Sent: Jun 27, 2011 11:16:46 AM
>>To: "Pavan Balaji" <balaji at mcs.anl.gov>
>>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>
>>>Hi,
>>>thank you very much for your fast reply.
>>>As I indicated in my inquiry, I checked nearly everything I know of concerning the connection between the machines. I changed several things, but there is a stable passwordless ssh connection between both machines in both directions. Out of curiosity, I just rechecked the ping as you suggested. It works fine on both machines.
>>>What ssh setup do you use on your ubuntu system?
>>>~/.ssh/config, /etc/ssh/ssh_config, ports, other changes to the standard settings?
>>>Many thanks in advance,
>>>C. Broeders
>>>--
>>>C.H.M. Broeders,       http://www.cornelis-broeders.eu
>>>
>>>-----Original Message-----
>>>From: "Pavan Balaji" <balaji at mcs.anl.gov>
>>>Sent: Jun 27, 2011 3:12:26 AM
>>>To: mpich-discuss at mcs.anl.gov
>>>Subject: Re: [mpich-discuss] mpich2-1.4 mpiexec(.hydra) problem connecting to second computer
>>>
>>>>Hi,
>>>>
>>>>I just tried an equivalent setup on Ubuntu with mpich2-1.4 and it seems
>>>>to work well for me. However, it is still possible that there is a
>>>>networking setup issue on your machines. Can you make sure you can ping
>>>>from each machine in the system to every other machine? (not just from
>>>>the head node to the other nodes; the reverse is required as well).
>>>>
>>>> -- Pavan
>>>>
>>>>On 06/26/2011 02:50 PM, cornelis.broeders at web.de wrote:
>>>>> Hello mpich community,
>>>>> having survived the various open source parallelization aids (pvm,
>>>>> mpich1-2) with UNIX-like OSes (SCO, AIX, LINUX since kernel 0.99) with
>>>>> successful couplings of different computers, a few days ago I started
>>>>> working with mpich2-1.4 on LINUX (MANDRIVA, DEBIAN) to install the
>>>>> well-known code mcnpx on small clusters (at work and at home).
>>>>> Parallel calculation on different CPUs on one computer works fine, but
>>>>> coupling of two machines fails up till now.
>>>>> After quite a lot of effort to find hints on the internet, the current
>>>>> situation is that on my home cluster, with MANDRIVA2010.1 on a desktop
>>>>> 64-bit dual-core system and MANDRIVA2010.2 on a notebook 32-bit dual-core
>>>>> system, the basic test program "cpi" runs on both systems using a hosts
>>>>> file with the local computer defined. Trying on both systems to add the
>>>>> second one results in very similar error messages. Here is the 32-bit
>>>>> notebook case:
>>>>> [inr487 at cblxnbmd2 mpich2-1.4]$ mpiexec -bootstrap ssh examples/cpi
>>>>> Process 2 of 4 is on cblxnbmd2
>>>>> Process 3 of 4 is on cblxnbmd2
>>>>> Process 1 of 4 is on cblxhome
>>>>> Process 0 of 4 is on cblxhome
>>>>> Fatal error in PMPI_Reduce: Other MPI error, error stack:
>>>>> PMPI_Reduce(1270)...............: MPI_Reduce(sbuf=0x7fff36062ab8,
>>>>> rbuf=0x7fff36062ab0, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>>>> MPI_COMM_WORLD) failed
>>>>> MPIR_Reduce_impl(1087)..........:
>>>>> MPIR_Reduce_intra(848)..........:
>>>>> MPIR_Reduce_impl(1087)..........:
>>>>> MPIR_Reduce_intra(895)..........:
>>>>> MPIR_Reduce_binomial(206).......: Failure during collective
>>>>> MPIR_Reduce_intra(828)..........:
>>>>> MPIR_Reduce_impl(1087)..........:
>>>>> MPIR_Reduce_intra(895)..........:
>>>>> MPIR_Reduce_binomial(144).......:
>>>>> MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 2
>>>>>
>>>>> The bootstrap part of the command line is the latest of several
>>>>> suggested proposals that I tried.
>>>>> I tried various ssh configurations, all of which work without a password
>>>>> on the command line of both systems, using the ~/.ssh/config mechanism.
>>>>> I would very much appreciate it if somebody could give me a hint on how to
>>>>> couple two computers on a private net 192.168.2.xxx using the current new
>>>>> version of mpich2-1.4.
>>>>> Thank you very much in advance for any tip,
>>>>> C. Broeders
>>>>>
>>>>> --
>>>>> C.H.M. Broeders, http://www.cornelis-broeders.eu
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>
>>>>--
>>>>Pavan Balaji
>>>>http://www.mcs.anl.gov/~balaji

