[mpich-discuss] unable to get mpich2 1.0.7 working

Kamaraju Kusumanchi kamaraju at gmail.com
Wed Sep 10 11:54:43 CDT 2008


I forgot to mention this in my mail: I did run mpdcheck and verified
that it is not a networking issue.

The problem seems to arise only when I try to boot a lot of nodes (say
6 nodes). But there are no problems if I mpdboot only 2 nodes (any 2
out of the 6). So it is not a networking issue AFAICT.
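
One way to narrow this down is to grow the host list one node at a time and
note the first count at which mpdboot fails. Below is a minimal sketch of
that idea, not a tested recipe. The helper name boot_plans is hypothetical,
the host names are the ones from my mpd.hosts, and the -n value simply
mirrors how I invoked mpdboot above (n equal to the number of lines in the
file). It only prints the commands; running them, and running mpdallexit
between attempts, is left to the operator.

```python
# Dry-run sketch: list the mpdboot invocations for host lists of growing size.
# Assumption: hosts are tried in file order, starting from 2 (known to work).

HOSTS = ["node2", "node4", "node5", "node8", "node9", "node10"]


def boot_plans(hosts, hosts_file="mpd.hosts"):
    """Return a list of (n, subset, command) for n = 2 .. len(hosts).

    subset is what would be written to hosts_file before running command.
    Only the -n and -f options already used in this thread appear here.
    """
    plans = []
    for n in range(2, len(hosts) + 1):
        subset = hosts[:n]
        command = ["mpdboot", "-n", str(n), "-f", hosts_file]
        plans.append((n, subset, command))
    return plans


if __name__ == "__main__":
    for n, subset, command in boot_plans(HOSTS):
        # Nothing is executed; this only shows what to try at each step.
        print(n, ",".join(subset), "->", " ".join(command))
```

The smallest n at which the handshake error appears would show whether the
failure depends on the number of mpds or on one particular combination of
nodes.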

raju

On Wed, Sep 10, 2008 at 12:32 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
> You may want to try "mpdcheck" to see if there is any network issue.
> mpdcheck is described in Appendix A of the installer's guide.
>
> A.Chan
> ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
>
>> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov>
>> wrote:
>> > We see those errors with asynchronous I/O and Fortran/C++ sometimes but
>> > don't know what causes them. For some reason MPD died later on in the tests,
>> > which caused the remaining tests to fail. I think your installation is ok.
>> > Try running your application.
>> >
>> > Rajeev
>>
>>
>> I tried to mpdboot a bunch of nodes, but that also gives an error.
>>
>> $cat mpd.hosts
>> node2
>> node4
>> node5
>> node8
>> node9
>> node10
>>
>> $mpdboot -n 6 -f mpd.hosts
>> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
>> handshake with mpd on node2; recvd output={}
>>
>> The node number in the above error message changes. Sometimes it is
>> node2, sometimes it is node5, etc.
>> But there is nothing wrong with these nodes as such. For example, if
>> my mpd.hosts contains just 2 lines
>>
>> node2
>> node5
>>
>> instead of 6 lines, then if I do
>>
>> $mpdboot -n 2 -f mpd.hosts
>>
>> then there are no errors.
>>
>> On each of these nodes (node2, node4, node5, node8, node9, node10) I
>> am able to start/stop mpd individually. The problem arises only
>> when I try to boot the nodes all together.
>>
>> Enabling the debugging messages does not tell me anything either:
>>
>> $cat mpd.hosts
>> node2
>> node4
>> node5
>> node8
>> node9
>> node10
>>
>> $mpdboot -n 6 -f mpd.hosts -v -d
>> debug: starting
>> running mpdallexit on ank.mae.cornell.edu
>> LAUNCHED mpd on ank.mae.cornell.edu  via
>> debug: launch cmd=
>> /home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>   --ncpus=1 -e -d
>> debug: mpd on ank.mae.cornell.edu  on port 46161
>> RUNNING: mpd on ank.mae.cornell.edu
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 46161,
>> 'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '',
>> 'ifhn': ''}
>> LAUNCHED mpd on node2  via  ank.mae.cornell.edu
>> debug: launch cmd= ssh -x -n -q node2
>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> LAUNCHED mpd on node4  via  ank.mae.cornell.edu
>> debug: launch cmd= ssh -x -n -q node4
>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> LAUNCHED mpd on node5  via  ank.mae.cornell.edu
>> debug: launch cmd= ssh -x -n -q node5
>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> LAUNCHED mpd on node8  via  ank.mae.cornell.edu
>> debug: launch cmd= ssh -x -n -q node8
>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> debug: mpd on node2  on port 60704
>> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
>> handshake with mpd on node2; recvd output={}
>>
>>
>> Any other suggestions?
>>
>> thanks
>> raju
>
>



