[mpich-discuss] unable to get mpich2 1.0.7 working

Kamaraju Kusumanchi kamaraju at gmail.com
Wed Sep 10 11:18:49 CDT 2008


On Wed, Sep 10, 2008 at 11:55 AM, Kamaraju Kusumanchi
<kamaraju at gmail.com> wrote:
> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> We see those errors with asynchronous I/O and Fortran/C++ sometimes but
>> don't know what causes them. For some reason MPD died later on in the tests,
>> which caused the remaining tests to fail. I think your installation is ok.
>> Try running your application.
>>
>> Rajeev
>
>
> I tried to mpdboot a bunch of nodes, but it also gives an error.
>
> $cat mpd.hosts
> node2
> node4
> node5
> node8
> node9
> node10
>
> $mpdboot -n 6 -f mpd.hosts
> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
> handshake with mpd on node2; recvd output={}
>
> The node number in the above error message changes. Sometimes it is
> node2, sometimes it is node5, etc.
> But there is nothing wrong with these nodes as such. For example, if my
> mpd.hosts contains just 2 lines
>
> node2
> node5
>
> instead of 6 lines, then if I do
>
> $mpdboot -n 2 -f mpd.hosts
>
> then there are no errors.
>
> On each of these nodes (node2, node4, node5, node8, node9, node10) I
> am able to start/stop mpd individually. The problem arises only
> when I try to boot the nodes all together.
>

Is there any way to disable the asynchronous I/O then?

I also noticed that the problem happens only on a cluster with 64-bit
processors (x86_64). I tried compiling an older version of mpich2
(1.0.5p4) on this 64-bit machine, and it exhibits the same behavior
while booting the nodes.

The problem does not seem to be present when I compile mpich2
(1.0.5p4) on a cluster with 32-bit processors.
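
Since 2 nodes boot fine and 6 do not, one way to narrow this down might be
to boot progressively larger prefixes of mpd.hosts and note the size at
which the handshake first fails. Below is a minimal sketch of that idea, not
a tested recipe. It assumes mpdboot and mpdallexit are on the PATH and that
mpdboot returns a non-zero exit status on failure. The names HOSTS, DRY_RUN,
make_prefix, try_boot and bisect_hosts are made up for illustration and are
not part of mpich2. Because the failing node changes from run to run, a
single pass may not be conclusive.

```shell
#!/bin/sh
# Sketch: find the ring size at which mpdboot first fails.
# HOSTS and DRY_RUN are illustrative names, not mpich2 settings.
HOSTS=${HOSTS:-mpd.hosts}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = really run them

# Write the first COUNT hosts to $HOSTS.COUNT and print that file name.
make_prefix() {
    count=$1
    head -n "$count" "$HOSTS" > "$HOSTS.$count"
    echo "$HOSTS.$count"
}

# Boot the first COUNT hosts, then shut the ring down again.
# As in the commands above, -n is set to the number of lines in the file.
try_boot() {
    count=$1
    file=$(make_prefix "$count")
    if [ "$DRY_RUN" = 1 ]; then
        echo "mpdboot -n $count -f $file"
        return 0
    fi
    # Assumption: mpdboot exits non-zero when the handshake fails.
    mpdboot -n "$count" -f "$file" && mpdallexit
}

# Try ring sizes 2, 3, ... up to the number of hosts in $HOSTS.
bisect_hosts() {
    total=$(grep -c . "$HOSTS")   # assumes no blank lines in the file
    size=2
    while [ "$size" -le "$total" ]; do
        if ! try_boot "$size"; then
            echo "first failure with $size hosts"
            return 1
        fi
        size=$((size + 1))
    done
    echo "no failure up to $total hosts"
}
```

With the default DRY_RUN=1, calling bisect_hosts in the directory that holds
mpd.hosts only writes the prefix files and prints the mpdboot commands it
would run. Setting DRY_RUN=0 actually boots and tears down each ring.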

thanks
raju



