[MPICH] EOF from console

Matthew Fowler tjue1 at central.susx.ac.uk
Thu Jun 15 08:25:59 CDT 2006


Hello Ralph, thank you for the suggestions.

The problem arises when any board is added to an already existing cluster. 
Actually, saying five was too specific. After experimenting for a while 
longer, I have found that sometimes the same output appears when adding a 
fourth board to a cluster of three. I also thought that maybe the problem 
was with the fifth board, but as it happens the problem can be reproduced 
when board05 is the first to enter the ring, then 4, 3, 2, 1 (eof from console).

Just to make sure that my setup is correct would you please look over 
the output from mpdcheck -v below:

board01:

# mpdcheck -v
mpdcheck -v
obtaining hostname via gethostname and getfqdn
gethostname gives  board01
getfqdn gives  board01
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure 
other than 127.0.0.1
gethostbyname_ex:  ('board01', [], ['10.9.10.1'])
gethostbyname_ex:  ('board01', [], ['10.9.10.1'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
#

board05:

# mpdcheck -v
mpdcheck -v
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
obtaining hostname via gethostname and getfqdn
gethostname gives  board05
getfqdn gives  board05
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure 
other than 127.0.0.1
gethostbyname_ex:  ('board05', [], ['10.9.10.5'])
gethostbyname_ex:  ('board05', [], ['10.9.10.5'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
#
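
In case it helps anyone compare boards, the name-resolution part of the 
output above can be approximated by hand with plain Python socket calls. 
This is only a rough sketch, not mpdcheck's actual code: the function name 
resolve_report and the wording of the "problems" entries are made up for 
illustration, and it is written for a current Python 3 rather than the 
python2.3 on the boards.

```python
import socket


def resolve_report(hostname=None):
    """Gather the kind of name-resolution facts that mpdcheck -v prints.

    Hypothetical helper for illustration; not part of MPICH or mpd.
    """
    name = hostname or socket.gethostname()
    report = {
        "hostname": name,
        "fqdn": socket.getfqdn(name),
        "addrs": [],
        "problems": [],
    }

    # A ring member should not identify itself as localhost.
    if name.startswith("localhost"):
        report["problems"].append("hostname is localhost")

    # Resolve the name to its IPv4 addresses.
    try:
        _, _, addrs = socket.gethostbyname_ex(name)
    except OSError as exc:
        report["problems"].append("cannot resolve %s: %s" % (name, exc))
        addrs = []
    report["addrs"] = list(addrs)

    # Other hosts cannot reach a loopback address.
    for addr in report["addrs"]:
        if addr.startswith("127."):
            report["problems"].append("%s resolves to loopback %s" % (name, addr))

    return report


if __name__ == "__main__":
    for key, value in resolve_report().items():
        print(key, "=", value)
```

Running it on each board and comparing the results side by side should 
show whether any board resolves its own name differently from the others.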

Best Regards

Matthew



Ralph M. Butler wrote:

> Hi Matthew:
>
> mpdtrace is one of the mpd "console" programs.  It does not wait forever
> for an mpd to respond.  This error msg is telling us that there was no
> response from the mpd.  mpdtrace does not know why, of course.  So, we
> have to assume that the mpd was (became) inaccessible for some reason.
>
> It is not clear if the problem is adding a fifth or the particular fifth
> being added.  I will assume that the problem always arises adding in a host
> named board05.  Can you reproduce the problem by starting an mpd on
> board01 and then immediately adding in board05?  If this reproduces the
> problem, it implies that there is some issue with board05 itself. In
> that case, I would suggest using mpdcheck on board05 as discussed in the
> Troubleshooting portion of the install manual.
>
> --ralph
>
> On Mon, 12 Jun 2006, tjue1 at sussex.ac.uk wrote:
>
>> Date: Mon, 12 Jun 2006 16:49:08 +0100
>> From: tjue1 at sussex.ac.uk
>> To: beowulf at beowulf.org, mpich-discuss at mcs.anl.gov
>> Subject: [MPICH] EOF from console
>>
>> Hi list
>>
>> I'm doing some experiments on an embedded platform and am building a
>> Beowulf cluster from them. I have an unusual setup as the boards have
>> limited memory and I am using MPICH 2 (latest). The setup is a bit
>> strange as Python is accessible to the boards via an NFS mount.
>>
>> I can start an MPD daemon on a single board with no problems. I can
>> also add a further three to the ring with no problems. Adding a fifth
>> causes an error (see below).
>>
>> (I'm adding nodes manually rather than using mpdboot. When I get it
>> working manually I will get mpdboot working.)
>>
>> Here's the problem:
>>
>> (from first board)
>>
>> mpdtrace -l
>> board01_2048 (10.9.10.1)
>>
>> I then add others into the ring as:
>>
>> mpd -h board01 -p 2048 &
>>
>> mpdtrace
>> board02
>> board01
>>
>> I can continue to add boards until I try and add a 5th. When adding a
>> 5th using the above method I get:
>>
>> mpdtrace &
>> mpdtrace (mpdtrace 57): got eof on console
>> Jul 22 08:33:49 board05 python2.3: mpdtrace (mpdtrace 57): got eof on
>> console
>>
>> I have to admit I'm baffled. Can anyone shed some light on this? If more
>> specific information will help please tell me.
>>
>> Regards
>>
>> Matthew
>



