[mpich-discuss] Need some help getting mpich to work

Pavan Balaji balaji at mcs.anl.gov
Wed Dec 2 16:41:38 CST 2009


Are you using mpich2-1.2.1? This will show the build information:

% mpiexec.hydra -info

Can you try running Hydra in verbose mode to see what's going on?

% mpiexec.hydra -verbose -f /home/su/mpd.hosts -n 124 hostname
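
Once a launch does complete, the per-host process counts can be checked mechanically rather than by eye. The following is only a sketch: check_hosts is a hypothetical helper (not part of MPICH2 or Hydra), and the expected shape of 31 hosts with 4 processes each is taken from the sanity check suggested earlier in this thread.

```shell
#!/bin/sh
# check_hosts: hypothetical helper, not shipped with MPICH2.
# Reads one hostname per line on stdin, counts processes per host,
# and verifies the expected number of hosts and processes per host.
#   $1 = expected number of distinct hosts
#   $2 = expected number of processes on each host
check_hosts() {
    sort | uniq -c | awk -v hosts="$1" -v per="$2" '
        {
            n++
            # $1 is the count that uniq -c prepends to each hostname
            if ($1 != per) { bad++; print "unexpected count: " $0 }
        }
        END {
            if (n != hosts) { bad++; print "expected " hosts " hosts, saw " n+0 }
            if (bad) exit 1
            print "ok: " n " hosts x " per " processes"
        }'
}
```

A possible invocation, using the same hostfile and process count as above (run without -verbose so that only the hostnames reach the pipe):

% mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname | check_hosts 31 4

The function returns non-zero if any host is missing or has the wrong number of processes, so it can be used in a script. It does not help with a launch that hangs; that still needs the verbose output.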

 -- Pavan

On 12/02/2009 04:39 PM, Hung-Hsun Su wrote:
> We are using a basic NFS file system. It's not the MPI-I/O that I am
> worried about; it's the collectives. I have a few MPI programs, and
> they all die whenever any sort of collective is used (a few report an
> error about overlapping memory space).  Any program that doesn't use
> collectives works just fine.
> 
> I've tried the "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname |
> sort | uniq -c | sort -n" command and it just hangs, so I presume the
> issue is something deeper.
> 
> I ran "cluster-fork mpdcheck -v" and the output seems to be correct.
> Here is part of the output I am getting:
> [su at alpha ~]$ cluster-fork mpdcheck -v
> compute-0-0:
> obtaining hostname via gethostname and getfqdn
> gethostname gives  compute-0-0.local
> getfqdn gives  compute-0-0.local
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure
> other than 127.0.0.1
> gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'],
> ['10.255.255.254'])
> gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'],
> ['10.255.255.254'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> compute-0-1:
> obtaining hostname via gethostname and getfqdn
> gethostname gives  compute-0-1.local
> getfqdn gives  compute-0-1.local
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure
> other than 127.0.0.1
> gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'],
> ['10.255.255.253'])
> gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'],
> ['10.255.255.253'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> ...
> 
> I've also tried older versions of gcc (4.2.4 and 4.3.4, with
> LD_LIBRARY_PATH=gcc/lib64 and PATH=gcc/bin added), and they all do the
> same thing, so at this point I am not really sure what the issue is. Do
> you think this could be a hardware issue rather than a software one? Is
> there any way to test this? (I've tried mpdringtest and mpdcheck, and
> they produce no errors.)
> 
> Hung-Hsun
>> Your new "make testing" run is having problems with the MPI-I/O
>> tests.  What sort of file system are you running this on?
>>
>> As for the mpdboot problems, I'm not sure what is happening.  I would
>> attempt to use the mpdcheck tool described in Appendix A of the
>> Installer's Guide [1] to diagnose the problem.
>>
>> You might also be having trouble because you are running part of your
>> jobs on the cluster head node.  MPD's mpiexec will attempt to run 1
>> process locally first unless passed the "-1" option.  Unfortunately
>> there is no easy way to pass that option to the "make testing" process.
>>
>> You can also try using hydra instead of MPD.  You should just be able
>> to run "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname | sort |
>> uniq -c | sort -n" to sanity check that it works (you should get 31
>> lines, each starting with 4).  If it does work for you, you can
>> rebuild MPICH2 with "--enable-pm=hydra,mpd" to make hydra the default
>> mpiexec.  Hydra will use the hostfile specified on the command line
>> but it will also look at the file specified by the $HYDRA_HOST_FILE
>> environment variable.  See [2] for more information on using hydra.
>>
>> -Dave
>>
>> [1]
>> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.2-installguide.pdf
>>
>> [2]
>> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>>
>> On Nov 24, 2009, at 12:50 PM, Hung-Hsun Su wrote:
>>
>>> Unfortunately, the latest release did not solve the issue. It
>>> actually introduces a new bug. After installation, mpdboot cannot set
>>> up the environment correctly. It freezes when I make the following call.
>>>
>>> [su at alpha ~]$ which mpdboot
>>> /home/su/software/mpich2-1.2.1/bin/mpdboot
>>> [su at alpha ~]$ mpdboot -n 32 --ncpus=1 -f /home/su/mpd.hosts
>>>
>>> After I killed the process, I found that only 4 of the 31 compute
>>> nodes had been set up:
>>>
>>> [su at alpha ~]$ mpdtrace
>>> alpha
>>> compute-0-3
>>> compute-0-1
>>> compute-0-0
>>> compute-0-2
>>>
>>> I then tried setting up using the 1.2 version and it works fine.
>>>
>>> [su at alpha ~]$ /home/su/software/mpich2-1.2/bin/mpdboot -n 32
>>> --ncpus=1 -f /home/su/mpd.hosts
>>> [su at alpha ~]$ mpdtrace
>>> alpha
>>> compute-0-3
>>> compute-0-11
>>> compute-0-10
>>> compute-0-9
>>> compute-0-8
>>> compute-0-1
>>> compute-0-15
>>> compute-0-14
>>> compute-0-13
>>> compute-0-12
>>> compute-0-0
>>> compute-0-19
>>> compute-0-27
>>> compute-0-26
>>> compute-0-25
>>> compute-0-24
>>> compute-0-18
>>> compute-0-30
>>> compute-0-29
>>> compute-0-28
>>> compute-0-17
>>> compute-0-16
>>> compute-0-2
>>> compute-0-7
>>> compute-0-6
>>> compute-0-5
>>> compute-0-4
>>> compute-0-23
>>> compute-0-22
>>> compute-0-21
>>> compute-0-20
>>>
>>> I then ran make testing and got even more errors. Does anyone have an
>>> idea of what is going on?
>>>
>>> Hung-Hsun
>>>
>>> PS. I've attached the output files from various steps
>>> c.txt - configuration
>>> m.txt - make
>>> mi.txt - make install
>>> mpd.hosts - my machine file
>>> mtest.txt - make testing
>>> summary.xml - output in test/mpi directory from make testing
>>>
>>>> Can you try the latest release 1.2.1?
>>>>
>>>> Rajeev
>>>>
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Hung-Hsun Su
>>>>> Sent: Monday, November 23, 2009 11:16 AM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: [mpich-discuss] Need some help getting mpich to work
>>>>>
>>>>> Hi,
>>>>>
>>>>> I was wondering if anyone can help me figure out why my MPICH2
>>>>> installation isn't working correctly. I've downloaded the v1.2
>>>>> version, configured using "configure
>>>>> --prefix=/home/su/software/mpich2-1.2", make and make install and
>>>>> everything seemed fine (I've attached the three text outputs from
>>>>> configure, make and make install, which show no errors).  I then
>>>>> tried to see if my installation is working correctly by running the
>>>>> mpich-test suite (result given in summary.xml) and some of the
>>>>> tests failed (collective).  Does anyone know what might be the
>>>>> cause of my problem? Thanks.
>>>>>
>>>>> System spec:
>>>>> 32-node quad-core Xeon cluster
>>>>> Linux version 2.6.9-55.0.2.ELsmp (mockbuild at builder6.centos.org)
>>>>> (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26
>>>>> 14:14:47 EDT 2007
>>>>>
>>>>> Hung-Hsun
>>>>>
>>>>> -- 
>>>>>
>>>>> --------------------------------------------------------------
>>>>> ---------------------------------------------
>>>>> Sincerely,
>>>>> Hung-Hsun Su
>>>>> Ph.D. Student, UPC Group Leader, Research Assistant, Teaching
>>>>> Assistant
>>>>> High-performance Computing and Simulation (HCS) Research Laboratory
>>>>> Dept. of Electrical and Computer Engineering , University of Florida,
>>>>> Gainesville, FL 32611-6200
>>>>> Email: su at hcs.ufl.edu, hunghsun at ufl.edu
>>>>> --------------------------------------------------------------
>>>>> ----------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>
>>>
>>>
>>>
>>>
>>
>>
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

