[mpich-discuss] Need some help getting mpich to work

Hung-Hsun Su su at hcs.ufl.edu
Wed Dec 2 19:07:02 CST 2009


> Are you using mpich2-1.2.1 ?
>   
Yes, I am using 1.2.1
[su at alpha ~]$ which mpiexec.hydra
/home/su/software/mpich2-1.2.1/bin/mpiexec.hydra

> % mpiexec.hydra -info
>   
[su at alpha ~]$ mpiexec.hydra -info
HYDRA build details:
    Version:                                 1.2.1
    CC:                                      gcc
    CXX:                                     c++
    F77:                                     g77
    F90:                                     gfortran
    Configure options:                        
'--prefix=/home/su/software/mpich2-1.2.1' 
'--with-atomic-primitives=auto_allow_emulation' 'CC=gcc' 'CFLAGS= -O2' 
'LDFLAGS=' 'LIBS= -lpthread   -lrt   ' 'CPPFLAGS= 
-I/home/su/source/mpich2/mpich2-1.2.1/src/openpa/src 
-I/home/su/source/mpich2/mpich2-1.2.1/src/openpa/src -DUSE_PROCESS_LOCKS'
    Process Manager:                         pmi
    Boot-strap servers available:            ssh rsh fork slurm
    Communication sub-systems available:     none
    Binding libraries available:             plpa
    Checkpointing libraries available:       none

> Can you try running Hydra in verbose mode to see what's going on?
>
> % mpiexec.hydra -verbose -f /home/su/mpd.hosts -n 124 hostname
>   
I've attached the output file from executing the above command. Note 
that I had to kill the process because it just freezes. This is the same 
kind of behavior I was seeing before: for some reason, nothing 
beyond compute node 0 runs.

Hung-Hsun
>  -- Pavan
>
> On 12/02/2009 04:39 PM, Hung-Hsun Su wrote:
>   
>> We are using the basic NFS system. It's not the MPI-I/O that I am
>> worried about; it's the collectives that worry me. I have a few MPI
>> programs, and they all die whenever any sort of collective is used (a few
>> give an error about overlapping memory space).  Any program that doesn't
>> use collectives works just fine.
>>
>> I've tried the "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname |
>> sort | uniq -c | sort -n" command and it basically just hangs, so I
>> presume the issue is something deeper.
>>
>> I ran "cluster-fork mpdcheck -v" and the output seems correct;
>> here is part of what I am getting:
>> [su at alpha ~]$ cluster-fork mpdcheck -v
>> compute-0-0:
>> obtaining hostname via gethostname and getfqdn
>> gethostname gives  compute-0-0.local
>> getfqdn gives  compute-0-0.local
>> checking out unqualified hostname; make sure is not "localhost", etc.
>> checking out qualified hostname; make sure is not "localhost", etc.
>> obtain IP addrs via qualified and unqualified hostnames;  make sure
>> other than 127.0.0.1
>> gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'],
>> ['10.255.255.254'])
>> gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'],
>> ['10.255.255.254'])
>> checking that IP addrs resolve to same host
>> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
>> compute-0-1:
>> obtaining hostname via gethostname and getfqdn
>> gethostname gives  compute-0-1.local
>> getfqdn gives  compute-0-1.local
>> checking out unqualified hostname; make sure is not "localhost", etc.
>> checking out qualified hostname; make sure is not "localhost", etc.
>> obtain IP addrs via qualified and unqualified hostnames;  make sure
>> other than 127.0.0.1
>> gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'],
>> ['10.255.255.253'])
>> gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'],
>> ['10.255.255.253'])
>> checking that IP addrs resolve to same host
>> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
>> ...
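(For reference: the core of what mpdcheck verifies above can be reproduced with a few lines of Python's standard socket module. This is just a sketch, not part of MPD, and the function name is hypothetical; it flags a hostname that resolves only to loopback addresses, which is a common cause of multi-node startup failures.)

```python
import socket

def check_host(name):
    # Resolve the name the same way mpdcheck does (gethostbyname_ex)
    # and flag results that are loopback-only: a node whose hostname
    # maps only to 127.x addresses cannot be reached by other nodes.
    fqdn, aliases, addrs = socket.gethostbyname_ex(name)
    ok = any(not a.startswith("127.") for a in addrs)
    return fqdn, addrs, ok

# "localhost" resolves only to loopback, so the check reports ok=False.
print(check_host("localhost"))
```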
>>
>> I've also tried older versions of gcc (4.2.4 and 4.3.4), and they all do
>> the same thing (with LD_LIBRARY_PATH=gcc/lib64 and PATH=gcc/bin added), so
>> at this point I am not really sure what the issue is. Do you think this
>> could be a hardware issue rather than a software one? Is there any way to
>> test this? (I've tried mpdringtest and mpdcheck, and they produce no errors.)
>>
>> Hung-Hsun
>>     
>>> Your new "make testing" run is having problems with the MPI-I/O
>>> tests.  What sort of file system are you running this on?
>>>
>>> As for the mpdboot problems, I'm not sure what is happening.  I would
>>> attempt to use the mpdcheck tool described in Appendix A of the
>>> Installer's Guide [1] to diagnose the problem.
>>>
>>> You might also be having trouble because you are running part of your
>>> jobs on the cluster head node.  MPD's mpiexec will attempt to run 1
>>> process locally first unless passed the "-1" option.  Unfortunately
>>> there is no easy way to pass that option to the "make testing" process.
>>>
>>> You can also try using hydra instead of MPD.  You should just be able
>>> to run "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname | sort |
>>> uniq -c | sort -n" to sanity check that it works (you should get 31
>>> lines, each starting with 4).  If it does work for you, you can
>>> rebuild MPICH2 with "--enable-pm=hydra,mpd" to make hydra the default
>>> mpiexec.  Hydra will use the hostfile specified on the command line
>>> but it will also look at the file specified by the $HYDRA_HOST_FILE
>>> environment variable.  See [2] for more information on using hydra.
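(Side note: the expected output of that sanity check can be predicted offline. This Python sketch uses hypothetical hostnames and assumes Hydra assigns processes round-robin over a plain host file, which is why 124 processes on 31 hosts should yield 31 lines each counting 4.)

```python
from collections import Counter

def expected_counts(nprocs, hosts):
    # Assign ranks to hosts round-robin (the assumed behavior with a
    # plain host file), then count processes per host -- the same
    # numbers that `hostname | sort | uniq -c` would print.
    placement = [hosts[rank % len(hosts)] for rank in range(nprocs)]
    return Counter(placement)

hosts = ["compute-0-%d" % i for i in range(31)]  # hypothetical node names
counts = expected_counts(124, hosts)
print(len(counts), set(counts.values()))  # 31 hosts, 4 processes each
```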
>>>
>>> -Dave
>>>
>>> [1]
>>> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.2-installguide.pdf
>>>
>>> [2]
>>> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>>>
>>> On Nov 24, 2009, at 12:50 PM, Hung-Hsun Su wrote:
>>>
>>>       
>>>> Unfortunately, the latest release did not solve the issue. It
>>>> actually introduces a new bug: after installation, mpdboot cannot set up
>>>> the environment correctly. It freezes when I make the following call.
>>>>
>>>> [su at alpha ~]$ which mpdboot
>>>> /home/su/software/mpich2-1.2.1/bin/mpdboot
>>>> [su at alpha ~]$ mpdboot -n 32 --ncpus=1 -f /home/su/mpd.hosts
>>>>
>>>> After I killed the process, I found that only 4 of the 31
>>>> compute nodes were set up:
>>>>
>>>> [su at alpha ~]$ mpdtrace
>>>> alpha
>>>> compute-0-3
>>>> compute-0-1
>>>> compute-0-0
>>>> compute-0-2
>>>>
>>>> I then tried setting up with the 1.2 version, and it works fine:
>>>>
>>>> [su at alpha ~]$ /home/su/software/mpich2-1.2/bin/mpdboot -n 32
>>>> --ncpus=1 -f /home/su/mpd.hosts
>>>> [su at alpha ~]$ mpdtrace
>>>> alpha
>>>> compute-0-3
>>>> compute-0-11
>>>> compute-0-10
>>>> compute-0-9
>>>> compute-0-8
>>>> compute-0-1
>>>> compute-0-15
>>>> compute-0-14
>>>> compute-0-13
>>>> compute-0-12
>>>> compute-0-0
>>>> compute-0-19
>>>> compute-0-27
>>>> compute-0-26
>>>> compute-0-25
>>>> compute-0-24
>>>> compute-0-18
>>>> compute-0-30
>>>> compute-0-29
>>>> compute-0-28
>>>> compute-0-17
>>>> compute-0-16
>>>> compute-0-2
>>>> compute-0-7
>>>> compute-0-6
>>>> compute-0-5
>>>> compute-0-4
>>>> compute-0-23
>>>> compute-0-22
>>>> compute-0-21
>>>> compute-0-20
>>>>
>>>> I then ran "make testing" and got even more errors. Does anyone have
>>>> an idea of what is going on?
>>>>
>>>> Hung-Hsun
>>>>
>>>> P.S. I've attached the output files from the various steps:
>>>> c.txt - configuration
>>>> m.txt - make
>>>> mi.txt - make install
>>>> mpd.hosts - my machine file
>>>> mtest.txt - make testing
>>>> summary.xml - output in test/mpi directory from make testing
>>>>
>>>>         
>>>>> Can you try the latest release 1.2.1?
>>>>>
>>>>> Rajeev
>>>>>
>>>>>           
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Hung-Hsun Su
>>>>>> Sent: Monday, November 23, 2009 11:16 AM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: [mpich-discuss] Need some help getting mpich to work
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering if anyone could help me figure out why my MPICH2
>>>>>> installation isn't working correctly. I downloaded the v1.2
>>>>>> release, configured with "configure
>>>>>> --prefix=/home/su/software/mpich2-1.2", ran make and make install, and
>>>>>> everything seemed fine (I've attached the three text outputs from
>>>>>> configure, make, and make install, which show no errors).  I then
>>>>>> checked whether my installation works correctly by running the
>>>>>> mpich-test suite (results given in summary.xml), and some of the
>>>>>> tests failed (collectives).  Does anyone know what might be the
>>>>>> cause of my problem? Thanks.
>>>>>>
>>>>>> System spec:
>>>>>> 32 nodes Quad-core Xeon cluster
>>>>>> Linux version 2.6.9-55.0.2.ELsmp (mockbuild at builder6.centos.org)
>>>>>> (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26
>>>>>> 14:14:47 EDT 2007
>>>>>>
>>>>>> Hung-Hsun
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>
>>>>>           
>>>>
>>>>
>>>>         
>>>
>>>       
>>     
>
>   


-- 

-----------------------------------------------------------------------------------------------------------
Sincerely,

Hung-Hsun Su

Ph.D. Student, UPC Group Leader, Research Assistant, Teaching Assistant
High-performance Computing and Simulation (HCS) Research Laboratory
Dept. of Electrical and Computer Engineering, University of Florida,
Gainesville, FL 32611-6200
Email: su at hcs.ufl.edu, hunghsun at ufl.edu
------------------------------------------------------------------------------------------------------------

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: output.txt
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091202/dfec125f/attachment-0001.txt>

