[mpich-discuss] Need some help getting mpich to work

Hung-Hsun Su su at hcs.ufl.edu
Wed Dec 2 16:39:54 CST 2009


We are using a basic NFS setup. It's not the MPI-I/O tests that I am 
worried about; it's the collectives. I have a few MPI programs, and 
they all die whenever any sort of collective is used (a few report an 
error about overlapping memory). Any program that doesn't use 
collectives works just fine.

I've tried the "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname | 
sort | uniq -c | sort -n" command, and it just hangs, so I presume the 
issue is something deeper.

I ran "cluster-fork mpdcheck -v" and the output seems to be correct; 
here is part of the output I am getting:
[su at alpha ~]$ cluster-fork mpdcheck -v
compute-0-0:
obtaining hostname via gethostname and getfqdn
gethostname gives  compute-0-0.local
getfqdn gives  compute-0-0.local
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure 
other than 127.0.0.1
gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'], 
['10.255.255.254'])
gethostbyname_ex:  ('compute-0-0.local', ['compute-0-0'], 
['10.255.255.254'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
compute-0-1:
obtaining hostname via gethostname and getfqdn
gethostname gives  compute-0-1.local
getfqdn gives  compute-0-1.local
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure 
other than 127.0.0.1
gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'], 
['10.255.255.253'])
gethostbyname_ex:  ('compute-0-1.local', ['compute-0-1'], 
['10.255.255.253'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
...

I've also tried other versions of gcc (4.2.4 and 4.3.4, with 
LD_LIBRARY_PATH=gcc/lib64 and PATH=gcc/bin added), and they all do the 
same thing, so at this point I am not really sure what the issue is. Do 
you think this could be a hardware issue rather than software? Is there 
any way to test for that? (I've tried mpdringtest and mpdcheck and they 
produce no errors.)
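One way to start separating hardware from software, assuming mpdcheck supports the server/client mode described in the MPICH2 Installer's Guide, is to exercise a raw connection between a pair of nodes and, separately, to run a collective entirely within one node so the interconnect is out of the picture. A sketch (host names taken from the output above; <port> is a placeholder for whatever the server prints, not a real value):

```shell
# On compute-0-0: start mpdcheck as a server; it prints a host and port.
mpdcheck -s

# On compute-0-1: connect back using the host/port printed above.
mpdcheck -c compute-0-0 <port>

# Separately, run a collective-using program with all ranks on a single
# node; if it passes there but fails across nodes, the network (or its
# configuration) is a more likely suspect than the MPICH build itself.
mpiexec -n 4 ./a_collective_test
```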

Hung-Hsun
> Your new "make testing" run is having problems with the MPI-I/O 
> tests.  What sort of file system are you running this on?
>
> As for the mpdboot problems, I'm not sure what is happening.  I would 
> attempt to use the mpdcheck tool described in Appendix A of the 
> Installer's Guide [1] to diagnose the problem.
>
> You might also be having trouble because you are running part of your 
> jobs on the cluster head node.  MPD's mpiexec will attempt to run 1 
> process locally first unless passed the "-1" option.  Unfortunately 
> there is no easy way to pass that option to the "make testing" process.
>
> You can also try using hydra instead of MPD.  You should just be able 
> to run "mpiexec.hydra -f /home/su/mpd.hosts -n 124 hostname | sort | 
> uniq -c | sort -n" to sanity check that it works (you should get 31 
> lines, each starting with 4).  If it does work for you, you can 
> rebuild MPICH2 with "--enable-pm=hydra,mpd" to make hydra the default 
> mpiexec.  Hydra will use the hostfile specified on the command line 
> but it will also look at the file specified by the $HYDRA_HOST_FILE 
> environment variable.  See [2] for more information on using hydra.
>
> -Dave
>
> [1] 
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.2-installguide.pdf 
>
> [2] 
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
>
> On Nov 24, 2009, at 12:50 PM, Hung-Hsun Su wrote:
>
>> Unfortunately, the latest release did not solve the issue. It 
>> actually introduces a new bug. After installation, mpdboot cannot set 
>> up the environment correctly. It freezes when I make the following call:
>>
>> [su at alpha ~]$ which mpdboot
>> /home/su/software/mpich2-1.2.1/bin/mpdboot
>> [su at alpha ~]$ mpdboot -n 32 --ncpus=1 -f /home/su/mpd.hosts
>>
>> After I killed the process, I found that only 4 of the 31 
>> compute nodes had been set up:
>>
>> [su at alpha ~]$ mpdtrace
>> alpha
>> compute-0-3
>> compute-0-1
>> compute-0-0
>> compute-0-2
>>
>> I then tried setting up using the 1.2 version, and it worked fine:
>>
>> [su at alpha ~]$ /home/su/software/mpich2-1.2/bin/mpdboot -n 32 
>> --ncpus=1 -f /home/su/mpd.hosts
>> [su at alpha ~]$ mpdtrace
>> alpha
>> compute-0-3
>> compute-0-11
>> compute-0-10
>> compute-0-9
>> compute-0-8
>> compute-0-1
>> compute-0-15
>> compute-0-14
>> compute-0-13
>> compute-0-12
>> compute-0-0
>> compute-0-19
>> compute-0-27
>> compute-0-26
>> compute-0-25
>> compute-0-24
>> compute-0-18
>> compute-0-30
>> compute-0-29
>> compute-0-28
>> compute-0-17
>> compute-0-16
>> compute-0-2
>> compute-0-7
>> compute-0-6
>> compute-0-5
>> compute-0-4
>> compute-0-23
>> compute-0-22
>> compute-0-21
>> compute-0-20
>>
>> I then ran make testing and got even more errors. Does anyone have 
>> an idea of what is going on?
>>
>> Hung-Hsun
>>
>> PS. I've attached the output files from various steps
>> c.txt - configuration
>> m.txt - make
>> mi.txt - make install
>> mpd.hosts - my machine file
>> mtest.txt - make testing
>> summary.xml - output in test/mpi directory from make testing
>>
>>> Can you try the latest release 1.2.1?
>>>
>>> Rajeev
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Hung-Hsun Su
>>>> Sent: Monday, November 23, 2009 11:16 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: [mpich-discuss] Need some help getting mpich to work
>>>>
>>>> Hi,
>>>>
>>>> I was wondering if anyone could help me figure out why my MPICH2 
>>>> installation isn't working correctly. I downloaded the v1.2 
>>>> release, configured with "configure 
>>>> --prefix=/home/su/software/mpich2-1.2", ran make and make install, 
>>>> and everything seemed fine (I've attached the 3 txt outputs from 
>>>> configure, make, and make install, which show no errors). I then 
>>>> checked whether my installation works correctly by running the 
>>>> mpich-test suite (results given in summary.xml), and some of the 
>>>> tests failed (the collective ones). Does anyone know what might be 
>>>> the cause of my problem? Thanks.
>>>>
>>>> System spec:
>>>> 32-node quad-core Xeon cluster
>>>> Linux version 2.6.9-55.0.2.ELsmp (mockbuild at builder6.centos.org) 
>>>> (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 26 
>>>> 14:14:47 EDT 2007
>>>>
>>>> Hung-Hsun
>>>>
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>
>>
>


-- 

-----------------------------------------------------------------------------------------------------------
Sincerely,
 
Hung-Hsun Su
 
Ph.D. Student, UPC Group Leader, Research Assistant, Teaching Assistant
High-performance Computing and Simulation (HCS) Research Laboratory
Dept. of Electrical and Computer Engineering, University of Florida,
Gainesville, FL 32611-6200
Email: su at hcs.ufl.edu, hunghsun at ufl.edu
------------------------------------------------------------------------------------------------------------


