[mpich-discuss] Problems running mpi application on different CPUs

Gaetano Bellanca gaetano.bellanca at unife.it
Sat Oct 3 17:49:08 CDT 2009


Dear Rajeev,

problems only with allred. 750 errors!
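
For reference, allred is one of the collective tests built in test/mpi/coll
and run by hand as described below (make in that directory, then e.g.
mpiexec -n 5 ./allred). A minimal standalone check of the same collective,
outside the test suite, could look like the following sketch; the file name
allred_check.c and the compile/run commands are only illustrative.

/* allred_check.c: every rank contributes rank+1; after MPI_Allreduce
 * each rank must hold the sum n*(n+1)/2, where n is the process count. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen, in, out, expected;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    in = rank + 1;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    expected = size * (size + 1) / 2;
    printf("rank %d on %s: allreduce sum = %d (expected %d)\n",
           rank, host, out, expected);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc allred_check.c -o allred_check and run with the same
machinefile as the failing application (e.g. mpiexec -machinefile
my_machinefile -n 6 ./allred_check), wrong sums or startup failures on the
mixed server/client pool would point at the MPICH2 setup rather than at the
test suite or the application.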

Regards.

Gaetano


Rajeev Thakur wrote:
> Just run a few from the test/mpi/coll directory by hand. Run make in
> that directory, then do mpiexec -n 5 name_of_executable. If they run
> correctly, there may be a bug in your code.
>
> Rajeev 
>
>   
>> -----Original Message-----
>> From: Gaetano Bellanca [mailto:gaetano.bellanca at unife.it] 
>> Sent: Saturday, October 03, 2009 2:58 AM
>> To: Rajeev Thakur
>> Cc: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Problems running mpi
>> application on different CPUs
>>
>> Dear Rajeev
>>
>> The cpi test (and the other tests in the examples directory) works
>> without problems.
>> I tried to run make in the test directory and had this error message:
>>
>> make[2]: Entering directory
>> `/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/t
>> est/mpid/ch3'
>> make[2]: *** No rule to make target `../../../lib/lib.a', 
>> needed by `reorder'.  Stop.
>>
>> How should I run the tests from the ../test directory?
>> I tried 'make testing', but got a lot of unexpected output from mpd.
>>
>> Regards
>>
>> Gaetano
>>
>> Rajeev Thakur wrote:
>>     
>>> Try running the cpi example from the mpich2/examples directory. Try 
>>> running some of the tests in test/mpi.
>>>
>>> Rajeev
>>>
>>>   
>>>       
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano 
>>>> Bellanca
>>>> Sent: Tuesday, September 29, 2009 9:13 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] Problems running mpi application on
>>>> different CPUs
>>>>
>>>> Dear Rajeev,
>>>>
>>>> I tested as indicated in the appendix of the installation guide using
>>>> mpdcheck (and also changing the device using ch3:nemesis (thank you
>>>> for the suggestion, Cye)) but nothing changes in the behavior of the
>>>> code.
>>>> However, I noticed that changing the machinefile so that the server
>>>> machine is not listed first gives different behaviors. In particular,
>>>> running with mpiexec -machinefile my_machinefile -n 6 my_parallel_code
>>>> and:
>>>> 1) machinefile as follows
>>>> server
>>>> server
>>>> client1
>>>> client2
>>>> client3
>>>> client4
>>>> client5
>>>>
>>>> the code starts with the previous error:
>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>> MPIR_Init_thread(394): Initialization failed
>>>> (unknown)(): Other MPI error
>>>> rank 3 in job 8  c1_4545   caused collective abort of all ranks
>>>>  exit status of rank 3: return code 1
>>>>
>>>> 2) machinefile as follows
>>>> client2
>>>> client3
>>>> client4
>>>> client5
>>>> server
>>>> server
>>>>
>>>> the code starts with a SIGSEGV segmentation fault at the line of the
>>>> MPI_INIT call
>>>>
>>>> 3) machinefile as follows
>>>> client2
>>>> client3
>>>> server
>>>> server
>>>> client4
>>>> client5
>>>>
>>>> the code starts regularly, but stops in the first file-writing
>>>> procedure. It produces a 0-byte file, does not advance to any other
>>>> procedure, and I have to kill it to terminate.
>>>>
>>>> Could it be something related to synchronization/timeouts between the
>>>> different machines?
>>>> On another cluster (all Pentium IV 3GHz), the same program is slower
>>>> to start when launched, but everything works fine.
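
A stripped-down program that does only these two steps (MPI_Init, then a
first file write from rank 0) can help separate an MPICH2/startup problem
from a bug in the application itself. The following is only an illustrative
sketch; the file name init_write_check.c and the output file name are
arbitrary.

/* init_write_check.c: initialize MPI, report host and rank, then let
 * rank 0 write a small file, mimicking the first writing procedure. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);
    printf("rank %d of %d running on %s\n", rank, size, host);

    if (rank == 0) {
        FILE *fp = fopen("init_write_check.out", "w");
        if (fp != NULL) {
            fprintf(fp, "written by rank 0 on %s\n", host);
            fclose(fp);
        }
    }

    /* Make every rank wait for the write, like a collective step that
     * follows the output phase in the real code. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("all %d ranks passed the barrier\n", size);

    MPI_Finalize();
    return 0;
}

Running it with the same machinefile orderings as cases 1)-3) above should
show whether the MPI_Init failure, the SIGSEGV, or the hang at the first
file write also appears with a trivial program; if it does, the problem is
likely in the MPICH2 installation or network setup rather than in the
application.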
>>>>
>>>> Regards.
>>>>
>>>> Gaetano
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Rajeev Thakur wrote:
>>>>     
>>>>         
>>>>> ch3:sock won't perform as well as ch3:nemesis though.
>>>>>  
>>>>> Rajeev
>>>>>
>>>>>     ------------------------------------------------------------------------
>>>>>     *From:* mpich-discuss-bounces at mcs.anl.gov
>>>>>     [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
>>>>>     *Sent:* Monday, September 28, 2009 4:32 PM
>>>>>     *To:* mpich-discuss at mcs.anl.gov
>>>>>     *Subject:* Re: [mpich-discuss] Problems running mpi application on
>>>>>     different CPUs
>>>>>
>>>>>     When deploying MPICH2 to a small cluster, I noticed that many had
>>>>>     problems with the "--with-device=ch3:nemesis"
>>>>>     Try using the "--with-device=ch3:sock" interface instead.
>>>>>      
>>>>>     Cye
>>>>>
>>>>>     On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
>>>>>     <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>>>>>
>>>>>         Try using the mpdcheck utility to debug as described in the
>>>>>         appendix of the installation guide. Pick one client and the
>>>>>         server.
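
(For reference, the appendix procedure is roughly: run mpdcheck -s on one
of the two machines, note the host name and port it prints, then run
mpdcheck -c <that_host> <that_port> on the other machine, and repeat with
the roles swapped. The exact output varies between MPICH2 versions, so
treat this as a sketch and follow the installation guide for details.)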
>>>>>
>>>>>         Rajeev
>>>>>
>>>>>         > -----Original Message-----
>>>>>         > From: mpich-discuss-bounces at mcs.anl.gov
>>>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
>>>>>         > [mailto:mpich-discuss-bounces at mcs.anl.gov
>>>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
>>>>>         > Gaetano Bellanca
>>>>>         > Sent: Monday, September 28, 2009 6:00 AM
>>>>>         > Cc: mpich-discuss at mcs.anl.gov
>>>>>         <mailto:mpich-discuss at mcs.anl.gov>
>>>>>         > Subject: Re: [mpich-discuss] Problems running mpi application
>>>>>         > on different CPUs
>>>>>         >
>>>>>         > Dear Rajeev,
>>>>>         >
>>>>>         > thanks for your help. I disabled the firewall on the server
>>>>>         > (the only one running) and tried every other combination.
>>>>>         > All the clients together run correctly, and the same goes
>>>>>         > for the processors on the server separately.
>>>>>         > The problem appears only when I mix processes on the server
>>>>>         > and on the clients.
>>>>>         >
>>>>>         > When I run mpdtrace on the server, all the CPUs respond
>>>>>         > correctly.
>>>>>         > The same happens if I run 'hostname' in parallel.
>>>>>         >
>>>>>         > It is probably a problem in my code, but the code works on a
>>>>>         > cluster of 10 Pentium IV PEs.
>>>>>         > I discovered a 'strange behavior':
>>>>>         > 1) running the code with the server as the first machine of
>>>>>         > the pool, the code hangs with the previously reported error;
>>>>>         > 2) if I put the server as the second machine of the pool,
>>>>>         > the code starts and works regularly up to the writing
>>>>>         > procedures, opens the first file and then waits indefinitely
>>>>>         > for something;
>>>>>         >
>>>>>         > Should I compile mpich2 with some particular communication
>>>>>         > device? I have nemesis at the moment.
>>>>>         > I'm using this for the mpich2 compilation:
>>>>>         > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
>>>>>         > --enable-f90 --enable-fast --enable-traceback --with-mpe
>>>>>         > --enable-f90modules --enable-cache --enable-romio
>>>>>         > --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis
>>>>>         > --with-pvfs2=/usr/local
>>>>>         > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/
>>>>>         > --with-pm=mpd:hydra
>>>>>         >
>>>>>         > Regards
>>>>>         >
>>>>>         > Gaetano
>>>>>         >
>>>>>         > Rajeev Thakur wrote:
>>>>>         > > Try running on smaller subsets of the machines to debug
>>>>>         > > the problem. It is possible that a process on some machine
>>>>>         > > cannot connect to another because of some firewall settings.
>>>>>         > >
>>>>>         > > Rajeev
>>>>>         > >
>>>>>         > >
>>>>>         > >> -----Original Message-----
>>>>>         > >> From: mpich-discuss-bounces at mcs.anl.gov
>>>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
>>>>>         > >> [mailto:mpich-discuss-bounces at mcs.anl.gov
>>>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
>>>>>         > Gaetano Bellanca
>>>>>         > >> Sent: Saturday, September 26, 2009 7:10 AM
>>>>>         > >> To: mpich-discuss at mcs.anl.gov
>>>>>         <mailto:mpich-discuss at mcs.anl.gov>
>>>>>         > >> Subject: [mpich-discuss] Problems running mpi application
>>>>>         > >> on different CPUs
>>>>>         > >>
>>>>>         > >> Hi,
>>>>>         > >>
>>>>>         > >> I'm sorry, but I posted my previous message with the
>>>>>         > >> wrong subject!!!
>>>>>         > >>
>>>>>         > >> I have a small cluster of
>>>>>         > >> a) 1 server: dual processor / quad core Intel(R) Xeon(R)
>>>>>         > >> CPU E5345
>>>>>         > >> b) 4 clients: single processor / dual core Intel(R)
>>>>>         > >> Core(TM)2 Duo CPU E8400, connected with a 1Gbit/s
>>>>>         > >> ethernet network.
>>>>>         > >>
>>>>>         > >> I compiled mpich2-1.1.1p1 on the first system (a) and
>>>>>         > >> share mpich with the other computers via nfs. I have mpd
>>>>>         > >> running as root on all the computers (Ubuntu 8.04, kernel
>>>>>         > >> 2.6.24-24-server).
>>>>>         > >>
>>>>>         > >> When I run my code in parallel on the first system, it
>>>>>         > >> works correctly; the same happens running the same code
>>>>>         > >> in parallel on the other computers (always launching the
>>>>>         > >> code from the server). On the contrary, when I run the
>>>>>         > >> code using processors from both the server and the
>>>>>         > >> clients at the same time with the command:
>>>>>         > >>
>>>>>         > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
>>>>>         > >>
>>>>>         > >> I receive this error message:
>>>>>         > >>
>>>>>         > >> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>         > >> MPIR_Init_thread(394): Initialization failed
>>>>>         > >> (unknown)(): Other MPI error
>>>>>         > >> rank 3 in job 8  c1_4545   caused collective abort of all
>>>>>         > >> ranks
>>>>>         > >>  exit status of rank 3: return code 1
>>>>>         > >>
>>>>>         > >> Should I use some particular flags at compile time or at
>>>>>         > >> run time?
>>>>>         > >>
>>>>>         > >> Regards.
>>>>>         > >>
>>>>>         > >> Gaetano
>>>>>         > >>
>>>>>         > >> --
>>>>>         > >> Gaetano Bellanca - Department of Engineering - University
>>>>>         > >> of Ferrara
>>>>>         > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>>>         > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>>>>>         > >> mailto:gaetano.bellanca at unife.it
>>>>>         > >>
>>>>>         > >> Education is expensive? They're trying ignorance!
>>>>>         > >>
>>>>>         > >>
>>>>>         > >>
>>>>>         > >
>>>>>         > >
>>>>>         > >
>>>>>         >
>>>>>         > --
>>>>>         > Gaetano Bellanca - Department of Engineering - University of
>>>>>         > Ferrara
>>>>>         > Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>>>         > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>>>>>         > mailto:gaetano.bellanca at unife.it
>>>>>         >
>>>>>         > Education is expensive? They're trying ignorance!
>>>>>         >
>>>>>         >
>>>>>         >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>     -- 
>>>>>     "If you already know what recursion is, just remember the answer.
>>>>>     Otherwise, find someone who is standing closer to
>>>>>     Douglas Hofstadter than you are; then ask him or her what
>>>>>     recursion is." - Andrew Plotkin
>>>>>
>>>>>       
>>>>>           
>>>> --
>>>> Gaetano Bellanca - Department of Engineering - University of Ferrara
>>>> Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>>>> mailto:gaetano.bellanca at unife.it
>>>>
>>>> Education is expensive? They're trying ignorance!
>>>>
>>>>
>>>>     
>>>>         
>>>   
>>>       
>> --
>> Gaetano Bellanca - Department of Engineering - University of Ferrara
>> Via Saragat, 1 - 44100 - Ferrara - ITALY
>> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>> mailto:gaetano.bellanca at unife.it
>>
>> Education is expensive? They're trying ignorance!
>>
>>
>>     
>
>
>   

-- 
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it

Education is expensive? They're trying ignorance!


