[mpich-discuss] Problems running mpi application on different CPUs

Gaetano Bellanca gaetano.bellanca at unife.it
Sat Oct 3 02:57:35 CDT 2009


Dear Rajeev,

The cpi test (and the other tests in the examples directory) works without problems.
I tried to run make in the test directory and got this error message:

make[2]: Entering directory 
`/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test/mpid/ch3'
make[2]: *** No rule to make target `../../../lib/lib.a', needed by 
`reorder'.  Stop.

How should I run the tests from the ../test directory?
I tried 'make testing', but I got a lot of unexpected output from mpd.
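In case it matters, what I tried was roughly:

    cd /home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test
    make testing

in the same tree where the library was built.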

Regards

Gaetano

Rajeev Thakur wrote:
> Try running the cpi example from the mpich2/examples directory. Try
> running some of the tests in test/mpi.
>
> Rajeev 
>
>   
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
>> Gaetano Bellanca
>> Sent: Tuesday, September 29, 2009 9:13 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Problems running mpi 
>> application on different CPUs
>>
>> Dear Rajeev,
>>
>> I tested as indicated in the appendix of the installation 
>> guide using mpdcheck (and also tried changing the device from 
>> ch3:nemesis to ch3:sock; thank you for the suggestion, Cye), 
>> but nothing changes in the behavior of the code.
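>> For reference, the mpdcheck sequence I used was roughly the one 
>> from the appendix, with "server" and "client1" standing in for 
>> the actual host names:
>>
>>     server$  mpdcheck -s              # prints a hostname and a port
>>     client1$ mpdcheck -c server <port printed by the server>
>>
>> and then the same test repeated in the opposite direction.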
>> However, I noticed that if I change the machinefile so that the 
>> server machine is not the first entry, I get different 
>> behaviors. In particular, running with mpiexec -machinefile 
>> my_machinefile -n 6 my_parallel_code and:
>> 1) machinefile as follows
>> server
>> server
>> client1
>> client2
>> client3
>> client4
>> client5
>>
>> the code fails at startup with the previous error:
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(394): Initialization failed
>> (unknown)(): Other MPI error
>> rank 3 in job 8  c1_4545   caused collective abort of all ranks
>>  exit status of rank 3: return code 1
>>
>> 2) machinefile as follows
>> client2
>> client3
>> client4
>> client5
>> server
>> server
>>
>> the code fails with a SIGSEGV segmentation fault at the 
>> MPI_INIT line
>>
>> 3) machinefile as follows
>> client2
>> client3
>> server
>> server
>> client4
>> client5
>>
>> the code starts regularly, but stops working in the first 
>> file-writing procedure.
>> It produces a file of 0 bytes, does not advance to any 
>> other procedure, and I have to kill it to terminate.
>>
>> Could it be related to some synchronization/timeout issue 
>> between the different machines?
>> On another cluster (all Pentium IV 3 GHz), the same program is 
>> slower to start when launched, but everything works fine.
>>
>> Regards.
>>
>> Gaetano
>>
>>
>>
>>
>>
>> Rajeev Thakur wrote:
>>     
>>> ch3:sock won't perform as well as ch3:nemesis though.
>>>  
>>> Rajeev
>>>
>>>     
>>>       
>> ------------------------------------------------------------------------
>>>     *From:* mpich-discuss-bounces at mcs.anl.gov
>>>     [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
>>>     *Sent:* Monday, September 28, 2009 4:32 PM
>>>     *To:* mpich-discuss at mcs.anl.gov
>>>     *Subject:* Re: [mpich-discuss] Problems running mpi application on
>>>     different CPUs
>>>
>>>     When deploying MPICH2 to a small cluster, I noticed that many had
>>>     problems with the "--with-device=ch3:nemesis" option.
>>>     Try using the "--with-device=ch3:sock" interface instead.
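>>>     That is, reconfiguring with something along the lines of
>>>     "./configure --with-device=ch3:sock ..." instead of
>>>     ch3:nemesis, keeping the other options unchanged.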
>>>      
>>>     Cye
>>>
>>>     On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
>>>     <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>>>
>>>         Try using the mpdcheck utility to debug as described in the
>>>         appendix of
>>>         the installation guide. Pick one client and the server.
>>>
>>>         Rajeev
>>>
>>>         > -----Original Message-----
>>>         > From: mpich-discuss-bounces at mcs.anl.gov
>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
>>>         > [mailto:mpich-discuss-bounces at mcs.anl.gov
>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
>>>         > Gaetano Bellanca
>>>         > Sent: Monday, September 28, 2009 6:00 AM
>>>         > Cc: mpich-discuss at mcs.anl.gov
>>>         <mailto:mpich-discuss at mcs.anl.gov>
>>>         > Subject: Re: [mpich-discuss] Problems running mpi application
>>>         > on different CPUs
>>>         >
>>>         > Dear Rajeev,
>>>         >
>>>         > thanks for your help. I disabled the firewall on the server
>>>         > (the only one running) and tried all the other combinations.
>>>         > All the clients together are running correctly. The same for
>>>         > the processors on the server separately.
>>>         > The problem is only when I mix processes on the server and on
>>>         > the client.
>>>         >
>>>         > When I run mpdtrace on the server, all the CPUs are
>>>         > responding correctly.
>>>         > The same happens if I run 'hostname' in parallel.
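>>>         > (Concretely, something like
>>>         >   mpdtrace -l
>>>         > and
>>>         >   mpiexec -machinefile machinefile -n 4 hostname
>>>         > both list all the machines as expected.)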
>>>         >
>>>         > Probably it is a problem with my code, but it works on a
>>>         > cluster of 10 Pentium IV PEs.
>>>         > I discovered a 'strange behavior':
>>>         > 1) running the code with the server as the first machine of
>>>         > the pool, the code hangs with the previously communicated
>>>         > error;
>>>         > 2) if I put the server as the second machine of the pool, the
>>>         > code starts and works regularly up to the writing procedures,
>>>         > opens the first file and then waits indefinitely for
>>>         > something;
>>>         >
>>>         > Should I compile mpich2 with some particular channel? I
>>>         > have nemesis at the moment.
>>>         > I'm using this for the mpich2 compilation:
>>>         > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
>>>         > --enable-f90 --enable-fast --enable-traceback --with-mpe
>>>         > --enable-f90modules --enable-cache --enable-romio
>>>         > --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis
>>>         > --with-pvfs2=/usr/local
>>>         > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
>>>         >
>>>         > Regards
>>>         >
>>>         > Gaetano
>>>         >
>>>         > Rajeev Thakur wrote:
>>>         > > Try running on smaller subsets of the machines to debug the
>>>         > > problem. It is possible that a process on some machine
>>>         > > cannot connect to another because of some firewall settings.
>>>         > >
>>>         > > Rajeev
>>>         > >
>>>         > >
>>>         > >> -----Original Message-----
>>>         > >> From: mpich-discuss-bounces at mcs.anl.gov
>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
>>>         > >> [mailto:mpich-discuss-bounces at mcs.anl.gov
>>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
>>>         > Gaetano Bellanca
>>>         > >> Sent: Saturday, September 26, 2009 7:10 AM
>>>         > >> To: mpich-discuss at mcs.anl.gov
>>>         <mailto:mpich-discuss at mcs.anl.gov>
>>>         > >> Subject: [mpich-discuss] Problems running mpi application
>>>         > >> on different CPUs
>>>         > >>
>>>         > >> Hi,
>>>         > >>
>>>         > >> I'm sorry, but I posted my previous message with the wrong
>>>         > >> Subject!
>>>         > >>
>>>         > >> I have a small cluster of
>>>         > >> a) 1 server: dual processor / quad core Intel(R) Xeon(R)
>>>         > >> CPU E5345
>>>         > >> b) 4 clients: single processor / dual core Intel(R)
>>>         > >> Core(TM)2 Duo CPU E8400, connected with a 1 Gbit/s Ethernet
>>>         > >> network.
>>>         > >>
>>>         > >> I compiled mpich2-1.1.1p1 on the first system (a) and
>>>         > >> share mpich with the other computers via nfs. I have mpd
>>>         > >> running as root on all the computers (Ubuntu 8.04,
>>>         > >> kernel 2.6.24-24-server).
>>>         > >>
>>>         > >> When I run my code in parallel on the first system, it
>>>         > >> works correctly; the same happens running the same code in
>>>         > >> parallel on the other computers (always running the code
>>>         > >> from the server). On the contrary, running the code using
>>>         > >> processors from both the server and the clients at the same
>>>         > >> time with the command:
>>>         > >>
>>>         > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
>>>         > >>
>>>         > >> I receive this error message:
>>>         > >>
>>>         > >> Fatal error in MPI_Init: Other MPI error, error stack:
>>>         > >> MPIR_Init_thread(394): Initialization failed
>>>         > >> (unknown)(): Other MPI error
>>>         > >> rank 3 in job 8  c1_4545   caused collective abort of all
>>>         > >> ranks
>>>         > >>  exit status of rank 3: return code 1
>>>         > >>
>>>         > >> Should I use some particular flags at compile time or at
>>>         > >> run time?
>>>         > >>
>>>         > >> Regards.
>>>         > >>
>>>         > >> Gaetano
>>>         > >>
>>>         > >> --
>>>         > >> Gaetano Bellanca - Department of Engineering - University
>>>         > >> of Ferrara
>>>         > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>         > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>>>         > >> mailto:gaetano.bellanca at unife.it
>>>         > >>
>>>         > >> Education is expensive? They are trying ignorance!
>>>         > >>
>>>         > >>
>>>         > >>
>>>         > >
>>>         > >
>>>         > >
>>>         >
>>>         > --
>>>         > Gaetano Bellanca - Department of Engineering - University of
>>>         > Ferrara
>>>         > Via Saragat, 1 - 44100 - Ferrara - ITALY
>>>         > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>>>         > mailto:gaetano.bellanca at unife.it
>>>         >
>>>         > Education is expensive? They are trying ignorance!
>>>         >
>>>         >
>>>         >
>>>
>>>
>>>
>>>
>>>     -- 
>>>     "If you already know what recursion is, just remember 
>>>       
>> the answer.
>>     
>>>     Otherwise, find someone who is standing closer to
>>>     Douglas Hofstadter than you are; then ask him or her what
>>>     recursion is." - Andrew Plotkin
>>>
>>>       
>> --
>> Gaetano Bellanca - Department of Engineering - University of Ferrara
>> Via Saragat, 1 - 44100 - Ferrara - ITALY
>> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
>> mailto:gaetano.bellanca at unife.it
>>
>> Education is expensive? They are trying ignorance!
>>
>>
>>     
>
>
>   

-- 
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it

Education is expensive? They are trying ignorance!


