[mpich-discuss] Problems running mpi application on different CPUs

Rajeev Thakur thakur at mcs.anl.gov
Tue Sep 29 10:07:20 CDT 2009


Try running the cpi example from the mpich2/examples directory. Try
running some of the tests in test/mpi.
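
For example, something along these lines (the path is illustrative; 
if cpi is not already built there, "mpicc cpi.c -o cpi" in that 
directory should produce it):

    mpiexec -machinefile machinefile -n 6 /path/to/mpich2-1.1.1p1/examples/cpi

If cpi fails in the same way when it spans the server and the 
clients, the problem is likely in the MPICH2/network setup rather 
than in your application.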

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
> Gaetano Bellanca
> Sent: Tuesday, September 29, 2009 9:13 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problems running mpi 
> application on different CPUs
> 
> Dear Rajeev,
> 
> I tested as indicated in the appendix of the installation 
> guide using mpdcheck (and also changed the device I was using, 
> ch3:nemesis, to ch3:sock; thank you for the suggestion, Cye), 
> but nothing changes in the behavior of the code.
> However, I noticed that if I change the machinefile so that 
> the server is not the first machine, I get different 
> behaviors. In particular, running with mpiexec -machinefile 
> my_machinefile -n 6 my_parallel_code and:
> 1) machinefile as follows
> server
> server
> client1
> client2
> client3
> client4
> client5
> 
> the code fails at startup with the previous error:
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(394): Initialization failed
> (unknown)(): Other MPI error
> rank 3 in job 8  c1_4545   caused collective abort of all ranks
>  exit status of rank 3: return code 1
> 
> 2) machinefile as follows
> client2
> client3
> client4
> client5
> server
> server
> 
> the code fails with a SIGSEGV (segmentation fault) at the 
> MPI_INIT call
> 
> 3) machinefile as follows
> client2
> client3
> server
> server
> client4
> client5
> 
> the code starts normally, but hangs in the first 
> file-writing procedure.
> It produces a 0-byte file, does not advance to any 
> other procedure, and I have to kill it to terminate.
> 
> Could it be something related to synchronization/timeouts 
> between the different machines?
> On another cluster (all Pentium IV 3 GHz), the same program is 
> slower to start when launched, but everything works fine.
> 
> Regards.
> 
> Gaetano
> 
> 
> 
> 
> 
> Rajeev Thakur wrote:
> > ch3:sock won't perform as well as ch3:nemesis though.
> >  
> > Rajeev
> >
> >     
> ------------------------------------------------------------------------
> >     *From:* mpich-discuss-bounces at mcs.anl.gov
> >     [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* 
> >     Cye Stoner
> >     *Sent:* Monday, September 28, 2009 4:32 PM
> >     *To:* mpich-discuss at mcs.anl.gov
> >     *Subject:* Re: [mpich-discuss] Problems running mpi 
> >     application on different CPUs
> >
> >     When deploying MPICH2 to a small cluster, I noticed that many 
> >     had problems with the "--with-device=ch3:nemesis" device.
> >     Try using the "--with-device=ch3:sock" interface instead.
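> >
> >     (An illustrative sketch only, keeping the rest of your 
> >     configure options unchanged:
> >
> >         ./configure --with-device=ch3:sock ...
> >
> >     followed by a full rebuild and "make install".)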
> >      
> >     Cye
> >
> >     On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
> >     <thakur at mcs.anl.gov> wrote:
> >
> >         Try using the mpdcheck utility to debug as described in the
> >         appendix of
> >         the installation guide. Pick one client and the server.
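> >
> >         As a rough illustration of what the guide describes (the 
> >         exact output may differ in your version): on one of the 
> >         two machines run
> >
> >             mpdcheck -s
> >
> >         note the hostname and port it prints, then on the other run
> >
> >             mpdcheck -c <hostname> <port>
> >
> >         and then repeat the test with the two machines swapped.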
> >
> >         Rajeev
> >
> >         > -----Original Message-----
> >         > From: mpich-discuss-bounces at mcs.anl.gov
> >         > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> >         > Gaetano Bellanca
> >         > Sent: Monday, September 28, 2009 6:00 AM
> >         > Cc: mpich-discuss at mcs.anl.gov
> >         > Subject: Re: [mpich-discuss] Problems running mpi 
> >         > application on different CPUs
> >         >
> >         > Dear Rajeev,
> >         >
> >         > thanks for your help. I disabled the firewall on 
> >         > the server (the only one running a firewall) and 
> >         > tried every other combination.
> >         > All the clients together run correctly, and the 
> >         > same for the processes on the server alone.
> >         > The problem appears only when I mix processes on 
> >         > the server and on the clients.
> >         >
> >         > When I run mpdtrace on the server, all the CPUs are
> >         > responding correctly.
> >         > The same happens if I run 'hostname' in parallel.
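> >         > For example (an illustrative invocation with the same 
> >         > machinefile):
> >         >     mpiexec -machinefile machinefile -n 4 hostname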
> >         >
> >         > Probably it is a problem in my code, but it works on a 
> >         > cluster of 10 Pentium IV PEs.
> >         > I discovered a 'strange behavior':
> >         > 1) running the code with the server as the first machine 
> >         > of the pool, the code hangs with the previously 
> >         > communicated error;
> >         > 2) if I put the server as the second machine of the pool, 
> >         > the code starts and works normally up to the writing 
> >         > procedures, opens the first file and then waits 
> >         > indefinitely for something;
> >         >
> >         > Should I compile mpich2 with some particular channel 
> >         > (device)? I have nemesis at the moment.
> >         > I'm using this for mpich2 compilation:
> >         > ./configure --prefix=/opt/mpich2/1.1/intel11.1 
> >         >   --enable-cxx --enable-f90 --enable-fast 
> >         >   --enable-traceback --with-mpe --enable-f90modules 
> >         >   --enable-cache --enable-romio 
> >         >   --with-file-system=nfs+ufs+pvfs2 
> >         >   --with-device=ch3:nemesis --with-pvfs2=/usr/local 
> >         >   --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ 
> >         >   --with-pm=mpd:hydra
> >         >
> >         > Regards
> >         >
> >         > Gaetano
> >         >
> >         > Rajeev Thakur wrote:
> >         > > Try running on smaller subsets of the machines to 
> >         > > debug the problem. It is possible that a process on 
> >         > > some machine cannot connect to another because of 
> >         > > some firewall settings.
> >         > >
> >         > > Rajeev
> >         > >
> >         > >
> >         > >> -----Original Message-----
> >         > >> From: mpich-discuss-bounces at mcs.anl.gov
> >         > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> >         > >> Gaetano Bellanca
> >         > >> Sent: Saturday, September 26, 2009 7:10 AM
> >         > >> To: mpich-discuss at mcs.anl.gov
> >         > >> Subject: [mpich-discuss] Problems running mpi 
> >         > >> application on different CPUs
> >         > >>
> >         > >> Hi,
> >         > >>
> >         > >> I'm sorry, but I posted my previous message with the 
> >         > >> wrong subject!
> >         > >>
> >         > >> I have a small cluster of
> >         > >> a) 1 server: dual processor / quad core Intel(R) 
> >         > >>    Xeon(R) CPU E5345
> >         > >> b) 4 clients: single processor / dual core Intel(R) 
> >         > >>    Core(TM)2 Duo CPU E8400
> >         > >> connected with a 1 Gbit/s ethernet network.
> >         > >>
> >         > >> I compiled mpich2-1.1.1p1 on the first system (a) and 
> >         > >> share mpich on the other computers via nfs. I have mpd 
> >         > >> running as root on all the computers (Ubuntu 8.04, 
> >         > >> kernel 2.6.24-24-server).
> >         > >>
> >         > >> When I run my code in parallel on the first system, it 
> >         > >> works correctly; the same happens running the same code 
> >         > >> in parallel on the other computers (always launching the 
> >         > >> code from the server). However, running the code using 
> >         > >> processors from both the server and the clients at the 
> >         > >> same time with the command:
> >         > >>
> >         > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
> >         > >>
> >         > >> I receive this error message:
> >         > >>
> >         > >> Fatal error in MPI_Init: Other MPI error, error stack:
> >         > >> MPIR_Init_thread(394): Initialization failed
> >         > >> (unknown)(): Other MPI error
> >         > >> rank 3 in job 8  c1_4545   caused collective abort 
> >         > >> of all ranks
> >         > >>  exit status of rank 3: return code 1
> >         > >>
> >         > >> Should I use some particular flags at compile time or 
> >         > >> at run time?
> >         > >>
> >         > >> Regards.
> >         > >>
> >         > >> Gaetano
> >         > >>
> >         > >> --
> >         > >> Gaetano Bellanca - Department of Engineering - 
> >         > >> University of Ferrara
> >         > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >         > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >         > >> mailto:gaetano.bellanca at unife.it
> >         > >>
> >         > >> Education is expensive? They're trying ignorance!
> >         > >>
> >         > >>
> >         > >>
> >         > >
> >         > >
> >         > >
> >         >
> >         > --
> >         > Gaetano Bellanca - Department of Engineering - 
> >         > University of Ferrara
> >         > Via Saragat, 1 - 44100 - Ferrara - ITALY
> >         > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >         > mailto:gaetano.bellanca at unife.it
> >         >
> >         > Education is expensive? They're trying ignorance!
> >         >
> >         >
> >         >
> >
> >
> >
> >
> >     -- 
> >     "If you already know what recursion is, just remember 
> the answer.
> >     Otherwise, find someone who is standing closer to
> >     Douglas Hofstadter than you are; then ask him or her what
> >     recursion is." - Andrew Plotkin
> >
> 
> --
> Gaetano Bellanca - Department of Engineering - University of Ferrara
> Via Saragat, 1 - 44100 - Ferrara - ITALY
> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> mailto:gaetano.bellanca at unife.it
> 
> Education is expensive? They're trying ignorance!
> 
> 


