[mpich-discuss] Problems running mpi application on different CPUs

Rajeev Thakur thakur at mcs.anl.gov
Sat Oct 3 17:22:50 CDT 2009


Just run a few from the test/mpi/coll directory by hand. Run make in
that directory, then do mpiexec -n 5 name_of_executable. If they run,
there may be a bug in your code.
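
For example (exact test names vary by version):

    cd test/mpi/coll
    make
    mpiexec -n 5 ./allred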

Rajeev 

> -----Original Message-----
> From: Gaetano Bellanca [mailto:gaetano.bellanca at unife.it] 
> Sent: Saturday, October 03, 2009 2:58 AM
> To: Rajeev Thakur
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problems running mpi
> application on different CPUs
> 
> Dear Rajeev
> 
> The cpi test (and the other tests in the examples directory) works
> without problems.
> I tried to run make in the test directory and had this error message:
> 
> make[2]: Entering directory
> `/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test/mpid/ch3'
> make[2]: *** No rule to make target `../../../lib/lib.a',
> needed by `reorder'.  Stop.
> 
> How should I run the tests from the ../test directory?
> I tried with make testing, but I got a lot of unexpected
> output from mpd.
> 
> Regards
> 
> Gaetano
> 
> Rajeev Thakur wrote:
> > Try running the cpi example from the mpich2/examples directory. Try 
> > running some of the tests in test/mpi.
> >
> > Rajeev
> >
> >   
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov 
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano 
> >> Bellanca
> >> Sent: Tuesday, September 29, 2009 9:13 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] Problems running mpi application on
> >> different CPUs
> >>
> >> Dear Rajeev,
> >>
> >> I tested as indicated in the appendix of the installation guide
> >> using mpdcheck (and also changed the device from ch3:nemesis,
> >> thank you for the suggestion, Cye), but nothing changes in the
> >> behavior of the code.
> >> However I noted that, changing the machinefile so that the server
> >> machine is not the first entry, I get different behaviors. In
> >> particular, running with mpiexec -machinefile my_machinefile -n 6
> >> my_parallel_code and:
> >> 1) machinefile as follows
> >> server
> >> server
> >> client1
> >> client2
> >> client3
> >> client4
> >> client5
> >>
> >> the code starts with the previous error:
> >>
> >> Fatal error in MPI_Init: Other MPI error, error stack:
> >> MPIR_Init_thread(394): Initialization failed
> >> (unknown)(): Other MPI error
> >> rank 3 in job 8  c1_4545   caused collective abort of all ranks
> >>  exit status of rank 3: return code 1
> >>
> >> 2) machinefile as follows
> >> client2
> >> client3
> >> client4
> >> client5
> >> server
> >> server
> >>
> >> the code fails at startup with a SIGSEGV (segmentation fault) at
> >> the MPI_INIT line.
> >>
> >> 3) machinefile as follows
> >> client2
> >> client3
> >> server
> >> server
> >> client4
> >> client5
> >>
> >> the code starts regularly, but stops working in the first file
> >> writing procedure.
> >> It produces a file of 0 bytes, does not advance to any other
> >> procedure, and I have to kill it to terminate.
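> >>
> >> To isolate this, I could imagine a minimal C test along these
> >> lines (just a sketch of the pattern, not my actual code: MPI_Init,
> >> then each rank writes one small file in the nfs directory):
> >>
> >> #include <stdio.h>
> >> #include <mpi.h>
> >>
> >> int main(int argc, char *argv[])
> >> {
> >>     int rank, size;
> >>     char fname[64];
> >>     FILE *fp;
> >>
> >>     MPI_Init(&argc, &argv);         /* fails in cases 1 and 2 */
> >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>     printf("rank %d of %d: MPI_Init ok\n", rank, size);
> >>     fflush(stdout);
> >>
> >>     /* mimic the first writing procedure: one small file per rank */
> >>     snprintf(fname, sizeof(fname), "testfile.%d", rank);
> >>     fp = fopen(fname, "w");
> >>     if (fp != NULL) {
> >>         fprintf(fp, "hello from rank %d\n", rank);
> >>         fclose(fp);
> >>     }
> >>
> >>     MPI_Barrier(MPI_COMM_WORLD);    /* does it hang here, as in case 3? */
> >>     printf("rank %d: done\n", rank);
> >>     MPI_Finalize();
> >>     return 0;
> >> }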
> >>
> >> Could it be related to a synchronization/timeout issue between
> >> the different machines?
> >> On another cluster (all Pentium IV 3 GHz), the same program is
> >> slower to start when launched, but everything works fine.
> >>
> >> Regards.
> >>
> >> Gaetano
> >>
> >>
> >>
> >>
> >>
> >> Rajeev Thakur wrote:
> >>> ch3:sock won't perform as well as ch3:nemesis though.
> >>>  
> >>> Rajeev
> >>>
> >>>     ------------------------------------------------------------------------
> >>>     *From:* mpich-discuss-bounces at mcs.anl.gov
> >>>     [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *Cye Stoner
> >>>     *Sent:* Monday, September 28, 2009 4:32 PM
> >>>     *To:* mpich-discuss at mcs.anl.gov
> >>>     *Subject:* Re: [mpich-discuss] Problems running mpi application on
> >>>     different CPUs
> >>>
> >>>     When deploying MPICH2 to a small cluster, I noticed that many
> >>>     had problems with the "--with-device=ch3:nemesis" device.
> >>>     Try using the "--with-device=ch3:sock" device instead.
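> >>>
> >>>     For example, reconfiguring with something like this (the
> >>>     prefix path here is only illustrative):
> >>>
> >>>         ./configure --prefix=/opt/mpich2/sock --with-device=ch3:sock
> >>>         make && make install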
> >>>      
> >>>     Cye
> >>>
> >>>     On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
> >>>     <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
> >>>
> >>>         Try using the mpdcheck utility to debug, as described in
> >>>         the appendix of the installation guide. Pick one client
> >>>         and the server.
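> >>>
> >>>         Roughly (the appendix has the exact steps):
> >>>
> >>>             server$ mpdcheck -s               # prints a host and port
> >>>             client$ mpdcheck -c <host> <port> # then swap the roles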
> >>>
> >>>         Rajeev
> >>>
> >>>         > -----Original Message-----
> >>>         > From: mpich-discuss-bounces at mcs.anl.gov
> >>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
> >>>         > [mailto:mpich-discuss-bounces at mcs.anl.gov
> >>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
> >>>         > Gaetano Bellanca
> >>>         > Sent: Monday, September 28, 2009 6:00 AM
> >>>         > Cc: mpich-discuss at mcs.anl.gov
> >>>         <mailto:mpich-discuss at mcs.anl.gov>
> >>>         > Subject: Re: [mpich-discuss] Problems running mpi
> >>>         > application on different CPUs
> >>>         >
> >>>         > Dear Rajeev,
> >>>         >
> >>>         > thanks for your help. I disabled the firewall on the
> >>>         > server (the only one running) and tried every other
> >>>         > combination.
> >>>         > All the clients together run correctly, and the same
> >>>         > holds for the processors on the server separately.
> >>>         > The problem appears only when I mix processes on the
> >>>         > server and on the clients.
> >>>         >
> >>>         > When I run mpdtrace on the server, all the CPUs
> >>>         > respond correctly.
> >>>         > The same happens if I run 'hostname' in parallel.
> >>>         >
> >>>         > Probably it is a problem in my code, but it works on a
> >>>         > cluster of 10 Pentium IV PEs.
> >>>         > I discovered a 'strange behavior':
> >>>         > 1) running the code with the server as the first
> >>>         > machine of the pool, the code hangs with the previously
> >>>         > communicated error;
> >>>         > 2) if I put the server as the second machine of the
> >>>         > pool, the code starts and works regularly up to the
> >>>         > writing procedures, opens the first file, and then
> >>>         > waits indefinitely for something.
> >>>         >
> >>>         > Should I compile mpich2 with some particular device?
> >>>         > I have nemesis at the moment.
> >>>         > I'm using this for the mpich2 compilation:
> >>>         > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
> >>>         > --enable-f90 --enable-fast --enable-traceback --with-mpe
> >>>         > --enable-f90modules --enable-cache --enable-romio
> >>>         > --with-file-system=nfs+ufs+pvfs2
> >>>         > --with-device=ch3:nemesis --with-pvfs2=/usr/local
> >>>         > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/
> >>>         > --with-pm=mpd:hydra
> >>>         >
> >>>         > Regards
> >>>         >
> >>>         > Gaetano
> >>>         >
> >>>         > Rajeev Thakur wrote:
> >>>         > > Try running on smaller subsets of the machines to
> >>>         > > debug the problem. It is possible that a process on
> >>>         > > some machine cannot connect to another because of
> >>>         > > some firewall settings.
> >>>         > >
> >>>         > > Rajeev
> >>>         > >
> >>>         > >
> >>>         > >> -----Original Message-----
> >>>         > >> From: mpich-discuss-bounces at mcs.anl.gov
> >>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>
> >>>         > >> [mailto:mpich-discuss-bounces at mcs.anl.gov
> >>>         <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
> >>>         > Gaetano Bellanca
> >>>         > >> Sent: Saturday, September 26, 2009 7:10 AM
> >>>         > >> To: mpich-discuss at mcs.anl.gov
> >>>         <mailto:mpich-discuss at mcs.anl.gov>
> >>>         > >> Subject: [mpich-discuss] Problems running mpi
> >>>         > >> application on different CPUs
> >>>         > >>
> >>>         > >> Hi,
> >>>         > >>
> >>>         > >> I'm sorry, but I posted my previous message with
> >>>         > >> the wrong Subject!
> >>>         > >>
> >>>         > >> I have a small cluster of:
> >>>         > >> a) 1 server: dual processor / quad core Intel(R)
> >>>         > >> Xeon(R) CPU E5345
> >>>         > >> b) 4 clients: single processor / dual core Intel(R)
> >>>         > >> Core(TM)2 Duo CPU E8400
> >>>         > >> connected with a 1 Gbit/s ethernet network.
> >>>         > >>
> >>>         > >> I compiled mpich2-1.1.1p1 on the first system (a)
> >>>         > >> and share mpich on the other computers via nfs. I
> >>>         > >> have mpd running as root on all the computers
> >>>         > >> (Ubuntu 8.04, kernel 2.6.24-24-server).
> >>>         > >>
> >>>         > >> When I run my code in parallel on the first system,
> >>>         > >> it works correctly; the same happens running the
> >>>         > >> same code in parallel on the other computers (always
> >>>         > >> running the code from the server). On the contrary,
> >>>         > >> running the code using processors from both the
> >>>         > >> server and the clients at the same time with the
> >>>         > >> command:
> >>>         > >>
> >>>         > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
> >>>         > >>
> >>>         > >> I receive this error message:
> >>>         > >>
> >>>         > >> Fatal error in MPI_Init: Other MPI error, error stack:
> >>>         > >> MPIR_Init_thread(394): Initialization failed
> >>>         > >> (unknown)(): Other MPI error
> >>>         > >> rank 3 in job 8  c1_4545   caused collective abort of all ranks
> >>>         > >>  exit status of rank 3: return code 1
> >>>         > >>
> >>>         > >> Should I use some particular flags at compile time
> >>>         > >> or at run time?
> >>>         > >>
> >>>         > >> Regards.
> >>>         > >>
> >>>         > >> Gaetano
> >>>         > >>
> >>>         > >> --
> >>>         > >> Gaetano Bellanca - Department of Engineering -
> >>>         > >> University of Ferrara
> >>>         > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>>         > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >>>         > >> mailto:gaetano.bellanca at unife.it
> >>>         > >>
> >>>         > >> Education is expensive? They are trying ignorance!
> >>>         > >>
> >>>         > >>
> >>>         > >>
> >>>         > >
> >>>         > >
> >>>         > >
> >>>         >
> >>>         > --
> >>>         > Gaetano Bellanca - Department of Engineering -
> >>>         > University of Ferrara
> >>>         > Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>>         > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >>>         > mailto:gaetano.bellanca at unife.it
> >>>         >
> >>>         > Education is expensive? They are trying ignorance!
> >>>         >
> >>>         >
> >>>         >
> >>>
> >>>
> >>>
> >>>
> >>>     -- 
> >>>     "If you already know what recursion is, just remember
> >>>       
> >> the answer.
> >>     
> >>>     Otherwise, find someone who is standing closer to
> >>>     Douglas Hofstadter than you are; then ask him or her what
> >>>     recursion is." - Andrew Plotkin
> >>>
> >> --
> >> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >> mailto:gaetano.bellanca at unife.it
> >>
> >> Education is expensive? They are trying ignorance!
> >>
> >>
> >>     
> >
> >
> >   
> 
> --
> Gaetano Bellanca - Department of Engineering - University of Ferrara
> Via Saragat, 1 - 44100 - Ferrara - ITALY
> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> mailto:gaetano.bellanca at unife.it
> 
> Education is expensive? They are trying ignorance!
> 
> 


