[mpich-discuss] Problems running mpi application on different CPUs

Gaetano Bellanca gaetano.bellanca at unife.it
Tue Sep 29 09:13:24 CDT 2009


Dear Rajeev,

I tested with mpdcheck as described in the appendix of the 
installation guide (and also tried changing the device to ch3:sock; 
thank you for the suggestion, Cye), but nothing changes in the 
behavior of the code.
However, I noticed that when the machinefile does not list the server 
machine first, I get different behaviors. In particular, running with

mpiexec -machinefile my_machinefile -n 6 my_parallel_code

I observe the following:
1) machinefile as follows
server
server
client1
client2
client3
client4
client5

the code fails at startup with the previous error:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(394): Initialization failed
(unknown)(): Other MPI error
rank 3 in job 8  c1_4545   caused collective abort of all ranks
 exit status of rank 3: return code 1

2) machinefile as follows
client2
client3
client4
client5
server
server

the code fails with a SIGSEGV (segmentation fault) at the MPI_INIT 
call.

3) machinefile as follows
client2
client3
server
server
client4
client5

the code starts normally, but hangs in the first file-writing 
procedure: it creates a file of 0 bytes, makes no further progress, 
and I have to kill it to terminate.
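
To separate the startup problem from my application, a minimal test 
that performs only the steps that fail (initialization, a 
synchronization, one small write) should help. A sketch in C (the 
file name is just a placeholder, this is not my real code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);            /* cases 1 and 2 fail here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);       /* can all ranks communicate? */

    if (rank == 0) {                   /* mimic the first file write */
        FILE *f = fopen("test_out.dat", "w");
        if (f) {
            fprintf(f, "hello from rank 0\n");
            fclose(f);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);       /* case 3 hangs around here */
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the same machinefiles as above, 
this would show whether the problem is in MPI startup itself or in my 
code.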

Could it be related to some synchronization or timeout issue between 
the different machines?
On another cluster (all Pentium IV 3 GHz), the same program is slower 
to start, but everything works fine.

Regards.

Gaetano





Rajeev Thakur wrote:
> ch3:sock won't perform as well as ch3:nemesis though.
>  
> Rajeev
>
>     ------------------------------------------------------------------------
>     From: mpich-discuss-bounces at mcs.anl.gov
>     [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Cye Stoner
>     Sent: Monday, September 28, 2009 4:32 PM
>     To: mpich-discuss at mcs.anl.gov
>     Subject: Re: [mpich-discuss] Problems running mpi application on
>     different CPUs
>
>     When deploying MPICH2 to a small cluster, I noticed that many had
>     problems with "--with-device=ch3:nemesis".
>     Try using the "--with-device=ch3:sock" device instead.
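>
>     For example, a rebuild along these lines (the prefix is just an
>     example):
>
>         ./configure --prefix=/opt/mpich2-sock --with-device=ch3:sock
>         make && make install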
>      
>     Cye
>
>     On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
>     <thakur at mcs.anl.gov> wrote:
>
>         Try using the mpdcheck utility to debug as described in the
>         appendix of
>         the installation guide. Pick one client and the server.
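>
>         For example, roughly (going from memory of the appendix; the
>         hostname and port are whatever mpdcheck prints):
>
>             server$ mpdcheck -s
>             client$ mpdcheck -c <hostname> <port>
>
>         and then repeat with the roles of the two machines reversed.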
>
>         Rajeev
>
>         > -----Original Message-----
>         > From: mpich-discuss-bounces at mcs.anl.gov
>         > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>         > Gaetano Bellanca
>         > Sent: Monday, September 28, 2009 6:00 AM
>         > Cc: mpich-discuss at mcs.anl.gov
>         > Subject: Re: [mpich-discuss] Problems running mpi application
>         > on different CPUs
>         >
>         > Dear Rajeev,
>         >
>         > thanks for your help. I disabled the firewall on the server
>         > (the only one running a firewall) and tried every other
>         > combination.
>         > All the clients together run correctly, and the same holds
>         > for the processors on the server alone.
>         > The problem appears only when I mix processes on the server
>         > and on the clients.
>         >
>         > When I run mpdtrace on the server, all the CPUs respond
>         > correctly.
>         > The same happens if I run 'hostname' in parallel.
>         >
>         > Probably it is a problem with my code, but it works on a
>         > cluster of 10 Pentium IV PEs.
>         > I discovered a 'strange behavior':
>         > 1) running the code with the server as the first machine of
>         > the pool, the code hangs with the previously reported error;
>         > 2) if I put the server as the second machine of the pool, the
>         > code starts and runs normally up to the writing procedures,
>         > opens the first file, and then waits indefinitely for
>         > something;
>         >
>         > Should I compile mpich2 with some particular device? I have
>         > nemesis at the moment.
>         > I'm using this for the mpich2 compilation:
>         >
>         > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
>         > --enable-f90 --enable-fast --enable-traceback --with-mpe
>         > --enable-f90modules --enable-cache --enable-romio
>         > --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis
>         > --with-pvfs2=/usr/local
>         > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/
>         > --with-pm=mpd:hydra
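>         >
>         > (If useful: running mpich2version should confirm the device
>         > and options the installation was actually built with.)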
>         >
>         > Regards
>         >
>         > Gaetano
>         >
>         > Rajeev Thakur wrote:
>         > > Try running on smaller subsets of the machines to debug
>         > > the problem. It is possible that a process on some machine
>         > > cannot connect to another because of some firewall
>         > > settings.
>         > >
>         > > Rajeev
>         > >
>         > >
>         > >> -----Original Message-----
>         > >> From: mpich-discuss-bounces at mcs.anl.gov
>         > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>         > >> Gaetano Bellanca
>         > >> Sent: Saturday, September 26, 2009 7:10 AM
>         > >> To: mpich-discuss at mcs.anl.gov
>         > >> Subject: [mpich-discuss] Problems running mpi application on
>         > >> different CPUs
>         > >>
>         > >> Hi,
>         > >>
>         > >> I'm sorry, but I posted my previous message with the
>         > >> wrong subject!
>         > >>
>         > >> I have a small cluster of:
>         > >> a) 1 server: dual-processor / quad-core Intel(R) Xeon(R)
>         > >> CPU E5345;
>         > >> b) 4 clients: single-processor / dual-core Intel(R)
>         > >> Core(TM)2 Duo CPU E8400;
>         > >> all connected with a 1 Gbit/s Ethernet network.
>         > >>
>         > >> I compiled mpich2-1.1.1p1 on the first system (a) and
>         > >> share mpich2 with the other computers via NFS. I have mpd
>         > >> running as root on all the computers (Ubuntu 8.04, kernel
>         > >> 2.6.24-24-server).
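>         > >>
>         > >> (For reference, I bring the mpd ring up roughly like this,
>         > >> where mpd.hosts is just my name for the file listing the
>         > >> four clients:
>         > >>
>         > >>     mpdboot -n 5 -f mpd.hosts
>         > >>     mpdtrace -l
>         > >>
>         > >> mpdboot counts the local machine too, hence -n 5.)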
>         > >>
>         > >> When I run my code in parallel on the first system, it
>         > >> works correctly; the same happens when I run the code in
>         > >> parallel on the other computers (always launching it from
>         > >> the server). However, when I run the code using processors
>         > >> from both the server and the clients at the same time with
>         > >> the command:
>         > >>
>         > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
>         > >>
>         > >> I receive this error message:
>         > >>
>         > >> Fatal error in MPI_Init: Other MPI error, error stack:
>         > >> MPIR_Init_thread(394): Initialization failed
>         > >> (unknown)(): Other MPI error
>         > >> rank 3 in job 8  c1_4545   caused collective abort of all
>         > >> ranks
>         > >>  exit status of rank 3: return code 1
>         > >>
>         > >> Should I use some particular flags at compile time or at
>         > >> run time?
>         > >>
>         > >> Regards.
>         > >>
>         > >> Gaetano
>         > >>
>
>
>
>
>     -- 
>     "If you already know what recursion is, just remember the answer.
>     Otherwise, find someone who is standing closer to
>     Douglas Hofstadter than you are; then ask him or her what
>     recursion is." - Andrew Plotkin
>

-- 
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it

Is education expensive? They are trying ignorance!


