[mpich-discuss] Problems running mpi application on different CPUs
Rajeev Thakur
thakur at mcs.anl.gov
Tue Sep 29 10:07:20 CDT 2009
Try running the cpi example from the mpich2/examples directory, and try
running some of the tests in test/mpi.
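For example (paths and the machinefile name below are only
illustrative), from the build tree:

  cd mpich2-1.1.1p1/examples
  mpiexec -machinefile my_machinefile -n 6 ./cpi

cpi prints which host each process runs on and an approximation of
pi, so it exercises both startup and communication across all the
machines in the machinefile.
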
Rajeev
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano Bellanca
> Sent: Tuesday, September 29, 2009 9:13 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problems running mpi application on
> different CPUs
>
> Dear Rajeev,
>
> I tested as described in the appendix of the installation guide
> using mpdcheck (and also tried changing the device, using
> ch3:nemesis; thank you for the suggestion, Cye), but nothing
> changes in the behavior of the code.
> However, I noticed that if I change the machinefile so that the
> server machine is not the first entry, I get different behaviors.
> In particular, running with mpiexec -machinefile
> my_machinefile -n 6 my_parallel_code and:
> 1) machinefile as follows:
> server
> server
> client1
> client2
> client3
> client4
> client5
>
> the code fails at startup with the previous error:
>
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(394): Initialization failed
> (unknown)(): Other MPI error
> rank 3 in job 8 c1_4545 caused collective abort of all ranks
> exit status of rank 3: return code 1
>
> 2) machinefile as follows:
> client2
> client3
> client4
> client5
> server
> server
>
> the code fails with a SIGSEGV (segmentation fault) at the
> MPI_INIT call
>
> 3) machinefile as follows:
> client2
> client3
> server
> server
> client4
> client5
>
> the code starts normally, but stops working in the first
> file-writing procedure.
> It produces a 0-byte file, does not advance to any other
> procedure, and I have to kill it to terminate.
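>
> To separate MPI startup from the application, I could run a
> minimal C test like this sketch (compiled with mpicc and launched
> with the same machinefiles):
>
> #include <mpi.h>
> #include <stdio.h>
>
> /* Minimal test: initialize MPI, report the host of each rank,
>    synchronize, and shut down. If this also fails or hangs, the
>    problem is in the MPI setup rather than in the application. */
> int main(int argc, char *argv[])
> {
>     int rank, size, len;
>     char name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(name, &len);
>     printf("rank %d of %d on %s\n", rank, size, name);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }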
>
> Could it be related to synchronization or a timeout between the
> different machines?
> On another cluster (all Pentium IV 3 GHz), the same program is
> slower to start when launched, but everything works fine.
>
> Regards.
>
> Gaetano
>
>
>
>
>
> Rajeev Thakur wrote:
> > ch3:sock won't perform as well as ch3:nemesis though.
> >
> > Rajeev
> >
> >
> ------------------------------------------------------------------------
> > *From:* mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
> > *Sent:* Monday, September 28, 2009 4:32 PM
> > *To:* mpich-discuss at mcs.anl.gov
> > *Subject:* Re: [mpich-discuss] Problems running mpi application on
> > different CPUs
> >
> > When deploying MPICH2 to a small cluster, I noticed that many
> > had problems with "--with-device=ch3:nemesis".
> > Try using the "--with-device=ch3:sock" device instead.
> >
> > Cye
> >
> > On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
> > <thakur at mcs.anl.gov> wrote:
> >
> > Try using the mpdcheck utility to debug, as described in the
> > appendix of the installation guide. Pick one client and the server.
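> >
> > The sequence is roughly (hostnames and the port are placeholders;
> > mpdcheck -s prints the actual values to use):
> >
> > server$ mpdcheck -s
> > client$ mpdcheck -c server_hostname 1234
> >
> > and then again with the roles of the two machines reversed.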
> >
> > Rajeev
> >
> > > -----Original Message-----
> > > From: mpich-discuss-bounces at mcs.anl.gov
> > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > > Gaetano Bellanca
> > > Sent: Monday, September 28, 2009 6:00 AM
> > > Cc: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [mpich-discuss] Problems running mpi application
> > > on different CPUs
> > >
> > > Dear Rajeev,
> > >
> > > thanks for your help. I disabled the firewall on the server
> > > (the only one running) and tried every other combination.
> > > All the clients together run correctly, and the same is true
> > > for the processors on the server alone.
> > > The problem appears only when I mix processes on the server
> > > and on the clients.
> > >
> > > When I run mpdtrace on the server, all the CPUs are
> > > responding correctly.
> > > The same happens if I run 'hostname' in parallel.
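> > > That is, roughly (machinefile name as before):
> > >
> > > mpdtrace -l
> > > mpiexec -machinefile my_machinefile -n 6 hostname
> > >
> > > Every host appears in the ring and every rank prints its
> > > hostname as expected.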
> > >
> > > It is probably a problem with my code, but the code works on
> > > a cluster of 10 Pentium IV PEs.
> > > I discovered a 'strange behavior':
> > > 1) running the code with the server as the first machine of
> > > the pool, the code hangs with the previously reported error;
> > > 2) if I put the server as the second machine of the pool, the
> > > code starts and works regularly up to the writing procedures,
> > > opens the first file and then waits indefinitely for
> > > something;
> > >
> > > Should I compile mpich2 with some particular channel? I have
> > > nemesis at the moment.
> > > I'm using this for the mpich2 compilation:
> > > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
> > > --enable-f90 --enable-fast --enable-traceback --with-mpe
> > > --enable-f90modules --enable-cache --enable-romio
> > > --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis
> > > --with-pvfs2=/usr/local
> > > --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
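> > >
> > > As a cross-check (the prefix path below is only illustrative),
> > > a stripped-down build would show whether one of the extra
> > > options is involved:
> > >
> > > ./configure --prefix=/tmp/mpich2-sock --with-device=ch3:sock \
> > >     --with-pm=mpd
> > > make && make install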
> > >
> > > Regards
> > >
> > > Gaetano
> > >
> > > Rajeev Thakur wrote:
> > > > Try running on smaller subsets of the machines to debug the
> > > > problem. It is possible that a process on some machine
> > > > cannot connect to another because of some firewall settings.
> > > >
> > > > Rajeev
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: mpich-discuss-bounces at mcs.anl.gov
> > > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > > >> Gaetano Bellanca
> > > >> Sent: Saturday, September 26, 2009 7:10 AM
> > > >> To: mpich-discuss at mcs.anl.gov
> > > >> Subject: [mpich-discuss] Problems running mpi application on
> > > >> different CPUs
> > > >>
> > > >> Hi,
> > > >>
> > > >> I'm sorry, but I posted my previous message with the wrong
> > > >> Subject!
> > > >>
> > > >> I have a small cluster of:
> > > >> a) 1 server: dual-processor / quad-core Intel(R) Xeon(R)
> > > >> CPU E5345;
> > > >> b) 4 clients: single-processor / dual-core Intel(R)
> > > >> Core(TM)2 Duo CPU E8400;
> > > >> connected with a 1 Gbit/s Ethernet network.
> > > >>
> > > >> I compiled mpich2-1.1.1p1 on the first system (a) and share
> > > >> mpich with the other computers via NFS. I have mpd running
> > > >> as root on all the computers (Ubuntu 8.04, kernel
> > > >> 2.6.24-24-server).
> > > >>
> > > >> When I run my code in parallel on the first system, it
> > > >> works correctly; the same happens when running the code in
> > > >> parallel on the other computers (always launching it from
> > > >> the server). By contrast, when running the code using
> > > >> processors from both the server and the clients at the same
> > > >> time with the command:
> > > >>
> > > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
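> > > >>
> > > >> where machinefile lists one host per line, optionally with
> > > >> a CPU count; as an illustration only:
> > > >>
> > > >> server:2
> > > >> client1
> > > >> client2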
> > > >>
> > > >> I receive this error message:
> > > >>
> > > >> Fatal error in MPI_Init: Other MPI error, error stack:
> > > >> MPIR_Init_thread(394): Initialization failed
> > > >> (unknown)(): Other MPI error
> > > >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
> > > >> exit status of rank 3: return code 1
> > > >>
> > > >> Should I use some particular flags at compile time or at
> > > >> run time?
> > > >>
> > > >> Regards.
> > > >>
> > > >> Gaetano
> > > >>
> > > >> --
> > > >> Gaetano Bellanca - Department of Engineering - University
> > > >> of Ferrara
> > > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> > > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> > > >> mailto:gaetano.bellanca at unife.it
> > > >>
> > > >> Education is expensive? They are trying ignorance!
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > >
> > > --
> > > Gaetano Bellanca - Department of Engineering - University of
> > > Ferrara
> > > Via Saragat, 1 - 44100 - Ferrara - ITALY
> > > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> > > mailto:gaetano.bellanca at unife.it
> > >
> > > Education is expensive? They are trying ignorance!
> > >
> > >
> > >
> >
> >
> >
> >
> > --
> > "If you already know what recursion is, just remember the
> > answer. Otherwise, find someone who is standing closer to
> > Douglas Hofstadter than you are; then ask him or her what
> > recursion is." - Andrew Plotkin
> >
>
> --
> Gaetano Bellanca - Department of Engineering - University of Ferrara
> Via Saragat, 1 - 44100 - Ferrara - ITALY
> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> mailto:gaetano.bellanca at unife.it
>
> Education is expensive? They are trying ignorance!
>
>