[mpich-discuss] Problems running mpi application on different CPUs
Rajeev Thakur
thakur at mcs.anl.gov
Sat Oct 3 17:22:50 CDT 2009
Just run a few from the test/mpi/coll directory by hand. Run make in
that directory, then do mpiexec -n 5 name_of_executable. If they run
cleanly, the bug is probably in your code rather than in the MPICH2
installation.
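
For example (a minimal sketch; allred is one of the collective tests built
in that directory, and any other built test executable works the same way):

    cd test/mpi/coll
    make
    mpiexec -n 5 ./allred
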
Rajeev
> -----Original Message-----
> From: Gaetano Bellanca [mailto:gaetano.bellanca at unife.it]
> Sent: Saturday, October 03, 2009 2:58 AM
> To: Rajeev Thakur
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problems running mpi application on
> different CPUs
>
> Dear Rajeev
>
> The cpi test (and the other tests in the examples directory) works
> without problems.
> I tried to run make in the test directory and had this error message:
>
> make[2]: Entering directory
> `/home/bellanca/software/mpich2/1.1.1p1/mpich2-1.1.1p1_intel/test/mpid/ch3'
> make[2]: *** No rule to make target `../../../lib/lib.a', needed by
> `reorder'.  Stop.
>
> How should I run the tests from the ../test directory?
> I tried with make testing, but I got a lot of unexpected output from mpd.
>
> Regards
>
> Gaetano
>
> Rajeev Thakur wrote:
> > Try running the cpi example from the mpich2/examples directory. Try
> > running some of the tests in test/mpi.
> >
> > Rajeev
> >
> >
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano
> >> Bellanca
> >> Sent: Tuesday, September 29, 2009 9:13 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] Problems running mpi application on
> >> different CPUs
> >>
> >> Dear Rajeev,
> >>
> >> I tested as indicated in the appendix of the installation guide,
> >> using mpdcheck (and also tried changing the device, which is
> >> ch3:nemesis; thank you for the suggestion, Cye), but nothing changes
> >> in the behavior of the code.
> >> However, I noticed that when I change the machinefile so that the
> >> server machine is not listed first, I get different behaviors. In
> >> particular, running with mpiexec -machinefile my_machinefile -n 6
> >> my_parallel_code and:
> >> 1) machinefile as follows
> >> server
> >> server
> >> client1
> >> client2
> >> client3
> >> client4
> >> client5
> >>
> >> the code starts and then fails with the previous error:
> >>
> >> Fatal error in MPI_Init: Other MPI error, error stack:
> >> MPIR_Init_thread(394): Initialization failed
> >> (unknown)(): Other MPI error
> >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
> >> exit status of rank 3: return code 1
> >>
> >> 2) machinefile as follows
> >> client2
> >> client3
> >> client4
> >> client5
> >> server
> >> server
> >>
> >> the code aborts with a SIGSEGV (segmentation fault) at the line of
> >> the MPI_INIT call
> >>
> >> 3) machinefile as follows
> >> client2
> >> client3
> >> server
> >> server
> >> client4
> >> client5
> >>
> >> the code starts normally, but stops working in the first file-writing
> >> procedure. It produces a 0-byte file, does not advance to any other
> >> procedure, and I have to kill it to terminate.
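> >>
> >> (As a minimal sketch of an alternative, and treating the host names
> >> below as placeholders, an mpd machinefile can also assign several
> >> processes to one host with the host:nprocs syntax instead of listing
> >> the host twice:
> >>
> >>     server:2
> >>     client1:1
> >>     client2:1
> >>     client3:1
> >>     client4:1
> >>
> >> mpiexec -machinefile my_machinefile -n 6 my_parallel_code then places
> >> the ranks the same way as repeating the server line.)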
> >>
> >> Could it be related to synchronization/timeouts between the
> >> different machines?
> >> On another cluster (all Pentium IV 3 GHz), the same program is slower
> >> to start when launched, but everything works fine.
> >>
> >> Regards.
> >>
> >> Gaetano
> >>
> >>
> >>
> >>
> >>
> >> Rajeev Thakur wrote:
> >>
> >>> ch3:sock won't perform as well as ch3:nemesis though.
> >>>
> >>> Rajeev
> >>>
> >>>
> >>>
> >> ------------------------------------------------------------------------
> >>
> >>> *From:* mpich-discuss-bounces at mcs.anl.gov
> >>> [mailto:mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of* Cye Stoner
> >>> *Sent:* Monday, September 28, 2009 4:32 PM
> >>> *To:* mpich-discuss at mcs.anl.gov
> >>> *Subject:* Re: [mpich-discuss] Problems running mpi application on
> >>> different CPUs
> >>>
> >>> When deploying MPICH2 on a small cluster, I noticed that many people
> >>> had problems with the "--with-device=ch3:nemesis" device.
> >>> Try using the "--with-device=ch3:sock" device instead.
> >>>
> >>> Cye
> >>>
> >>> On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur
> >>> <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
> >>>
> >>> Try using the mpdcheck utility to debug, as described in the
> >>> appendix of the installation guide. Pick one client and the server.
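> >>>
> >>> (A minimal sketch of that procedure, with placeholder host names:
> >>> run mpdcheck as a temporary server on one machine, note the host and
> >>> port it prints, and point the other machine at it:
> >>>
> >>>     server$  mpdcheck -s                # prints a host name and a port
> >>>     client1$ mpdcheck -c server 45678   # use the printed host and port
> >>>
> >>> Then repeat with the roles reversed. If either direction hangs or
> >>> fails, the problem is in the network or firewall setup rather than in
> >>> the application.)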
> >>>
> >>> Rajeev
> >>>
> >>> > -----Original Message-----
> >>> > From: mpich-discuss-bounces at mcs.anl.gov
> >>> <mailto:mpich-discuss-bounces at mcs.anl.gov>
> >>> > [mailto:mpich-discuss-bounces at mcs.anl.gov
> >>> <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
> >>> > Gaetano Bellanca
> >>> > Sent: Monday, September 28, 2009 6:00 AM
> >>> > Cc: mpich-discuss at mcs.anl.gov <mailto:mpich-discuss at mcs.anl.gov>
> >>> > Subject: Re: [mpich-discuss] Problems running mpi application
> >>> > on different CPUs
> >>> >
> >>> > Dear Rajeev,
> >>> >
> >>> > thanks for your help. I disabled the firewall on the server (the
> >>> > only one where a firewall was running) and tried every other
> >>> > combination.
> >>> > All the clients together run correctly, and so do the processors on
> >>> > the server by themselves.
> >>> > The problem appears only when I mix processes on the server and on
> >>> > the clients.
> >>> >
> >>> > When I run mpdtrace on the server, all the CPUs respond
> >>> > correctly.
> >>> > The same happens if I run 'hostname' in parallel.
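> >>> >
> >>> > (A minimal sketch of that sanity check, assuming the same
> >>> > machinefile as above:
> >>> >
> >>> >     mpdtrace -l
> >>> >     mpiexec -machinefile machinefile -n 6 hostname
> >>> >
> >>> > Every rank should print its host name. Since hostname is not an MPI
> >>> > program, this only exercises process startup, which is why it can
> >>> > succeed while MPI_Init in a real MPI job still fails.)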
> >>> >
> >>> > Probably it is a problem of my code, but it works on a cluster of
> >>> > 10 Pentium IV PEs.
> >>> > I discovered a 'strange behavior':
> >>> > 1) running the code with the server as the first machine of the
> >>> > pool, the code hangs with the previously communicated error;
> >>> > 2) if I put the server as the second machine of the pool, the code
> >>> > starts and works regularly up to the writing procedures, opens the
> >>> > first file and then waits indefinitely for something;
> >>> >
> >>> > Should I compile mpich2 with some particular channel (device)? I
> >>> > have nemesis at the moment.
> >>> > I'm using this for the mpich2 compilation:
> >>> >
> >>> > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx \
> >>> >   --enable-f90 --enable-fast --enable-traceback --with-mpe \
> >>> >   --enable-f90modules --enable-cache --enable-romio \
> >>> >   --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis \
> >>> >   --with-pvfs2=/usr/local \
> >>> >   --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
> >>> >
> >>> > Regards
> >>> >
> >>> > Gaetano
> >>> >
> >>> > Rajeev Thakur wrote:
> >>> > > Try running on smaller subsets of the machines to debug the
> >>> > > problem. It is possible that a process on some machine cannot
> >>> > > connect to another because of some firewall settings.
> >>> > >
> >>> > > Rajeev
> >>> > >
> >>> > >
> >>> > >> -----Original Message-----
> >>> > >> From: mpich-discuss-bounces at mcs.anl.gov
> >>> <mailto:mpich-discuss-bounces at mcs.anl.gov>
> >>> > >> [mailto:mpich-discuss-bounces at mcs.anl.gov
> >>> <mailto:mpich-discuss-bounces at mcs.anl.gov>] On Behalf Of
> >>> > Gaetano Bellanca
> >>> > >> Sent: Saturday, September 26, 2009 7:10 AM
> >>> > >> To: mpich-discuss at mcs.anl.gov
> >>> <mailto:mpich-discuss at mcs.anl.gov>
> >>> > >> Subject: [mpich-discuss] Problems running mpi application
> >>> > >> on different CPUs
> >>> > >>
> >>> > >> Hi,
> >>> > >>
> >>> > >> I'm sorry, but I posted my previous message with the wrong
> >>> > >> subject!!!
> >>> > >>
> >>> > >> I have a small cluster of:
> >>> > >> a) 1 server: dual-processor / quad-core Intel(R) Xeon(R) CPU E5345
> >>> > >> b) 4 clients: single-processor / dual-core Intel(R) Core(TM)2 Duo
> >>> > >> CPU E8400
> >>> > >> connected with a 1 Gbit/s Ethernet network.
> >>> > >>
> >>> > >> I compiled mpich2-1.1.1p1 on the first system (a) and share the
> >>> > >> mpich2 installation with the other computers via NFS. I have mpd
> >>> > >> running as root on all the computers (Ubuntu 8.04, kernel
> >>> > >> 2.6.24-24-server).
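> >>> > >>
> >>> > >> (A minimal sketch of how such a ring is typically started from
> >>> > >> the server, assuming a hosts file named mpd.hosts that lists the
> >>> > >> four clients:
> >>> > >>
> >>> > >>     mpdboot -n 5 -f mpd.hosts
> >>> > >>     mpdtrace
> >>> > >>
> >>> > >> mpdboot counts the local machine, so -n 5 covers the server plus
> >>> > >> the four clients, and mpdtrace should then list all five hosts.)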
> >>> > >>
> >>> > >> When I run my code in parallel on the first system, it works
> >>> > >> correctly; the same happens when running it in parallel on the
> >>> > >> other computers (always launching from the server). On the
> >>> > >> contrary, when running the code using processors from both the
> >>> > >> server and the clients at the same time with the command:
> >>> > >>
> >>> > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
> >>> > >>
> >>> > >> I receive this error message:
> >>> > >>
> >>> > >> Fatal error in MPI_Init: Other MPI error, error stack:
> >>> > >> MPIR_Init_thread(394): Initialization failed
> >>> > >> (unknown)(): Other MPI error
> >>> > >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
> >>> > >> exit status of rank 3: return code 1
> >>> > >>
> >>> > >> Should I use some particular flags at compile time or at run
> >>> > >> time?
> >>> > >>
> >>> > >> Regards.
> >>> > >>
> >>> > >> Gaetano
> >>> > >>
> >>> > >> --
> >>> > >> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >>> > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>> > >> Voice (VoIP): +39 0532 974809  Fax: +39 0532 974870
> >>> > >> mailto:gaetano.bellanca at unife.it
> >>> > >>
> >>> > >> Education is expensive? They are trying ignorance!
> >>> > >>
> >>> > >>
> >>> > >>
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>> > --
> >>> > Gaetano Bellanca - Department of Engineering - University of Ferrara
> >>> > Via Saragat, 1 - 44100 - Ferrara - ITALY
> >>> > Voice (VoIP): +39 0532 974809  Fax: +39 0532 974870
> >>> > mailto:gaetano.bellanca at unife.it
> >>> >
> >>> > Education is expensive? They are trying ignorance!
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> "If you already know what recursion is, just remember
> >>>
> >> the answer.
> >>
> >>> Otherwise, find someone who is standing closer to
> >>> Douglas Hofstadter than you are; then ask him or her what
> >>> recursion is." - Andrew Plotkin
> >>>
> >>>
> >> --
> >> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >> Voice (VoIP): +39 0532 974809  Fax: +39 0532 974870
> >> mailto:gaetano.bellanca at unife.it
> >>
> >> Education is expensive? They are trying ignorance!
> >>
> >>
> >>
> >
> >
> >
>
> --
> Gaetano Bellanca - Department of Engineering - University of Ferrara
> Via Saragat, 1 - 44100 - Ferrara - ITALY
> Voice (VoIP): +39 0532 974809  Fax: +39 0532 974870
> mailto:gaetano.bellanca at unife.it
>
> Education is expensive? They are trying ignorance!
>
>