[mpich-discuss] Problems running mpi application on different CPUs
Cye Stoner
stonerc at gmail.com
Mon Sep 28 16:32:26 CDT 2009
When deploying MPICH2 to a small cluster, I noticed that many machines had
problems with the "--with-device=ch3:nemesis" option.
Try configuring with the "--with-device=ch3:sock" device instead.
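A minimal sketch of that rebuild, for reference (the install prefix below
just mirrors the configure line quoted further down; adjust it to your setup):

  ./configure --prefix=/opt/mpich2/1.1/intel11.1 --with-device=ch3:sock
  make && make install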
Cye
On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> Try using the mpdcheck utility to debug as described in the appendix of
> the installation guide. Pick one client and the server.
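> A minimal sketch of that test (mpdcheck prints the actual hostname and
> port; the placeholders below are illustrative):
>
>   server$ mpdcheck -s                          # prints the server's hostname and a port
>   client$ mpdcheck -c <server_hostname> <port> # run on the client; connects to that host/port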
>
> Rajeev
>
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano Bellanca
> > Sent: Monday, September 28, 2009 6:00 AM
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] Problems running mpi application on different CPUs
> >
> > Dear Rajeev,
> >
> > Thanks for your help. I disabled the firewall on the server (the only
> > machine running one) and tried every other combination.
> > All the clients together run correctly, and so do the processors on
> > the server by themselves.
> > The problem appears only when I mix processes on the server and on
> > the clients.
> >
> > When I run mpdtrace on the server, all the CPUs respond correctly.
> > The same happens if I run 'hostname' in parallel.
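> > For reference, those checks in command form (the rank count below is
> > illustrative):
> >
> >   mpdtrace -l            # list the hosts in the mpd ring, with ports
> >   mpiexec -n 8 hostname  # each rank prints the host it runs on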
> >
> > It is probably a problem in my code, but the code runs fine on a
> > cluster of 10 Pentium IV PEs.
> > I discovered some 'strange behavior':
> > 1) running with the server as the first machine of the pool, the code
> > hangs with the previously reported error;
> > 2) if I put the server as the second machine of the pool, the code
> > starts and runs correctly up to the writing procedures, opens the
> > first file, and then waits indefinitely for something.
> >
> > Should I compile mpich2 with some particular channel? I have
> > nemesis at the moment.
> > I'm using this to configure mpich2:
> > ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx \
> >     --enable-f90 --enable-fast --enable-traceback --with-mpe \
> >     --enable-f90modules --enable-cache --enable-romio \
> >     --with-file-system=nfs+ufs+pvfs2 --with-device=ch3:nemesis \
> >     --with-pvfs2=/usr/local \
> >     --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
> >
> > Regards
> >
> > Gaetano
> >
> > Rajeev Thakur wrote:
> > > Try running on smaller subsets of the machines to debug the
> > > problem. It is possible that a process on some machine cannot
> > > connect to another because of some firewall settings.
> > >
> > > Rajeev
> > >
> > >
> > >> -----Original Message-----
> > >> From: mpich-discuss-bounces at mcs.anl.gov
> > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano Bellanca
> > >> Sent: Saturday, September 26, 2009 7:10 AM
> > >> To: mpich-discuss at mcs.anl.gov
> > >> Subject: [mpich-discuss] Problems running mpi application on different CPUs
> > >>
> > >> Hi,
> > >>
> > >> I'm sorry, but I posted my previous message with the wrong Subject!!!
> > >>
> > >> I have a small cluster of:
> > >> a) 1 server: dual-processor / quad-core Intel(R) Xeon(R) CPU E5345;
> > >> b) 4 clients: single-processor / dual-core Intel(R) Core(TM)2 Duo
> > >> CPU E8400, connected with a 1 Gbit/s Ethernet network.
> > >>
> > >> I compiled mpich2-1.1.1p1 on the first system (a) and share the
> > >> mpich installation with the other computers via NFS. I have mpd
> > >> running as root on all the computers (Ubuntu 8.04, kernel
> > >> 2.6.24-24-server).
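> > >>
> > >> A typical way to start such a ring, for reference (the host names
> > >> in mpd.hosts below are illustrative):
> > >>
> > >>   printf "client1\nclient2\nclient3\nclient4\n" > mpd.hosts
> > >>   mpdboot -n 5 -f mpd.hosts   # local mpd on the server plus the 4 clients
> > >>   mpdtrace                    # should list all 5 hosts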
> > >>
> > >> When I run my code in parallel on the first system, it works
> > >> correctly; the same happens when I run the same code in parallel on
> > >> the other computers (always launching from the server). On the
> > >> contrary, when I run the code using processors from both the server
> > >> and the clients at the same time with the command:
> > >>
> > >> mpiexec -machinefile machinefile -n 4 my_parallel_code
> > >>
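> > >> where the machinefile simply lists the hosts to use, one per line,
> > >> optionally with a process count (host names below are illustrative):
> > >>
> > >>   server:8
> > >>   client1:2
> > >>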
> > >> I receive this error message:
> > >>
> > >> Fatal error in MPI_Init: Other MPI error, error stack:
> > >> MPIR_Init_thread(394): Initialization failed
> > >> (unknown)(): Other MPI error
> > >> rank 3 in job 8 c1_4545 caused collective abort of all ranks
> > >> exit status of rank 3: return code 1
> > >>
> > >> Should I use some particular flags at compile time or at run time?
> > >>
> > >> Regards.
> > >>
> > >> Gaetano
> > >>
> > >> --
> > >> Gaetano Bellanca - Department of Engineering - University of Ferrara
> > >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> > >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> > >> mailto:gaetano.bellanca at unife.it
> > >>
> > >> Education is expensive? They're trying with ignorance!
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> > --
> > Gaetano Bellanca - Department of Engineering - University of Ferrara
> > Via Saragat, 1 - 44100 - Ferrara - ITALY
> > Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> > mailto:gaetano.bellanca at unife.it
> >
> > Education is expensive? They're trying with ignorance!
> >
> >
> >
>
>
--
"If you already know what recursion is, just remember the answer.
Otherwise, find someone who is standing closer to Douglas Hofstadter
than you are; then ask him or her what recursion is." - Andrew Plotkin