[mpich-discuss] Problems running mpi application on different CPUs

Rajeev Thakur thakur at mcs.anl.gov
Mon Sep 28 17:01:22 CDT 2009


ch3:sock won't perform as well as ch3:nemesis though.
 
Rajeev


  _____  

From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Cye Stoner
Sent: Monday, September 28, 2009 4:32 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Problems running mpi application on
different CPUs


When deploying MPICH2 to a small cluster, I noticed that many machines
had problems with the "--with-device=ch3:nemesis" option.
Try using the "--with-device=ch3:sock" device instead.
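
For reference, a minimal rebuild with the sock channel might look like
the following (the install prefix is only an example; keep whatever
other options you need from your original configure line):

  ./configure --prefix=/opt/mpich2/1.1/sock --with-device=ch3:sock
  make
  make install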
 
Cye


On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur <thakur at mcs.anl.gov>
wrote:


Try using the mpdcheck utility to debug as described in the appendix of
the installation guide. Pick one client and the server.
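
As a rough sketch (hostnames and port are placeholders), the pairwise
test from the guide works like this: start the server side on one
machine, note the host and port it prints, then probe it from the other:

  # on the server: prints the host and port it listens on
  mpdcheck -s
  # on the chosen client, with the values printed above
  mpdcheck -c <server_host> <port>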


Rajeev

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> Gaetano Bellanca

> Sent: Monday, September 28, 2009 6:00 AM
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Problems running mpi application
> on different CPUs
>
> Dear Rajeev,
>
> thanks for your help. I disabled the firewall on the server (the only
> one running) and tried every other combination.
> All the clients together run correctly. The same holds for the
> processors on the server on their own.
> The problem appears only when I mix processes on the server and on
> the clients.
>
> When I run mpdtrace on the server, all the CPUs respond correctly.
> The same happens if I run 'hostname' in parallel.
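>
> As a quick sanity check along these lines (the process count and
> machinefile are just examples), commands like these all succeed:
>
>   mpdtrace -l
>   mpiexec -machinefile machinefile -n 8 hostname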
>
> Probably it is a problem with my code, but it works on a cluster of 10
> Pentium IV PEs.
> I discovered a 'strange behavior':
> 1) running the code with the server as the first machine of the pool,
> the code hangs with the previously reported error;
> 2) if I put the server as the second machine of the pool, the code
> starts and works regularly up to the writing procedures, opens the
> first file, and then waits indefinitely for something.
>
> Should I compile mpich2 with some particular communication device? I
> have nemesis at the moment.
> I'm using this for mpich2 compilation:
> ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx
> --enable-f90
> --enable-fast --enable-traceback --with-mpe --enable-f90modules
> --enable-cache --enable-romio --with-file-system=nfs+ufs+pvfs2
> --with-device=ch3:nemesis --with-pvfs2=/usr/local
> --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
>
> Regards
>
> Gaetano
>
> Rajeev Thakur ha scritto:
> > Try running on smaller subsets of the machines to debug the
> > problem. It is possible that a process on some machine cannot connect
> > to another because of some firewall settings.
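> >
> > As a concrete sketch (the hostnames and pair file are only examples),
> > one test would be the server paired with a single client:
> >
> >   echo -e "server\nclient1" > machinefile.pair
> >   mpiexec -machinefile machinefile.pair -n 2 hostname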
> >
> > Rajeev
> >
> >
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> >> Gaetano Bellanca
> >> Sent: Saturday, September 26, 2009 7:10 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: [mpich-discuss] Problems running mpi application on
> >> different CPUs
> >>
> >> Hi,
> >>
> >> I'm sorry, but I posted my previous message with a wrong Subject!!!
> >>
> >> I have a small cluster of:
> >> a) 1 server: dual-processor / quad-core Intel(R) Xeon(R) CPU E5345
> >> b) 4 clients: single-processor / dual-core Intel(R) Core(TM)2 Duo
> >> CPU E8400, connected with a 1 Gbit/s Ethernet network.
> >>
> >> I compiled mpich2-1.1.1p1 on the first system (a) and shared mpich
> >> with the other computers via NFS. I have mpd running as root on all
> >> the computers (Ubuntu 8.04, kernel 2.6.24-24-server).
> >>
> >> When I run my code in parallel on the first system, it works
> >> correctly; the same happens when running the same code in parallel
> >> on the other computers (always launching the code from the server).
> >> On the contrary, when running the code using processors from both
> >> the server and the clients at the same time with the command:
> >>
> >> mpiexec -machinefile machinefile -n 4 my_parallel_code
> >>
> >> I receive this error message:
> >>
> >> Fatal error in MPI_Init: Other MPI error, error stack:
> >> MPIR_Init_thread(394): Initialization failed
> >> (unknown)(): Other MPI error
> >> rank 3 in job 8  c1_4545   caused collective abort of all ranks
> >>  exit status of rank 3: return code 1
> >>
> >> Should I use some particular flags in compilation or at run time?
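> >>
> >> In case it helps, a minimal test program (just a sketch; 'mpitest'
> >> is an example name, not my application) that could isolate whether
> >> the failure is in the setup rather than in my code:
> >>
> >>   #include <mpi.h>
> >>   #include <stdio.h>
> >>
> >>   /* Minimal check: if this also fails in MPI_Init when mixing
> >>      server and client hosts, the problem is the installation or
> >>      network setup, not the application code. */
> >>   int main(int argc, char **argv)
> >>   {
> >>       int rank, size, len;
> >>       char name[MPI_MAX_PROCESSOR_NAME];
> >>
> >>       MPI_Init(&argc, &argv);
> >>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>       MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>       MPI_Get_processor_name(name, &len);
> >>       printf("rank %d of %d on %s\n", rank, size, name);
> >>       MPI_Finalize();
> >>       return 0;
> >>   }
> >>
> >> compiled and launched with the same machinefile:
> >>
> >>   mpicc mpitest.c -o mpitest
> >>   mpiexec -machinefile machinefile -n 4 ./mpitest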
> >>
> >> Regards.
> >>
> >> Gaetano
> >>
> >> --
> >> Gaetano Bellanca - Department of Engineering - University of Ferrara
> >> Via Saragat, 1 - 44100 - Ferrara - ITALY
> >> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> >> mailto:gaetano.bellanca at unife.it
> >>
> >> Education is expensive? They're trying ignorance!
> >>
> >>
> >>
> >
> >
> >
>
> --
> Gaetano Bellanca - Department of Engineering - University of Ferrara
> Via Saragat, 1 - 44100 - Ferrara - ITALY
> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
> mailto:gaetano.bellanca at unife.it
>
> Education is expensive? They're trying ignorance!
>
>
>






-- 
"If you already know what recursion is, just remember the answer.
Otherwise, find someone who is standing closer to
Douglas Hofstadter than you are; then ask him or her what recursion is."
- Andrew Plotkin

