<div>When deploying MPICH2 to a small cluster, I noticed that many had problems with the "--with-device=ch3:nemesis" option.</div>
<div>Try using the "--with-device=ch3:sock" interface instead.</div>
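<div>As a rough sketch, assuming the other options from the configure line quoted further down stay the same and only the device changes, the rebuild would look something like this. The application should then be relinked against the new install.</div>
<pre>
# Same options as the quoted configure line; only --with-device differs
# (ch3:sock instead of ch3:nemesis). Paths are the ones from that message.
./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx --enable-f90 \
    --enable-fast --enable-traceback --with-mpe --enable-f90modules \
    --enable-cache --enable-romio --with-file-system=nfs+ufs+pvfs2 \
    --with-device=ch3:sock --with-pvfs2=/usr/local \
    --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra
make
make install
</pre>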
<div> </div>
<div>Cye<br><br></div>
<div class="gmail_quote">On Mon, Sep 28, 2009 at 12:13 PM, Rajeev Thakur <span dir="ltr"><<a href="mailto:thakur@mcs.anl.gov">thakur@mcs.anl.gov</a>></span> wrote:<br>
<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">Try using the mpdcheck utility to debug as described in the appendix of<br>the installation guide. Pick one client and the server.<br>
<div class="im"><br>Rajeev<br><br>> -----Original Message-----<br>> From: <a href="mailto:mpich-discuss-bounces@mcs.anl.gov">mpich-discuss-bounces@mcs.anl.gov</a><br>> [mailto:<a href="mailto:mpich-discuss-bounces@mcs.anl.gov">mpich-discuss-bounces@mcs.anl.gov</a>] On Behalf Of<br>
> Gaetano Bellanca<br></div>
<div>
<div></div>
<div class="h5">> Sent: Monday, September 28, 2009 6:00 AM<br>> Cc: <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>> Subject: Re: [mpich-discuss] Problems running mpi application<br>
> on different CPUs<br>><br>> Dear Rajeev,<br>><br>> thanks for your help. I disabled the firewall on the server (the only<br>> one running) and tried with any other combination.<br>> All the clients together are running correctly. The same for the<br>
> processors on the server separately.<br>> The problem is only when I mix processes on the server and on<br>> the client.<br>><br>> When I run mpdtrace on the server, all the CPUs are<br>> responding correctly.<br>
> The same happens if I run 'hostname' in parallel.<br>><br>> It is probably a problem with my code, but it works on a cluster of 10<br>> Pentium IV PEs.<br>> I discovered some strange behavior:<br>
> 1) running the code with the server as a first machine of the<br>> pool, the<br>> code hangs with the previously communicated error;<br>> 2) if I put the server as a second machine of the pool, the<br>> code starts<br>
> and works regularly up to the writing procedures, opens the<br>> first file<br>> and then remains indefinitely waiting for something;<br>><br>> Should I compile mpich2 with some particular communicator? I have<br>
> nemesis, at the moment.<br>> I'm using this for mpich2 compilation:<br>> ./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx<br>> --enable-f90<br>> --enable-fast --enable-traceback --with-mpe --enable-f90modules<br>
> --enable-cache --enable-romio --with-file-system=nfs+ufs+pvfs2<br>> --with-device=ch3:nemesis --with-pvfs2=/usr/local<br>> --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra<br>><br>> Regards<br>
><br>> Gaetano<br>><br>> Rajeev Thakur wrote:<br>> > Try running on smaller subsets of the machines to debug the<br>> problem. It<br>> > is possible that a process on some machine cannot connect to another<br>
> > because of some firewall settings.<br>> ><br>> > Rajeev<br>> ><br>> ><br>> >> -----Original Message-----<br>> >> From: <a href="mailto:mpich-discuss-bounces@mcs.anl.gov">mpich-discuss-bounces@mcs.anl.gov</a><br>
> >> [mailto:<a href="mailto:mpich-discuss-bounces@mcs.anl.gov">mpich-discuss-bounces@mcs.anl.gov</a>] On Behalf Of<br>> Gaetano Bellanca<br>> >> Sent: Saturday, September 26, 2009 7:10 AM<br>> >> To: <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>
> >> Subject: [mpich-discuss] Problems running mpi application on<br>> >> different CPUs<br>> >><br>> >> Hi,<br>> >><br>> >> I'm sorry, but I posted my previous message with the wrong subject!<br>
> >><br>> >> I have a small cluster of<br>> >> a) 1 server: dual processor / quad core Intel(R) Xeon(R) CPU E5345<br>> >> b) 4 clients: single processor / dual core Intel(R)<br>> Core(TM)2 Duo CPU<br>
> >> E8400 connected with a 1Gbit/s Ethernet network.<br>> >><br>> >> I compiled mpich2-1.1.1p1 on the first system (a) and<br>> share mpich with<br>> >> the other computers via NFS. I have mpd running as root<br>
> on all the<br>> >> computers (Ubuntu 8.04, kernel 2.6.24-24-server)<br>> >><br>> >> When I run my code in parallel on the first system, it works<br>> >> correctly; the same happens running the same code in<br>
> parallel on the<br>> >> other computers (always running the code from the server). On the<br>> >> contrary, running the code using processors from both the<br>> server and<br>> >> the clients at the same time with the command:<br>
> >><br>> >> mpiexec -machinefile machinefile -n 4 my_parallel_code<br>> >><br>> >> I receive this error message:<br>> >><br>> >> Fatal error in MPI_Init: Other MPI error, error stack:<br>
> >> MPIR_Init_thread(394): Initialization failed<br>> >> (unknown)(): Other MPI error<br>> >> rank 3 in job 8 c1_4545 caused collective abort of all ranks<br>> >> exit status of rank 3: return code 1<br>
> >><br>> >> Should I use some particular flags in compilation or at run time?<br>> >><br>> >> Regards.<br>> >><br>> >> Gaetano<br>> >><br>> >> --<br>
> >> Gaetano Bellanca - Department of Engineering - University<br>> of Ferrara<br>> >> Via Saragat, 1 - 44100 - Ferrara - ITALY Voice (VoIP): +39 0532<br>> >> 974809 Fax: +39 0532 974870 mailto:<a href="mailto:gaetano.bellanca@unife.it">gaetano.bellanca@unife.it</a><br>
> >><br>> >> L'istruzione costa? Stanno provando con l'ignoranza!<br>> >><br>> >><br>> >><br>> ><br>> ><br>> ><br>><br>> --<br>> Gaetano Bellanca - Department of Engineering - University of Ferrara<br>
> Via Saragat, 1 - 44100 - Ferrara - ITALY<br>> Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870<br>> mailto:<a href="mailto:gaetano.bellanca@unife.it">gaetano.bellanca@unife.it</a><br>><br>> L'istruzione costa? Stanno provando con l'ignoranza!<br>
><br>><br>><br><br></div></div></blockquote></div><br><br clear="all">
<div></div><br>-- <br>"If you already know what recursion is, just remember the answer. Otherwise, find someone who is standing closer to<br>Douglas Hofstadter than you are; then ask him or her what recursion is." - Andrew Plotkin<br>