[mpich-discuss] Problems running mpi application on different CPUs

Gaetano Bellanca gaetano.bellanca at unife.it
Mon Sep 28 05:59:36 CDT 2009


Dear Rajeev,

Thanks for your help. I disabled the firewall on the server (the only 
machine that had one running) and tried every other combination.
The clients all run correctly together, and so do the processors on the 
server by themselves.
The problem appears only when I mix processes on the server and on the clients.
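
The combinations described above can be enumerated systematically rather than by hand. The following is only a minimal sketch: the host names in HOSTS are hypothetical placeholders, and both orderings of each pair are produced because the order of the machines in the pool may matter. It prints the machinefile contents and the matching mpiexec command (same form as the one used in this thread, with two processes) for each pair.

```python
from itertools import combinations

# Hypothetical host names (assumption): one server and four clients.
HOSTS = ["server", "c1", "c2", "c3", "c4"]


def host_pairs(hosts):
    """Return every pair of distinct hosts, in both orderings."""
    pairs = []
    for a, b in combinations(hosts, 2):
        pairs.append((a, b))
        pairs.append((b, a))
    return pairs


def machinefile_text(pair):
    """Build machinefile contents: one host name per line."""
    return "\n".join(pair) + "\n"


if __name__ == "__main__":
    for i, pair in enumerate(host_pairs(HOSTS)):
        name = "machinefile.%02d" % i
        # Show what each machinefile should contain, then the command to run.
        print("# %s contains: %s" % (name, " ".join(pair)))
        print("mpiexec -machinefile %s -n 2 my_parallel_code" % name)
```

A pair that fails in one ordering but works in the other would narrow the problem down to which machine hosts the first process.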

When I run mpdtrace on the server, all the CPUs respond correctly.
The same happens if I run 'hostname' in parallel.
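
A successful mpdtrace or parallel 'hostname' shows that the mpd ring can launch processes, but not necessarily that MPI ranks on different machines can open connections to each other. One thing that may be worth ruling out, as an assumption not confirmed anywhere in this thread, is that a machine's own host name resolves to a loopback address (Ubuntu installations commonly have a '127.0.1.1 <hostname>' line in /etc/hosts). The sketch below only parses /etc/hosts-style text and lists names other than localhost that map to 127.x.x.x; the function name is made up for illustration.

```python
import os


def loopback_names(hosts_text):
    """Return (name, address) pairs that map a non-localhost name to 127.x.x.x."""
    flagged = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if not address.startswith("127."):
            continue
        for name in names:
            # localhost entries on a loopback address are expected.
            if not name.startswith("localhost"):
                flagged.append((name, address))
    return flagged


if __name__ == "__main__":
    path = "/etc/hosts"
    if os.path.exists(path):
        with open(path) as fh:
            for name, address in loopback_names(fh.read()):
                print("%s resolves to loopback address %s" % (name, address))
```

Running it on the server and on each client would show whether any of them advertises itself under a loopback address.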

It is probably a problem in my code, although the same code works on a 
cluster of 10 Pentium IV PEs.
I also discovered some 'strange behavior':
1) with the server as the first machine in the pool, the code hangs 
with the error I reported previously;
2) with the server as the second machine in the pool, the code starts 
and runs normally up to the output routines, opens the first file, 
and then waits indefinitely for something.

Should I compile mpich2 with a particular communication device? I am 
using nemesis at the moment.
This is how I configure mpich2:
./configure --prefix=/opt/mpich2/1.1/intel11.1 --enable-cxx --enable-f90 \
    --enable-fast --enable-traceback --with-mpe --enable-f90modules \
    --enable-cache --enable-romio --with-file-system=nfs+ufs+pvfs2 \
    --with-device=ch3:nemesis --with-pvfs2=/usr/local \
    --with-java=/usr/lib/jvm/java-6-sun-1.6.0.07/ --with-pm=mpd:hydra

Regards

Gaetano

Rajeev Thakur wrote:
> Try running on smaller subsets of the machines to debug the problem. It
> is possible that a process on some machine cannot connect to another
> because of some firewall settings.
>
> Rajeev
>
>  
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gaetano Bellanca
>> Sent: Saturday, September 26, 2009 7:10 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] Problems running mpi application on 
>> different CPUs
>>
>> Hi,
>>
>> I'm sorry, but I posted my previous message with the wrong subject!
>>
>> I have a small cluster of
>> a) 1 server: dual processor / quad core Intel(R) Xeon(R) CPU E5345
>> b) 4 clients: single processor / dual core Intel(R) Core(TM)2 Duo CPU E8400
>> connected by a 1 Gbit/s Ethernet network.
>>
>> I compiled mpich2-1.1.1p1 on the first system (a) and share it with 
>> the other computers via NFS. I have mpd running as root on all the 
>> computers (Ubuntu 8.04, kernel 2.6.24-24-server).
>>
>> When I run my code in parallel on the first system, it works 
>> correctly; the same happens when I run it in parallel on the 
>> other computers (always launching it from the server). However, 
>> when I run the code using processors from both the server and 
>> the clients at the same time with the command:
>>
>> mpiexec -machinefile machinefile -n 4 my_parallel_code
>>
>> I receive this error message:
>>
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(394): Initialization failed
>> (unknown)(): Other MPI error
>> rank 3 in job 8  c1_4545   caused collective abort of all ranks
>>  exit status of rank 3: return code 1
>>
>> Should I use some particular flags in compilation or at run time?
>>
>> Regards.
>>
>> Gaetano
>>
>> -- 
>> Gaetano Bellanca - Department of Engineering - University of Ferrara 
>> Via Saragat, 1 - 44100 - Ferrara - ITALY Voice (VoIP): +39 0532 
>> 974809 Fax: +39 0532 974870 mailto:gaetano.bellanca at unife.it
>>
>> Education is expensive? They are trying ignorance!
>>
>>
>>     
>
>
>   

-- 
Gaetano Bellanca - Department of Engineering - University of Ferrara
Via Saragat, 1 - 44100 - Ferrara - ITALY
Voice (VoIP): +39 0532 974809 Fax: +39 0532 974870
mailto:gaetano.bellanca at unife.it

Education is expensive? They are trying ignorance!



