[mpich-discuss] Strange behavior

Gus Correa gus at ldeo.columbia.edu
Mon Sep 29 09:55:00 CDT 2008


Hello Gaetano and list

Just a guess.
I saw similar behavior long ago in a user's program here
that sent very long messages (tens to hundreds of megabytes).
It was the complementary case to yours:
process 0 would broadcast big data arrays to the other processes
instead of gathering the data from them.
The program would fail on a busy cluster, or would take forever to
complete when use was low.
There was no "small dataset" test case to use, as you have, though.

I presume MPI had problems buffering all that data.
I guess MPI message sizes don't scale up indefinitely,
and where they break may depend on how much memory you have, the
system, etc.

Splitting the messages into smaller chunks,
and using a communication loop controlled by the actual size of the data
being passed (the larger the total data size, the more loop iterations,
to keep each message reasonably small),
solved the problem.
Of course there are more sophisticated ways to implement this
using MPI data types to split the data/messages, etc.
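
As a rough illustration of the chunking loop described above, here is a
minimal sketch in Python. It is not the original program (which is not
shown in this thread), and it does not call MPI: the names chunk_plan,
transfer_in_chunks, and the send callback are hypothetical. The callback
stands in for whatever MPI call (a broadcast, a send, etc.) would move each
piece in the real code.

```python
def chunk_plan(total_count, max_chunk):
    """Return (offset, count) pairs that cover total_count elements,
    with no pair longer than max_chunk elements."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    # The number of loop iterations grows with the data size
    # (ceiling division), so each message stays reasonably small.
    n_chunks = (total_count + max_chunk - 1) // max_chunk
    plan = []
    for i in range(n_chunks):
        offset = i * max_chunk
        count = min(max_chunk, total_count - offset)
        plan.append((offset, count))
    return plan


def transfer_in_chunks(data, max_chunk, send):
    """Call send(piece) once per chunk.

    `send` is a hypothetical placeholder for the MPI call that would
    transfer one piece in a real program.
    """
    for offset, count in chunk_plan(len(data), max_chunk):
        send(data[offset:offset + count])


if __name__ == "__main__":
    received = []
    # Here the "communication" just appends each piece to a list.
    transfer_in_chunks(list(range(10)), 4, received.extend)
    print(chunk_plan(10, 4))            # [(0, 4), (4, 4), (8, 2)]
    print(received == list(range(10)))  # True
```

In a real MPI program every process would have to compute the same plan
(same total size and same max_chunk), so that the senders and receivers
agree on the number and size of the messages.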

I hope this helps,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Bellanca Gaetano wrote:

> Hi,
>
> I'm running a simulation on a Linux box (Ubuntu 8 with kernel 
> 2.6.24-21-generic).
> It is a Fortran program compiled with Intel 10.1 and mpich2-1.0.7.
> The CPU is an AMD 64 X2 Dual Core 5600+
>
> The simulation works correctly with a small data set, and I can use 1, 
> 2, 4 ... processes (mpiexec -n #) to emulate a cluster, but when the 
> data set is increased, the simulation runs only if I use more than 4 
> or 6 processes (the number depends on the dimension of the data set; a 
> bigger data set requires an increased number of processes).
>  
> It stops in a send-receive communication between all the PEs and PE0, 
> and also stops if I use mpi_gatherv.
> But, very strangely, if I stop mpd (which starts at boot time) and 
> restart it manually, the simulation works without any error with the 
> same data set!
> If I run the simulation on a Linux cluster, I see the same behavior, 
> except that, even after restarting mpd, the simulation doesn't work if I 
> don't use enough processes.
>
> Any idea?
>
> Regards.
>
> gaetano
>
> ------------------------------------------------------------------------
> Gaetano Bellanca - Department of Engineering - University of Ferrara 
> Via Saragat, 1 - 44100 - Ferrara - ITALY            
> Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
> mailto:gaetano.bellanca at unife.it
> ------------------------------------------------------------------------
