[mpich-discuss] Strange behavior of mpd

Gus Correa gus at ldeo.columbia.edu
Mon Sep 29 19:50:00 CDT 2008


Hi Gaetano and list

Please note that I didn't suggest increasing the MPI buffer size.
What I suggested instead was to reduce the message size.
For instance, if you are passing big 3D array chunks now, say with a 
single mpi_gatherv call,
change the code to pass a bunch of smaller 2D array slices, also with 
mpi_gatherv.
This can be done in a loop, and you can progressively rebuild the
"global" (process 0) 3D array from the "global" 2D slices
retrieved by process 0 on each mpi_gatherv iteration of the loop.
It takes some coding, but it is not too hard.
A more sophisticated version would use MPI vector types to do the same.
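To make the idea concrete, here is a rough sketch of the slice-by-slice
gather. It is not your code: the names, and the assumption of a 1-D
decomposition along the second array index, are made up, so you would
have to adapt it to your own data layout.

      subroutine gather_by_slices(field, global, nx, ny, nz, ny_local, &
                                  counts, displs, comm)
      ! Gather a distributed 3D array on process 0 one 2D slice at a time,
      ! so each mpi_gatherv message carries nx*ny_local values instead of
      ! nx*ny_local*nz.  Assumes each process owns field(1:nx,1:ny_local,1:nz),
      ! i.e. the domain is split along the second index, so every 2D slice
      ! is contiguous in memory on both the send and the receive side.
      use mpi
      implicit none
      integer, intent(in) :: nx, ny, nz, ny_local, comm
      integer, intent(in) :: counts(*)  ! nx*ny_local of each rank
      integer, intent(in) :: displs(*)  ! offset (in elements) of each rank's
                                        ! first column within one global slice
      double precision, intent(in)    :: field(nx, ny_local, nz)
      double precision, intent(inout) :: global(nx, ny, nz) ! filled on rank 0
      integer :: k, ierr

      do k = 1, nz
         call MPI_GATHERV(field(1,1,k), nx*ny_local, MPI_DOUBLE_PRECISION, &
                          global(1,1,k), counts, displs, MPI_DOUBLE_PRECISION, &
                          0, comm, ierr)
      end do
      end subroutine gather_by_slices

With nz gatherv calls instead of one, the per-message size drops by a
factor of nz, at the cost of a little extra latency.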

You say that with a small problem size (dataset size) the program runs fine,
but breaks with a big problem size.
Since the failure depends on the problem size,
I thought the cause might be some scaling assumption that is being violated.
Hence, my previous suggestion to reduce the message size.
Message size cannot scale indefinitely on the large side.

As I said, I did have a similar problem here, and breaking the
messages down into smaller chunks fixed it.
MPI probably cannot provide buffers of arbitrary size, because of memory 
size, etc.
The fact that, for a large problem size, the program fails when you launch
a few processes but works when you launch more of them
reinforces the idea that message size is playing a role:
with more processes each message is probably smaller
(say, if your problem is of the domain-decomposition type),
and smaller messages are easier for MPI and the computers to "digest".

How big are your messages, how many of them are there,
and how much memory is available?
Can you monitor memory use while the program runs, say, using top?
I suggest that you do this with a small dataset, then with a big 
dataset, and compare.
When I had this kind of problem here,
there was a lot of memory swapping going on, and the program got stuck.
When the messages were made smaller, the swapping disappeared, and the 
program worked.

Well, this doesn't mean that the mpd boot startup scripts on your
machine are correct; they may well be wrong.
However, as Rajeev pointed out, mpd doesn't interfere with MPI 
communication,
and the interaction with the problem you see may be fortuitous.
I only use mpd on standalone workstations, where I launch it by hand,
so I can't say much about it.
I don't use Ubuntu either (Fedora and CentOS reign here).
On a cluster I use the torque job manager and the corresponding  
mpiexec, not mpd.

Also, on your AMD 64 X2 Dual Core 5600+ Linux box you are 
oversubscribing the processors/cores.
In my experience, this is fine with small messages (in the MPI "hello, 
world" program you
can oversubscribe to tens of processes and it runs well), but 
oversubscription
doesn't scale well with message size, and breaks for large messages.

I hope this helps,
Gus Correa

---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Rajeev Thakur wrote:

> It is odd because mpd is not involved in the communication. 
> Communication takes place directly between the MPI processes.
>  
> Rajeev
>
>     ------------------------------------------------------------------------
>     *From:* owner-mpich-discuss at mcs.anl.gov
>     [mailto:owner-mpich-discuss at mcs.anl.gov] *On Behalf Of *Bellanca
>     Gaetano
>     *Sent:* Monday, September 29, 2008 3:48 PM
>     *To:* mpich-discuss at mcs.anl.gov
>     *Subject:* Re: [mpich-discuss] Strange behavior of mpd
>
>     Hi Gus,
>
>     I thought of something similar, and modified/increased the size
>     of the MPI buffer, but this didn't solve the problem.
>     Moreover, I had the same problem on my PC at home, and on a two
>     CPU system at the University, where there are 16GB of memory.
>
>     Moreover, in both configurations the simulation works once I
>     have restarted mpd after boot, and also when mpd is not launched
>     at boot time (with root privileges) but is started by the user before
>     running the simulation.
>     Note that running mpd after boot with root privileges also
>     results in correct behavior of the code.
>
>     I really don't know: it seems to be something related to the boot-time
>     startup.
>     For the moment, I use the solution of mpd launched by the user,
>     but the one with mpd started by root at boot time is more
>     'elegant'.
>
>     Regards.
>
>     Gaetano
>
>     ------------------------------------------------------------------------
>     Gaetano Bellanca - Department of Engineering - University of Ferrara 
>     Via Saragat, 1 - 44100 - Ferrara - ITALY            
>     Voice (VoIP):  +39 0532 974809     Fax:  +39 0532 974870
>     mailto:gaetano.bellanca at unife.it
>     ------------------------------------------------------------------------
>
Hello Gaetano and list

Just a guess.
I saw a similar behavior long ago on a program from a user here
which sent very long messages (tens to hundreds of megabytes).
It was the complementary case to yours:
process 0 would broadcast big data arrays to the other processes,
instead of gathering the data from them.
The program would fail on a busy cluster, or would take forever to
complete when usage was low.
There was no "small dataset" test case to fall back on, as you have, though.

I presume MPI had problems buffering all that data.
I guess MPI message size doesn't scale all the way up to very large values,
and where it breaks may depend on how much memory you have, the
system, etc.

Splitting the messages into smaller chunks,
and using a communication loop controlled by the actual size of the data 
being passed
(the larger the total data size, the larger the number of loop iterations,
to keep the message size reasonably small)
solved the problem.
Of course there are more sophisticated ways to implement this
using MPI data types to split the data/messages, etc.
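The chunked loop amounts to something like the sketch below (made-up
names, not the original code): it broadcasts a 1-D buffer from the root
in pieces of at most max_chunk elements, so the number of iterations
grows with the total data size while each individual message stays
reasonably small.

      subroutine bcast_in_chunks(buf, n_total, max_chunk, root, comm)
      ! Broadcast a large 1-D array in pieces of at most max_chunk elements.
      use mpi
      implicit none
      integer, intent(in) :: n_total, max_chunk, root, comm
      double precision, intent(inout) :: buf(n_total)
      integer :: first, nchunk, ierr

      first = 1
      do while (first <= n_total)
         nchunk = min(max_chunk, n_total - first + 1)  ! last piece may be shorter
         call MPI_BCAST(buf(first), nchunk, MPI_DOUBLE_PRECISION, root, comm, ierr)
         first = first + nchunk
      end do
      end subroutine bcast_in_chunks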

I hope this helps,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Bellanca Gaetano wrote:

Hi,

I'm running a simulation on a Linux box (Ubuntu 8 with kernel 
2.6.24-21-generic).
It is a Fortran program compiled with Intel 10.1 and mpich2-1.0.7.
The CPU is an AMD 64 X2 Dual Core 5600+.

The simulation works correctly with a small data set, and I can use 1, 
2, 4 ... processes (mpiexec -n #) to emulate a cluster, but when the 
data set is increased, the simulation runs only if I use more than 4 or 
6 processes (the number depends on the size of the data set; a 
bigger data set requires more processes).
 
It stops in a send-receive communication between all the PEs and PE0, 
and it also stops if I use mpi_gatherv.
But, very strangely, if I stop mpd (which starts at boot time) and restart 
it manually, the simulation works without any error on the same data set!
If I run the simulation on a Linux cluster, I see the same behavior, 
except that there, even after restarting mpd, the simulation doesn't work 
unless I use enough processes.

Any idea?

Regards.

gaetano



