[MPICH] What is the most efficient MPICH-2 communication device for a dual-core dual-processor Linux PC?

Anthony Chan chan at mcs.anl.gov
Fri Sep 28 17:59:28 CDT 2007


I would suggest using ch3:nemesis, since it uses shared memory
and is actively being worked on.
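
For reference, a typical way to build MPICH2 with the nemesis channel
looks roughly like this (the install prefix here is just an example):

    ./configure --with-device=ch3:nemesis --prefix=/usr/local/mpich2-nemesis
    make
    make install

You would then recompile your application against that install before
rerunning your timings.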

A.Chan

On Fri, 28 Sep 2007, Gus Correa wrote:

> Dear MPICH experts
>
> What is the most efficient MPICH-2 communication device for a
> dual-core dual-processor Linux PC (i.e. not a cluster, not distributed
> memory)?
>
> Among the four alternatives ch3:sock, ch3:ssm, ch3:shm, and ch3:nemesis,
> I presume ssm and shm work the same way on an SMP machine like this,
> but beyond that I don't know which of the four devices would be best.
>
> I don't need threading or any MPI-2-specific features such as dynamic
> process creation,
> so I guess I can choose any of the four devices.
> I am just concerned about performance,
> and to take the best possible advantage of the multicore /
> multiprocessor feature.
>
> This question has probably been asked before,
> so I would be happy if you could just point me
> to the right thread in the mpich-discuss list archive that may contain
> the answer or a simple benchmark.
>
> There is more information about my problem below, if you need it.
>
> Many thanks,
> Gus Correa
>
> *************
>
> More info:
>
> 0) I am using ch3:sock, just because it is the default!  (See the aside
> after this list for one way to check which device a build uses.)
> 1) The computer is a 64-bit dual-core dual-processor 3 GHz Intel Xeon
> standalone PC with 4 GB of memory.
> 2) The OS is Linux 2.6.22.5-76.fc7 (Fedora Core 7).
> 3) The compilers are gcc and Intel Fortran 10.0.023.
> 4) The version of MPICH-2 is 1.0.5p4, but I can upgrade to the latest
> greatest, if this is required.
> 5) Compilation was done in 64-bit mode.
> 6) The program I am running is AM2.1 from GFDL (see
> http://www.gfdl.noaa.gov/fms/), written in Fortran 90/95.
> It is a typical grid / domain-decomposition problem, although quite a
> complex code.
> I don't know if the program scales well with the number of processors, BTW,
> but it is not dominated by I/O, and the computational effort
> is in number crunching and communication.
> 7) The execution times (wall clock) I get for one model-month simulation
> are:
>
>    - 1 process  : 224 minutes
>    - 2 processes : 122 minutes
>    - 3 processes : 108 minutes
>
> (FYI, due to restrictions in the program itself, I cannot use 4 processes.
> The program requires that the number of processes evenly divide 90, the
> number of points in one of the grid directions, and 4 does not.)
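>
> (As an aside: assuming the mpich2version utility from the MPICH2 install
> is on your PATH, running
>
>     mpich2version
>
> prints the version and the device the build was configured with.)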
>
> Note that the execution time ratio between 2 and 1 process is close to
> 1/2, which made me happy.  :)
> However, with 3 processes the run takes almost the same time as with 2,
> so things become disappointingly slow.  :(
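>
> In other words, the speedups relative to the single-process run work out
> to roughly:
>
>    - 2 processes: 224 / 122 ~ 1.84  (parallel efficiency ~ 92%)
>    - 3 processes: 224 / 108 ~ 2.07  (parallel efficiency ~ 69%)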
>
> What is the reason for this slowdown?
> Should I use another device instead of ch3:sock?  Which comm device is
> the best?
> Or does it slow down because the multicore feature kicks in when I move
> from 2 to 3 processes,
> and perhaps multicore is not really so fast?  (In which case changing
> the comm device would not speed things up.)
>
> Thank you,
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
> ---------------------------------------------------------------------
>
>
