[MPICH] MPICH2 performance tuning and characterising

Anthony Chan chan at mcs.anl.gov
Thu Mar 15 12:25:06 CDT 2007



On Thu, 15 Mar 2007, stephen mulcahy wrote:

> Hi,
>
> We're currently using MPICH2 (compiled with pgi) on a 20 node cluster of
> dual core opterons connected with a gigabit network to run an
> Oceanographic numerical model (http://www.myroms.org/). I've been
> reading the archives of this list and all the documentation on the
> MPICH2 website, but I haven't come across a lot of concrete performance
> tuning information. I did come across the MPE/SLOG/Jumpshot
> documentation but I'm still not sure I know enough to use the
> information that provides :)

If you have any questions on MPE2/SLOG2/Jumpshot, you can always ask us.

>
> We're currently using ch3:sock but wondering if we should look at
> ch3:ssm, ch3:nemesis or ch3:sctp (the 1.0.5 notes suggest that sctp on
> Linux still has issues). What kind of performance increases should I be
> expecting from these - percent or fractions of a percent? Also, what
> kind of performance improvement on a cluster should --enable-fast bring?

Does your MPI job require thread support?  If not, you should give
ch3:nemesis a try and compare the result with ch3:sock's.  The simplest
way to time your job is:

date ; mpiexec -np 40 myroms ... ; date
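
In case it helps, switching channels means reconfiguring and rebuilding
MPICH2.  A rough sketch (prefix, compiler names and options are only
illustrative; check the installer's guide of your release):

   ./configure --prefix=/opt/mpich2-nemesis \
               --with-device=ch3:nemesis \
               --enable-fast \
               CC=pgcc F77=pgf77 F90=pgf90
   make && make install

Note that --enable-fast mostly strips error checking and debugging
overhead inside the library, so its gain is usually small compared with
choosing the right channel for your hardware.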

>
> Are there other things I can do to tune the operation of MPICH2 at the
> OS level?

I would guess that you could tune the TCP settings (socket buffer sizes,
and possibly the ethernet frame size/MTU) based on the typical message
sizes used in your app.  A google search on the subject should help.
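
For example, on Linux you can raise the socket buffer limits with sysctl.
The numbers below are only illustrative (check your kernel documentation
and NIC driver notes before applying them):

   # /etc/sysctl.conf -- example values, tune to your network
   net.core.rmem_max = 4194304
   net.core.wmem_max = 4194304
   net.ipv4.tcp_rmem = 4096 87380 4194304
   net.ipv4.tcp_wmem = 4096 65536 4194304

   sysctl -p        # reload the settings

Enabling jumbo frames (MTU 9000) on the NICs and switch is another common
tweak, if your gigabit hardware supports it.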

> I'm not a Fortran programmer and we're using a 3rd party model
> so I'm not sure how much customisation is feasible (I'm assuming the
> developers of the model have done a certain amount of tuning of the
> basic model ... but any suggestions on how to verify this are also
> welcome).

To verify if the code is fairly optimized, you could enable MPE logging
on your app and see if the logfile makes sense to you.  If you are using
mpich2-1.0.5, you can relink your code with "mpif90 -mpe=mpilog"; running
the relinked binary will generate a clog2 file which can be "viewed" by
jumpshot (which will convert it to the native format, slog2).  If your
slog2 file is small enough, you could send it to us when you have
questions (we always like to look at logfiles :)).
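
A rough sequence, assuming MPE was built along with your mpich2-1.0.5
install (file names below are only illustrative):

   mpif90 -mpe=mpilog -o myroms *.o     # relink the model with MPE logging
   mpiexec -np 40 ./myroms ...          # run as usual; writes a clog2 file
   clog2TOslog2 myroms.clog2            # or let jumpshot do the conversion
   jumpshot myroms.slog2                # browse the timeline

In jumpshot, long stretches of MPI_Wait/MPI_Recv relative to compute time
are the usual sign that communication (or load imbalance) is the
bottleneck.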

> Is the speed at which MPI operates largely dependent on the
> interconnect latency or are there other factors involved?

Both latency and bandwidth of the network affect MPI performance.
If your MPI job communicates often with small messages, latency
will be the determining factor.

> MPI message sizes are estimated to be about 40k and 200k, does the size
> of the MPI message influence behaviour significantly?

Yes.  Given the estimated message sizes in your app, I would think the
bandwidth of the network plays the more crucial role in your
job performance.
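
As a back-of-the-envelope check (assuming roughly 50 usec end-to-end
latency and 100 MB/s usable bandwidth on gigabit ethernet, both only
ballpark figures):

   40 KB message:   50 usec +  40000/100e6 sec  ~   50 +  400 =  450 usec
   200 KB message:  50 usec + 200000/100e6 sec  ~   50 + 2000 = 2050 usec

so at these sizes most of the time goes into moving the bytes rather than
into per-message latency.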

> Do different sized messages have better affinity to different kinds of
> interconnect?

Probably not.

> What have people's experiences been in moving from Gigabit to Infiniband?
> Should one always realise significant performance benefits or does it
> depend on the load? What kind of MPI loads lend themselves to faster
> interconnects? How can I see if our MPI load has those characteristics?

If your app is well-written MPI code and is a tightly coupled parallel
job, a better interconnect will significantly improve performance.  I have
no experience with InfiniBand, so I can't comment on it.  For example, an
astrophysical code, FLASH2, performed much better with Myrinet than with
ethernet.

A.Chan



