[mpich-discuss] MPICH2-1.0.8 performance issues on Opteron Cluster

Mon Jan 5 13:27:23 CST 2009

It might be a good idea to accompany each MPICH2 release with extensive
performance benchmarks on popular MPI-based numerical libraries.
I can think of these:
http://www.netlib.org/scalapack
http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview
http://www.mcs.anl.gov/petsc/petsc-as/

For instance, generate a couple of large random matrices and run O(n^3)
ScaLAPACK routines (pdgemm, etc.) to demonstrate that the newest MPICH2
release is indeed an improvement.
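
A rough sketch of the kind of harness this suggests, under loud assumptions:
the naive matmul() below is only a single-process stand-in for a real
pdgemm call, and all function names are made up for illustration. The idea
is to time the same O(n^3) kernel on the same random inputs under each
MPICH2 build and compare the best wall-clock times.

```python
import random
import time


def random_matrix(n, seed):
    """Return an n x n matrix of uniform random floats (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Naive O(n^3) product. Hypothetical stand-in for a ScaLAPACK pdgemm call."""
    b_cols = list(zip(*b))  # transpose once so the inner loop walks rows
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def time_kernel(kernel, n, repeats=3):
    """Best-of-`repeats` wall-clock seconds for kernel(a, b) on n x n inputs."""
    a = random_matrix(n, seed=1)
    b = random_matrix(n, seed=2)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(a, b)
        best = min(best, time.perf_counter() - start)
    return best


def gemm_gflops(n, seconds):
    """GFLOP/s for an n x n matrix product, counting 2*n^3 floating-point operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def relative_change(old_seconds, new_seconds):
    """Fraction of run time saved by the new build (negative means a regression)."""
    return (old_seconds - new_seconds) / old_seconds


if __name__ == "__main__":
    n = 64  # tiny on purpose; a real comparison would use large matrices
    seconds = time_kernel(matmul, n)
    print(f"n={n}: {seconds:.4f} s, {gemm_gflops(n, seconds):.4f} GFLOP/s")
```

With timings from two releases in hand, relative_change(old, new) gives one
number per benchmark that could be published alongside the release notes.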

> Do you know if this is the case with your apps?
Always one process per core, no threads (export OMP_NUM_THREADS=1), no CPU contention.
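
For anyone who wants to double-check the same thing on their own nodes, a
minimal sketch follows. It assumes you already know how many MPI ranks land
on each node; the helper names are hypothetical, and os.cpu_count() reports
logical CPUs, which may differ from physical cores.

```python
import os


def omp_threads(environ=None):
    """Return OMP_NUM_THREADS as an int, or None if it is unset or not a plain integer.

    When the variable is unset, the OpenMP runtime picks its own default,
    so None here means "unknown", not "one thread".
    """
    env = os.environ if environ is None else environ
    value = env.get("OMP_NUM_THREADS")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None  # e.g. a comma-separated list for nested parallelism


def oversubscribed(ranks_on_node, threads_per_rank, cores):
    """True if ranks * threads exceeds the cores available on the node."""
    return ranks_on_node * threads_per_rank > cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    threads = omp_threads() or 1  # assumption: treat unknown as 1 for this quick check
    ranks = cores                 # assumption: one rank per core, as described above
    print("oversubscribed:", oversubscribed(ranks, threads, cores))
```

On a node with two quad-core CPUs, 8 ranks with 1 thread each is fine, while
8 ranks with 2 threads each would be flagged.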

> Would you be able to try the latest alpha version of 1.1?
I would be glad to help when I am back; I am on vacation until Jan 16 :-)

Thank you!

On Mon, 2009-01-05 at 12:15 -0500, Darius Buntinas wrote:
> James, Dmitry,
> 
> Would you be able to try the latest alpha version of 1.1?
> 
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1a2/src/mpich2-1.1a2.tar.gz
> 
> Nemesis is the default channel in 1.1, so you don't have to specify
> --with-device= when configuring.
> 
> Note that if you have more than one process and/or thread per core,
> nemesis won't perform well.  This is because nemesis does active polling
> (but we expect to have a non-polling option for the final release).  Do
> you know if this is the case with your apps?
> 
> Thanks,
> -d
> 
> On 01/05/2009 09:15 AM, Dmitry V Golovashkin wrote:
> > We had similar experiences with nemesis in a prior MPICH2 release
> > (ScaLAPACK-like applications on a multicore Linux cluster).
> > The resulting run times were remarkably slower. The nemesis channel was
> > an experimental feature back then, so I attributed the slower
> > performance to a possible misconfiguration.
> > Is it possible to submit a new ticket (for non-ANL folks)?
> > 
> > 
> > 
> > On Mon, 2009-01-05 at 09:00 -0500, James S Perrin wrote:
> >> Hi,
> >> 	I thought I'd just mention that I too have found that our software 
> >> performs poorly with nemesis compared to ssm on our multi-core machines. 
> >> I've tried it on both 2x dual-core AMD x64 and 2x quad-core Xeon x64 
> >> machines. It's roughly 30% slower. I've not yet been able to analyse 
> >> where the nemesis version is losing out.
> >>
> >> 	The software performs mainly point-to-point communication in a master 
> >> and slaves model. As the software is interactive the slaves call 
> >> MPI_Iprobe while waiting for commands. Having compiled against the ssm 
> >> version would have no effect, would it?
> >>
> >> Regards
> >> James
> >>
> >> Sarat Sreepathi wrote:
> >>> Hello,
> >>>
> >>> We got a new 10-node Opteron cluster in our research group. Each node 
> >>> has two quad-core Opterons. I installed MPICH2-1.0.8 with the Pathscale 
> >>> (3.2) compilers and three device configurations (nemesis, ssm, sock). I 
> >>> built and tested the Linpack (HPL) benchmark with the ACML 4.2 BLAS 
> >>> library for each of the three device configurations.
> >>>
> >>> I observed some unexpected results: the 'nemesis' configuration gave 
> >>> the worst performance. For the same problem parameters, the 'sock' 
> >>> version was faster and the 'ssm' version hung. For further analysis, I 
> >>> obtained screenshots from the Ganglia monitoring tool for the three 
> >>> different runs. As you can see from the attached screenshots, the 
> >>> 'nemesis' version consumes more 'system cpu' according to Ganglia. 
> >>> The 'ssm' version fares slightly better but hangs towards the end.
> >>>
> >>> I may be missing something trivial here, but can anyone account for 
> >>> this discrepancy? Isn't the 'nemesis' or 'ssm' device recommended for 
> >>> this cluster configuration? Your help is greatly appreciated.