[mpich-discuss] Scalability of 'Intel Core 2 duo' cluster

Codner, Clay codner.cg at pg.com
Fri Mar 28 11:17:28 CDT 2008


Some general thoughts on scaling:

1) Application Dependent
	a) Every application has an inherent limit to its scalability.
Rarely is that limit infinite.  If my application doesn't scale well
past 64 CPUs, then at 128 CPUs it will actually slow down.  Said another
way, speed is a concave-down function of the number of CPUs, with a
maximum at a certain CPU count (see the scaling-model sketch after this
list).
	b) Some applications are written specifically for a given
architecture and then ported around.  They may not have been optimized
for the other architectures at all.
	c) Some apps hit the disk too often to scale very well.  This
may also be configurable.  If you have it configured wrong, it won't
scale.
2) Infrastructure Dependent
	a) Gigabit Ethernet on copper does not scale well... yes, I know,
if I do this and tweak that and the stars are aligned, and I sacrifice a
chicken at exactly 3:45 EST, then it is great.  But really, if you have
the funds, you are better off with something like Myrinet or InfiniBand.
I know that many people will say I'm wrong.  It's just my humble
opinion. P.S. - I have no experience with GigE over fiber.
	b) How you implement your network can have a big effect. I won't
even get into the many possibilities here. 
	c) The MPI implementation you are using can make around a 30%
difference in performance, sometimes even up to 50%. MPICH is the
standard, but HP MPI or Scali or something like that can make a HUGE
difference. They might have cost implications too....
	d) Heat - if your machines are running hot, they may actually
throttle themselves down.  Do the Core 2 duos and quad cores do this? I
don't actually know if they do, so you'll have to check them.
3) Memory
	a) Make sure you are not paging - AT ALL!!! If anything involved
in running your program is sent to virtual memory, kiss your performance
and scalability goodbye.  You don't want to be running anything you
don't need on your nodes.  (A quick way to verify is the page-fault
check sketched after this list.)
	b) Are you sharing this cluster? Are other programs running on
your nodes at the same time? Not strictly a memory issue, but make sure
you run your benchmarks without OPP (Other People's Programs) in your
way.
	c) Cache.  Generally, you need some kind of profiler to help
determine if you are thrashing your cache.  In the dual / quad core
world it can be simpler, though.  Try running only one process per CPU
socket and see what happens (the bandwidth probe sketched after this
list is one way to do it).  Sometimes you can use up the cache in
multicore CPUs even without thrashing.  See note 1b above as well. A
program that sings on an Altix can stink on a cluster if it is not tuned
properly. Profilers like the ones others have mentioned in this thread
can help.
4) Problem
	a) I can run the exact same code with a different problem and it
will scale differently depending on the physics involved, the size of
the problem, and a host of other things.  Understand if it is actually
the problem itself that doesn't scale.  Usually there are some
benchmarks available that you can try.
5) Wildcards
	a) I have no idea.  That's why they are wildcards, because every
system is a bit different and could have problems. 
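
To make 1a concrete, here is a toy scaling model in C with made-up
numbers (a fixed serial part, a perfectly parallel part, and a
per-process communication cost - none of it tied to any real
application).  The speedup peaks and then falls off, which is the
concave-down behavior I mean:

/* Toy scaling model with assumed numbers: fixed serial time, perfectly
   parallel work, and a communication cost that grows with the number
   of processes.  Shows why speedup can peak and then fall. */
#include <stdio.h>

int main(void)
{
    const double t_serial = 1.0;    /* assumed serial time (s)             */
    const double t_par    = 99.0;   /* assumed perfectly parallel work (s) */
    const double t_comm   = 0.02;   /* assumed comm cost per process (s)   */

    for (int n = 1; n <= 256; n *= 2) {
        double t = t_serial + t_par / n + t_comm * n;
        printf("%4d CPUs: time %7.3f s, speedup %6.2f\n",
               n, t, (t_serial + t_par) / t);
    }
    return 0;
}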
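
For 3a, one quick sanity check is to count major page faults around your
compute phase; anything much above zero means the run touched swap.  A
minimal sketch (Linux/Unix, using getrusage - adapt the "kernel" part to
your own code):

/* Report major (disk-backed) page faults around a compute phase. */
#include <stdio.h>
#include <sys/resource.h>

static long major_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_majflt;            /* faults that needed disk I/O */
}

int main(void)
{
    long before = major_faults();

    /* ... run your solver / benchmark kernel here ... */

    long after = major_faults();
    printf("major page faults during run: %ld\n", after - before);
    return 0;
}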
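
And for 3c, a rough per-rank bandwidth probe (just a sketch, not the
real STREAM benchmark): run it with 1, 2, 4... ranks per node and watch
the per-rank number.  If it collapses as you add ranks, the cores are
fighting over cache and memory:

/* STREAM-style triad per MPI rank; per-rank bandwidth drops when cores
   on a node contend for the memory system. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 10 * 1000 * 1000;        /* ~240 MB per rank total */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int rep = 0; rep < 10; rep++)
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + 3.0 * c[i];          /* triad */
    double mbs = 10.0 * 3.0 * n * sizeof(double) / (MPI_Wtime() - t0) / 1e6;

    printf("rank %d: %.0f MB/s (a[0]=%g)\n", rank, mbs, a[0]);
    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}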

Ideas:

 - Run benchmarks when the system is quiet.  Try variations of how many
cores per node you are using. Look carefully at memory, network, disk,
and any other performance measures you can get your hands on. (A quick
ping-pong sketch follows this list.)
 - Identify where the problem is.  If standard benchmarks scale as
expected on your system, the issue is probably in your application, its
memory use, or the problem itself.  If the benchmarks don't scale
properly, there is something wrong in the system.
 - Get performance tools.  There are many tools to help you identify
where the bottleneck is. 
 - Change compilers. Could help. 
 - Check system settings. I've seen large clusters where one node
decides for some reason to run at 10Mbps instead of Gigabit.
 - Recompile MPI? Change to a different MPI?
 - Change your network topology? Faster disk? More memory?
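
On the benchmark point, a minimal MPI ping-pong between two ranks placed
on different nodes gives a quick latency/bandwidth baseline for the
interconnect.  This is only a sketch, not a replacement for a real
benchmark suite (e.g. the Intel MPI Benchmarks):

/* Ping-pong between ranks 0 and 1: prints rough one-way latency and
   bandwidth for a range of message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    for (int bytes = 1; bytes <= 1 << 20; bytes <<= 4) {
        char *buf = calloc(bytes, 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes: %8.2f us  %8.2f MB/s\n",
                   bytes, dt * 1e6, bytes / dt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

Launch it with two ranks on separate nodes (e.g. via your machinefile)
and compare the numbers with what the hardware vendor claims.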

Hope this helps. 

-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Ladd
Sent: Friday, March 28, 2008 10:21 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Scalability of 'Intel Core 2 duo' cluster

There are many possible reasons for poor scaling with gigabit ethernet. 
The two most significant issues that I have found are
1) Poorly performing switches
2) Inadequate algorithms for collectives
These issues are discussed in reference to our Beowulf cluster at 
http://ladd.che.ufl.edu/research/beoclus/beoclus.htm. Bottom line is we 
have applications such as VASP and GROMACS that scale quite comparably 
on our GigE cluster to an Infiniband HPC system up to about 100 
processors. With TCP the Infiniband wins out but with GAMMA the GigE
cluster can outperform the HPC system.

Typical Edge switches (even high end ones costing ~$5K+) are 
oversubscribed on the backplane. This can lead to packet loss with a 
very large drop in performance. I found factors of 100 difference in 
throughput depending on the layout of the nodes on the switch. Details 
are on our website - it's a lot of stuff, but it's not a simple story.

The second big issue is collective performance. It is easy to check by
trying applications with only point-to-point messages. The best
collectives are in MPICH in general - it has the most advanced
algorithms. But the Alltoall and similar routines suck. I have mentioned
this to Rajeev and it is apparently being looked into. The problem is
MPICH posts all the receives at once and then sends messages essentially
randomly. This leads to oversubscription of the NICs and packet loss.
For reasons I don't understand it is much more problematic on multicore
nodes than single core. I am pretty sure a properly scheduled alltoall
would solve this problem. I tested a naive version under MPICH1 (same
algorithm) and it performed much better. I can see that there is a yet
better algorithm based on tournament scheduling, but I have not had time
to code a demo yet. Maybe over the summer.
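
For reference, a pairwise-scheduled exchange looks roughly like the
sketch below (illustrative only - it is not MPICH's implementation and
not the tournament schedule mentioned above, and it assumes a
power-of-two process count).  Each rank exchanges with exactly one
partner per step, so no NIC has to absorb many incoming messages at
once:

/* Pairwise-scheduled all-to-all: at step s, rank r exchanges one block
   with partner (r XOR s).  Power-of-two process counts assumed. */
#include <mpi.h>
#include <string.h>

int scheduled_alltoall(char *sendbuf, char *recvbuf,
                       int blockbytes, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* own block: straight copy */
    memcpy(recvbuf + (size_t)rank * blockbytes,
           sendbuf + (size_t)rank * blockbytes, blockbytes);

    for (int step = 1; step < nprocs; step++) {
        int partner = rank ^ step;           /* one partner per step */
        MPI_Sendrecv(sendbuf + (size_t)partner * blockbytes, blockbytes,
                     MPI_CHAR, partner, 0,
                     recvbuf + (size_t)partner * blockbytes, blockbytes,
                     MPI_CHAR, partner, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}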

Tony

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514




