general question on speed using quad core Xeons
Satish Balay
petsc-maint at mcs.anl.gov
Mon Apr 21 09:18:47 CDT 2008
On Mon, 21 Apr 2008, amjad ali wrote:
> Hello Petsc team (especially Satish and Barry).
>
> YOU SAID: FOR Better performance
>
> (1) high per-CPU memory performance. Each CPU (core in dual-core systems)
> needs to have its own memory bandwidth of roughly 2 or more gigabytes.
This 2GB/core number is a rabbit pulled out of a hat. We just put
out some reference point a few years back, for SMP machines [when the
age of multi-core chips hadn't yet begun].

Now Intel has chipsets that can give 25GB/s - and puts 4 or 8 cores
behind them [i.e. roughly 6GB/s per core for a 4-core machine and
3GB/s per core for an 8-core machine].

But the trend now is to cram in more and more cores - so expect the
number of cores to increase faster than the chipset memory bandwidth
[i.e. bandwidth per core is likely to get smaller and smaller].
>
> (2) MEMORY BANDWIDTH PER CORE, the higher that is the better performance
> you get.
>
> From these points I started to look for RAM Sticks with higher MHz rates
> (and obviously CPUs and motherboards supporting this speed).
>
> But you also reflected to:
>
> http://www.intel.com/performance/server/xeon/hpc_ansys.htm
> http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm
>
> On these pages you pointed out that systems with CPUs of 20% higher FSB
> speed perform 20% better. But note that RAM speed is also 20% higher
> for the better-performing system (i.e. 800MHz vs 667MHz).
>
> So my question is: which is the actual indicator of "memory bandwidth"
> per core?
> Whether it is
> (1) CPU's FSB speed
> (2) RAM speed
> (3) Motherboard's System Bus Speed.
The answer is a bit complicated here. It depends upon the system
architecture.
CPU Chip[s] <-----> chipset <-----> memory [banks]
- Is the bandwidth on the CPU-chip side the same as on the memory
side? [there are machines where this differs, but most machines use
*synchronous* buses - so the 'memory chipset' does not have to do
translation/buffering]
For eg - on an Intel Xeon machine with DDR2-800, on the memory bus
side you have:

  bandwidth = 2 (banks) * 2 (DDR) * 8 (bytes/bus) * 800 MHz = 25.6 GByte/sec

On the CPU side this is balanced by FSB1600:

  bandwidth = 1600 MHz * 8 (bytes/bus) * 2 (CPU chips) = 25.6 GByte/sec
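The same arithmetic as a throwaway C snippet [purely illustrative -
the numbers are the ones above, nothing is probed from the actual
hardware]:

  #include <stdio.h>

  int main(void)
  {
    /* memory side: 2 banks * 2 (DDR) * 8-byte bus * 800 MHz */
    double mem = 2.0 * 2.0 * 8.0 * 800e6 / 1e9;
    /* CPU side: FSB1600 * 8-byte bus * 2 CPU chips */
    double fsb = 1600e6 * 8.0 * 2.0 / 1e9;

    printf("memory-side peak: %.1f GByte/sec\n", mem);  /* 25.6 */
    printf("CPU-side peak:    %.1f GByte/sec\n", fsb);  /* 25.6 */
    return 0;
  }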
So generally all 3 things you've listed have to *match* correctly.
[Some CPUs and chipsets support multiple FSB frequencies - so you
have to check what frequency is set on the machine you are buying.]
This choice can have *cost* implications. Is it worth spending 20%
more to get 20% more bandwidth? Perhaps yes for sparse-matrix
applications [which are memory-bandwidth bound, as sketched below] -
but not for others.
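To see why sparse-matrix kernels are so bandwidth-hungry, here is a
minimal CSR matrix-vector product [the standard textbook loop - not
PETSc's actual implementation]:

  /* Each nonzero costs 2 flops (multiply + add) but streams at
     least 12 bytes: an 8-byte matrix value plus a 4-byte column
     index - so the memory bus saturates long before the FPU does. */
  void csr_matvec(int nrows, const int *rowptr, const int *col,
                  const double *val, const double *x, double *y)
  {
    int i, j;
    for (i = 0; i < nrows; i++) {
      double sum = 0.0;
      for (j = rowptr[i]; j < rowptr[i+1]; j++)
        sum += val[j] * x[col[j]];
      y[i] = sum;
    }
  }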
> How could we ensure "memory bandwidth of roughly 2 or more gigabytes" per
> CPU core? (Higher CPU FSB speed, or RAM speed, or motherboard system bus
> speed?)
As mentioned, 2GB/core is an approximate number we thought up a few
years back - when there were no multi-core machines [just SMP
chipsets].

All we can do is evaluate the memory-bandwidth number for a given
machine. We can't *ensure* it - this is a choice made by the chip
designers [Intel, AMD, IBM etc.], and the choice for the currently
available products was probably made a few years back.
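One way to evaluate it is a quick STREAM-triad-style probe of
sustained [not peak] bandwidth. A rough sketch - not the official
STREAM benchmark, and array size, timer resolution and compiler
flags all affect the result, so treat the output as a ballpark:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 20000000   /* 3 arrays x 160MB each - well past any cache */

  int main(void)
  {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double secs;
    clock_t t0;
    long i;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = clock();
    for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* triad */
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* the triad streams 3 doubles [24 bytes] per iteration */
    printf("~%.2f GB/s\n", 24.0 * N / secs / 1e9);
    return (a[0] == 7.0) ? 0 : 1;  /* use the result, so the loop
                                      isn't optimized away */
  }

Run one copy per core [e.g. under mpiexec] to see how the bandwidth
actually gets shared between the cores.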
There is another component to this memory bandwidth debate. Which of
the following do we want?
1. the best-scaling chip? [when comparing performance from 1 to N cores]
2. the overall best performance on 1 core, or on N cores [i.e. the node]?
And from the system-architecture issues mentioned above, there are a
couple of other things that influence this:
- are the CPU chips sharing bandwidth or splitting bandwidth?
- within a CPU chip [multi-core], is the memory bus shared or split?
The first can be achieved by the hardware splitting off 1/Nth of the
total available bandwidth per core. That shows scalable results - but
the 1-core performance can be low.

The second choice comes from not splitting, but sharing at the core
level. For eg: on Intel machines, memory bandwidth is divided at the
CPU-chip level [and shared by the cores within each chip].
For example, MatMult from ex2 on an 8-core Intel machine had the
following performance on 1, 2, 4, 8 cores:

  397, 632, 724, 749 [MFlop/s]
To me it's not clear which architecture is better. For publishing
scalability results the above numbers don't look good [only about a
1.9x speedup on 8 cores] - but it could be the best performance you
can squeeze out of any sequential job, or out of any 8-core
architecture.
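[For reference - numbers like these come out of PETSc's -log_summary
output. Something along the lines of:

  mpiexec -n 8 ./ex2 -log_summary

on 1, 2, 4, 8 processes, reading the MFlop/s column of the MatMult
row. The exact ex2 path and problem-size options depend on your
PETSc version and build, so treat this command as a sketch.]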
Satish