general question on speed using quad core Xeons
Satish Balay
petsc-maint at mcs.anl.gov
Mon Apr 21 09:18:47 CDT 2008
On Mon, 21 Apr 2008, amjad ali wrote:
> Hello Petsc team (especially Satish and Barry).
>
> YOU SAID: FOR Better performance
>
> (1) high per-CPU memory performance. Each CPU (core in dual-core systems)
> needs to have its own memory bandwidth of roughly 2 or more gigabytes.
This 2GB/core number is a rabbit pulled out of a hat. We just put
out some reference point a few years back, for SMP machines [when the
age of multi-core chips hadn't yet begun].

Now Intel has chipsets that can give 25GB/s - and puts 4 or 8 cores
behind them [i.e. roughly 6GB/s per core for a 4-core machine and
3GB/s per core for an 8-core machine].

But the trend now is to cram in more and more cores - so expect the
number of cores to increase faster than the chipset memory bandwidth
[i.e. bandwidth per core is likely to get smaller and smaller].
>
> (2) MEMORY BANDWIDTH PER CORE, the higher that is the better performance
> you get.
>
> From these points I started to look for RAM Sticks with higher MHz rates
> (and obviously CPUs and motherboards supporting this speed).
>
> But you also reflected to:
>
> http://www.intel.com/performance/server/xeon/hpc_ansys.htm
> http://www.intel.com/performance/server/xeon/hpc_md_nastr.htm
>
> On these pages you pointed out that systems with CPUs of 20% higher FSB
> speed perform 20% better. But note that RAM speed is also 20% higher
> for the better-performing system (i.e. 800MHz vs 667MHz).
>
> So my question is: which is the actual indicator of "memory bandwidth"
> per core?
> Whether it is
> (1) CPU's FSB speed
> (2) RAM speed
> (3) Motherboard's System Bus Speed.
The answer is a bit complicated here. It depends upon the system
architecture.
CPU Chip[s] <-----> chipset <-----> memory [banks]
- Is the bandwidth on the CPU-chip side the same as on the memory
side? [there are machines where this differs, but most machines use
*synchronous* buses - so the 'memory chipset' does not have to do
translation/buffering]
For eg - on an Intel Xeon machine with DDR2-800, on the memory bus
side you have:

  bandwidth = 2 (banks) * 2 (DDR) * 8 (bytes/bus) * 800 MHz = 25.6 GByte/sec

On the CPU side this is balanced by FSB1600:

  bandwidth = 1600 MHz * 8 (bytes/bus) * 2 (CPU chips) = 25.6 GByte/sec
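The same arithmetic as a throwaway C snippet [purely illustrative -
the numbers are the ones above, nothing is probed from the actual
hardware]:

  #include <stdio.h>

  int main(void)
  {
    /* memory side: 2 banks * 2 (DDR) * 8-byte bus * 800 MHz */
    double mem = 2.0 * 2.0 * 8.0 * 800e6 / 1e9;
    /* CPU side: FSB1600 * 8-byte bus * 2 CPU chips */
    double fsb = 1600e6 * 8.0 * 2.0 / 1e9;

    printf("memory-side peak: %.1f GByte/sec\n", mem);  /* 25.6 */
    printf("CPU-side peak:    %.1f GByte/sec\n", fsb);  /* 25.6 */
    return 0;
  }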
So generally all 3 things you've listed have to *match* correctly.
[Some CPUs and chipsets support multiple FSB frequencies - so you
have to check what frequency is set on the machine you are buying.]
This choice can have *cost* implications. Is it worth spending 20%
more to get 20% more bandwidth? Perhaps yes for sparse-matrix
applications [which are memory-bandwidth bound, as sketched below] -
but not for others.
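To see why sparse-matrix kernels are so bandwidth-hungry, here is a
minimal CSR matrix-vector product [the standard textbook loop - not
PETSc's actual implementation]:

  /* Each nonzero costs 2 flops (multiply + add) but streams at
     least 12 bytes: an 8-byte matrix value plus a 4-byte column
     index - so the memory bus saturates long before the FPU does. */
  void csr_matvec(int nrows, const int *rowptr, const int *col,
                  const double *val, const double *x, double *y)
  {
    int i, j;
    for (i = 0; i < nrows; i++) {
      double sum = 0.0;
      for (j = rowptr[i]; j < rowptr[i+1]; j++)
        sum += val[j] * x[col[j]];
      y[i] = sum;
    }
  }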
> How could we ensure "memory bandwidth of roughly 2 or more gigabytes" per
> CPU core? (Higher CPU FSB speed, or RAM speed, or motherboard system bus
> speed?)
As mentioned, 2GB/core is an approximate number we thought up a few
years back - when there were no multi-core machines [just SMP
chipsets].

All we can do is evaluate the memory-bandwidth number for a given
machine. We can't *ensure* it - this is a choice made by the chip
designers [Intel, AMD, IBM etc.], and the choice for the currently
available products was probably made a few years back.
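One way to evaluate it is a quick STREAM-triad-style probe of
sustained [not peak] bandwidth. A rough sketch - not the official
STREAM benchmark, and array size, timer resolution and compiler
flags all affect the result, so treat the output as a ballpark:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 20000000   /* 3 arrays x 160MB each - well past any cache */

  int main(void)
  {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double secs;
    clock_t t0;
    long i;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = clock();
    for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* triad */
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* the triad streams 3 doubles [24 bytes] per iteration */
    printf("~%.2f GB/s\n", 24.0 * N / secs / 1e9);
    return (a[0] == 7.0) ? 0 : 1;  /* use the result, so the loop
                                      isn't optimized away */
  }

Run one copy per core [e.g. under mpiexec] to see how the bandwidth
actually gets shared between the cores.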
There is another component to this memory bandwidth debate. Which of
the following do we want?
1. the best-scaling chip? [when comparing performance from 1 to N cores]
2. the overall best performance on 1 core, or on N cores [i.e. the node]?
And from the system-architecture issues mentioned above, there are a
couple of other things that influence this:
- are the CPU chips sharing bandwidth or splitting bandwidth?
- within a CPU chip [multi-core], is the memory bus shared or split?
The first can be achieved by the hardware splitting off 1/Nth of the
total available bandwidth per core. That shows scalable results - but
the 1-core performance can be low.

The second choice comes from not splitting, but sharing at the core
level. For eg: on Intel machines, memory bandwidth is divided at the
CPU-chip level [and shared by the cores within each chip].
For example, MatMult from ex2 on an 8-core Intel machine had the
following performance on 1, 2, 4, 8 cores:

  397, 632, 724, 749 [MFlop/s]
To me it's not clear which architecture is better. For publishing
scalability results the above numbers don't look good [only about a
1.9x speedup on 8 cores] - but it could be the best performance you
can squeeze out of any sequential job, or out of any 8-core
architecture.
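[For reference - numbers like these come out of PETSc's -log_summary
output. Something along the lines of:

  mpiexec -n 8 ./ex2 -log_summary

on 1, 2, 4, 8 processes, reading the MFlop/s column of the MatMult
row. The exact ex2 path and problem-size options depend on your
PETSc version and build, so treat this command as a sketch.]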
Satish