[petsc-users] Very poor speed up performance

Wed Dec 22 11:32:10 CST 2010

On Wed, 22 Dec 2010, Yongjun Chen wrote:

> On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> 
> > On Wed, 22 Dec 2010, Yongjun Chen wrote:
> >
> > > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > > >
> > > > > Memories: 16 *2 GB DDR2 333 MHz, dual channel, data width 64 bit, so
> > > > > the memory bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.
> > > >
> > > >    Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > > enough for iterative solvers, in fact this is absolutely terrible for
> > > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > > absolutely inappropriate for iterative solvers. No package can give you
> > > > good speedups on this machine.
> > >
> > > Barry, there are 16 memories, every 2 memories make up one dual channel,
> > > thus in this machine there are 8 dual channel, each dual channel has the
> > > memory bandwidth 5.4GB/s.
> >
> > What hardware is this? [processor/chipset?]
> >
> 
> By dmidecode, it shows the processor is
> 
> Handle 0x0010, DMI type 4, 40 bytes
> Processor Information
>         Socket Designation: CPU 4
>         Type: Central Processor
>         Family: Quad-Core Opteron
>         Manufacturer: AMD
>         ID: 06 05 F6 40 74 03 E8 3D
>         Signature: Family 5, Model 0, Stepping 6
>         Flags:
>                 DE (Debugging extension)
>                 TSC (Time stamp counter)
>                 MSR (Model specific registers)
>                 PAE (Physical address extension)
>                 CX8 (CMPXCHG8 instruction supported)
>                 APIC (On-chip APIC hardware supported)
>                 CLFSH (CLFLUSH instruction supported)
>                 DS (Debug store)
>                 ACPI (ACPI supported)
>                 MMX (MMX technology supported)
>                 FXSR (Fast floating-point save and restore)
>                 SSE2 (Streaming SIMD extensions 2)
>                 SS (Self-snoop)
>                 HTT (Hyper-threading technology)
>                 TM (Thermal monitor supported)
>         Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
>         Voltage: 1.5 V
>         External Clock: 200 MHz
>         Max Speed: 4600 MHz
>         Current Speed: 2500 MHz
>         Status: Populated, Enabled
>         Upgrade: Other
>         L1 Cache Handle: 0x0011
>         L2 Cache Handle: 0x0012
>         L3 Cache Handle: 0x0013
>         Serial Number: N/A
>         Asset Tag: N/A
>         Part Number: N/A
>         Core Count: 4
>         Core Enabled: 4
>         Characteristics:
>                 64-bit capable

OK - your machine has the following schematic [found via Google]:

http://www.qdpma.com/SystemArchitecture_files/013_Opteron.png

> > From what you say - it looks like each chip has 4 cores, and 2
> > dual-channel memory controllers for each of them.
> >
> > The question is - does the hardware provide scalable memory-bandwidth
> > per core?  Most machines don't.
> >
> 
> This point is not clear for me right now.

Hm.. the point is: the hardware designer had 2 choices:

- provide a separate memory controller per core [so each core gets only
  2.7GB/s - i.e. 4 memory controllers per CPU, and a common L2 cache
  across all cores is not possible]

- provide a single memory controller with 2 dual memory channels [i.e.
  10.8GB/s] that is shared by 1-4 cores. With this, there can be a
  single L2 cache for all 4 cores.

Which of the above 2 is a good design? The first one provides scalable
performance - but the second one doesn't. However, the first one limits
the performance of sequential [np=1] applications. The second one
provides all the bandwidth even to np=1 codes - so they might have better
sequential performance. And then there are performance differences due
to different cache synchronization issues..
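
To make the trade-off concrete, here is a rough sketch of the two designs
as a toy model in Python. The 5.4GB/s, 10.8GB/s and 2.7GB/s figures are the
ones quoted in this thread; the function names and the per-core "demand"
parameter are assumptions made up for illustration, and the model ignores
caches, NUMA effects and everything else that is not peak bandwidth.

```python
# Toy model only: peak-bandwidth figures are taken from the thread,
# everything else (names, the demand parameter) is hypothetical.

DUAL_CHANNEL_GBS = 5.4     # per dual channel, as quoted in the thread
CHANNELS_PER_SOCKET = 2    # two dual channels per 4-core socket
CORES_PER_SOCKET = 4

SOCKET_GBS = DUAL_CHANNEL_GBS * CHANNELS_PER_SOCKET   # 10.8 GB/s per socket
PER_CORE_GBS = SOCKET_GBS / CORES_PER_SOCKET          # 2.7 GB/s per core


def achieved_private(n_cores, demand_gbs):
    """Design 1: every core has its own controller, capped at PER_CORE_GBS."""
    return n_cores * min(demand_gbs, PER_CORE_GBS)


def achieved_shared(n_cores, demand_gbs):
    """Design 2: all cores share one controller, capped at SOCKET_GBS."""
    return min(n_cores * demand_gbs, SOCKET_GBS)


def speedup(achieved, n_cores, demand_gbs):
    """Speedup over 1 core for a purely bandwidth-bound kernel."""
    return achieved(n_cores, demand_gbs) / achieved(1, demand_gbs)


if __name__ == "__main__":
    # Hypothetical: GB/s a single core could consume if nothing limited it.
    demand = 5.4
    print("cores  private GB/s  shared GB/s  private speedup  shared speedup")
    for n in range(1, CORES_PER_SOCKET + 1):
        print(
            f"{n:5d}  {achieved_private(n, demand):12.1f}"
            f"  {achieved_shared(n, demand):11.1f}"
            f"  {speedup(achieved_private, n, demand):15.2f}"
            f"  {speedup(achieved_shared, n, demand):14.2f}"
        )
```

With a demand of 5.4GB/s per core, the shared design stops scaling at 2
cores in this model, while the private design scales linearly but starts
from only 2.7GB/s for the np=1 run. Real hardware will not be this clean.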

Satish

> 
> 
> 
> > I.e. the same 5.4*2GB/s is available for the 1 core run as well as the
> > 4 core run.
> >
> > So if the algorithm is able to use 5.4GB/s [or more] for 1 thread,
> > 10.8 [or more] for 2 threads - you would just see scalable performance
> > from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> > 2-core performance.
> >
> > Satish
> >
>