[petsc-users] Very poor speed up performance
Satish Balay
balay at mcs.anl.gov
Wed Dec 22 11:32:10 CST 2010
On Wed, 22 Dec 2010, Yongjun Chen wrote:
> On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>
> > On Wed, 22 Dec 2010, Yongjun Chen wrote:
> >
> > > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > > >
> > > > > Memories: 16 * 2 GB DDR2 333 MHz, dual channel, data width 64 bit, so the
> > > > > memory bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.
> > > >
> > > > Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > > enough for iterative solvers, in fact this is absolutely terrible for
> > > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > > absolutely inappropriate for iterative solvers. No package can give you good
> > > > speedups on this machine.
> > >
> > > Barry, there are 16 memories, every 2 memories make up one dual channel,
> > > thus in this machine there are 8 dual channel, each dual channel has the
> > > memory bandwidth 5.4GB/s.
> >
> > What hardware is this? [processor/chipset?]
> >
>
> By dmidecode, it shows the processor is
>
> Handle 0x0010, DMI type 4, 40 bytes
> Processor Information
> Socket Designation: CPU 4
> Type: Central Processor
> Family: Quad-Core Opteron
> Manufacturer: AMD
> ID: 06 05 F6 40 74 03 E8 3D
> Signature: Family 5, Model 0, Stepping 6
> Flags:
> DE (Debugging extension)
> TSC (Time stamp counter)
> MSR (Model specific registers)
> PAE (Physical address extension)
> CX8 (CMPXCHG8 instruction supported)
> APIC (On-chip APIC hardware supported)
> CLFSH (CLFLUSH instruction supported)
> DS (Debug store)
> ACPI (ACPI supported)
> MMX (MMX technology supported)
> FXSR (Fast floating-point save and restore)
> SSE2 (Streaming SIMD extensions 2)
> SS (Self-snoop)
> HTT (Hyper-threading technology)
> TM (Thermal monitor supported)
> Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
> Voltage: 1.5 V
> External Clock: 200 MHz
> Max Speed: 4600 MHz
> Current Speed: 2500 MHz
> Status: Populated, Enabled
> Upgrade: Other
> L1 Cache Handle: 0x0011
> L2 Cache Handle: 0x0012
> L3 Cache Handle: 0x0013
> Serial Number: N/A
> Asset Tag: N/A
> Part Number: N/A
> Core Count: 4
> Core Enabled: 4
> Characteristics:
> 64-bit capable
OK - your machine has the following schematic [found via Google]:
http://www.qdpma.com/SystemArchitecture_files/013_Opteron.png
> > From what you say - it looks like each chip has 4 cores, and 2
> > dual-channel memory controllers for each of them.
> >
> > The question is - does the hardware provide scalable memory-bandwidth
> > per core? Most machines don't.
> >
>
> This point is not clear for me right now.
Hm.. the point is: the hardware designer had 2 choices:
- provide a single memory controller per core [so each core gets only
2.7GB/s, i.e. 4 memory controllers per CPU, and a common L2 cache
across all cores is not possible]
- provide a single memory controller with 2 dual memory channels [i.e.
10.8GB/s] that is shared by 1-4 cores. With this, there can be a
single L2 cache for all 4 cores.
Which of the above 2 is a good design? The first one provides scalable
performance, but the second one doesn't. However, the first one limits
the performance of sequential [np=1] applications. The second one
provides all the bandwidth even to np=1 codes, so they might have better
sequential performance. On top of that, there are performance differences
due to cache synchronization issues.
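
As a rough illustration of the trade-off above, here is a minimal Python
sketch. It assumes a purely bandwidth-bound code whose run time is inversely
proportional to the memory bandwidth it achieves. The function names and the
per-core demand parameter are made up for illustration; the 2.7 and 10.8 GB/s
figures are the ones quoted in this thread, not measurements.

```python
# Toy model of the two memory-controller designs discussed above.
# Assumption: the code is purely memory-bandwidth bound.
PER_CORE_GBS = 2.7   # design 1: each core has its own private controller
SHARED_GBS = 10.8    # design 2: one controller (2 dual channels) shared by all cores


def achieved_bandwidth(design, ncores, demand_per_core):
    """Aggregate bandwidth (GB/s) that ncores active cores can actually use.

    demand_per_core is how much bandwidth one core could consume if the
    memory system imposed no limit (a hypothetical input, in GB/s).
    """
    if design == "per_core":
        # Each core is capped by its own controller, so the total grows with ncores.
        return ncores * min(demand_per_core, PER_CORE_GBS)
    if design == "shared":
        # All cores draw from one pool, so the total is capped by the pool size.
        return min(ncores * demand_per_core, SHARED_GBS)
    raise ValueError("unknown design: %r" % design)


def speedup(design, ncores, demand_per_core):
    """Speedup over the np=1 run on the same design."""
    one = achieved_bandwidth(design, 1, demand_per_core)
    return achieved_bandwidth(design, ncores, demand_per_core) / one


if __name__ == "__main__":
    demand = 5.4  # GB/s one core could use; taken from the example in this thread
    for design in ("per_core", "shared"):
        for n in (1, 2, 4):
            bw = achieved_bandwidth(design, n, demand)
            print(design, "np=%d" % n, "%.1f GB/s" % bw,
                  "speedup %.1f" % speedup(design, n, demand))
```

With a demand of 5.4 GB/s per core, this model has the per-core design
scaling linearly to 4 cores but starting from only 2.7 GB/s at np=1, while
the shared design starts at 5.4 GB/s and stops improving after 2 cores.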
Satish
>
>
>
> > I.e. the same 5.4*2GB/s is available for a 1 core run as well as a 4 core
> > run.
> >
> > So if the algorithm is able to use 5.4GB/s [or more] for 1 thread,
> > 10.8 [or more] for 2 threads - you would just see scalable performance
> > from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> > 2-core performance.
> >
> > Satish
> >
>