[petsc-users] Very poor speed up performance

Wed Dec 22 11:49:40 CST 2010

On Wed, Dec 22, 2010 at 6:32 PM, Satish Balay <balay at mcs.anl.gov> wrote:

> On Wed, 22 Dec 2010, Yongjun Chen wrote:
>
> > On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> >
> > > On Wed, 22 Dec 2010, Yongjun Chen wrote:
> > >
> > > > > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> > >
> > > > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > > > >
> > > > > > Memories: 16 * 2 GB DDR2 333 MHz, dual channel, data width 64 bit, so the
> > > > > > memory bandwidth for 2 memories is 64/8*166*2*2 = 5.4 GB/s.
> > > > >
> > > > >    Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > > > enough for iterative solvers, in fact this is absolutely terrible for
> > > > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > > > absolutely inappropriate for iterative solvers. No package can give you
> > > > > good speedups on this machine.
> > > >
> > > > Barry, there are 16 memory modules; every 2 modules make up one dual
> > > > channel, so this machine has 8 dual channels, and each dual channel has
> > > > a memory bandwidth of 5.4 GB/s.
> > >
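
For reference, the peak-bandwidth arithmetic quoted above can be spelled out as a small script. This is only a sketch that takes the posted figures at face value (64-bit data width, 166 MHz clock, double data rate, two channels per pair, 8 pairs, 16 cores); the constant and function names are made up for illustration. With 166 MHz the result is about 5.3 GB/s per dual channel, which the thread quotes as 5.4 GB/s. A theoretical peak says nothing about what a solver can actually sustain.

```python
# Peak-bandwidth arithmetic using the figures exactly as posted in the thread.
# All names below are illustrative, not taken from any tool or library.
BUS_WIDTH_BITS = 64        # data width per channel
CLOCK_MHZ = 166            # clock used in the posted formula
TRANSFERS_PER_CLOCK = 2    # double data rate
CHANNELS_PER_PAIR = 2      # dual channel
DUAL_CHANNEL_PAIRS = 8     # 16 modules, 2 per dual channel
CORES = 16                 # 4 CPUs * 4 cores


def dual_channel_peak_mb_s():
    """Peak MB/s of one dual channel: 64/8 * 166 * 2 * 2."""
    bytes_per_transfer = BUS_WIDTH_BITS // 8
    return bytes_per_transfer * CLOCK_MHZ * TRANSFERS_PER_CLOCK * CHANNELS_PER_PAIR


def machine_peak_mb_s():
    """Peak MB/s summed over all dual channels."""
    return dual_channel_peak_mb_s() * DUAL_CHANNEL_PAIRS


def per_core_peak_mb_s():
    """Peak MB/s per core if the aggregate were divided evenly."""
    return machine_peak_mb_s() / CORES


if __name__ == "__main__":
    print("one dual channel :", dual_channel_peak_mb_s(), "MB/s")
    print("whole machine    :", machine_peak_mb_s(), "MB/s")
    print("per core (even)  :", per_core_peak_mb_s(), "MB/s")
```

The even per-core split is an idealization; whether a core can actually reach its share depends on how the memory controllers are attached, which is what the rest of the thread discusses.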
> > > What hardware is this? [processor/chipset?]
> > >
> >
> > dmidecode shows the processor is
> >
> > Handle 0x0010, DMI type 4, 40 bytes
> > Processor Information
> >         Socket Designation: CPU 4
> >         Type: Central Processor
> >         Family: Quad-Core Opteron
> >         Manufacturer: AMD
> >         ID: 06 05 F6 40 74 03 E8 3D
> >         Signature: Family 5, Model 0, Stepping 6
> >         Flags:
> >                 DE (Debugging extension)
> >                 TSC (Time stamp counter)
> >                 MSR (Model specific registers)
> >                 PAE (Physical address extension)
> >                 CX8 (CMPXCHG8 instruction supported)
> >                 APIC (On-chip APIC hardware supported)
> >                 CLFSH (CLFLUSH instruction supported)
> >                 DS (Debug store)
> >                 ACPI (ACPI supported)
> >                 MMX (MMX technology supported)
> >                 FXSR (Fast floating-point save and restore)
> >                 SSE2 (Streaming SIMD extensions 2)
> >                 SS (Self-snoop)
> >                 HTT (Hyper-threading technology)
> >                 TM (Thermal monitor supported)
> >         Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
> >         Voltage: 1.5 V
> >         External Clock: 200 MHz
> >         Max Speed: 4600 MHz
> >         Current Speed: 2500 MHz
> >         Status: Populated, Enabled
> >         Upgrade: Other
> >         L1 Cache Handle: 0x0011
> >         L2 Cache Handle: 0x0012
> >         L3 Cache Handle: 0x0013
> >         Serial Number: N/A
> >         Asset Tag: N/A
> >         Part Number: N/A
> >         Core Count: 4
> >         Core Enabled: 4
> >         Characteristics:
> >                 64-bit capable
>
> ok - your machine has the following schematic.. [from google]
>
> http://www.qdpma.com/SystemArchitecture_files/013_Opteron.png
>
> > > From what you say - it looks like each chip has 4 cores, and 2
> > > dual-channel memory controllers for each of them.
> > >
> > > The question is - does the hardware provide scalable memory-bandwidth
> > > per core?  Most machines don't.
> > >
> >
> > This point is not clear to me right now.
>
> Hm.. the point is: the hardware designer had 2 choices:
>
> - provide a single memory controller per core [so each core gets only
>  2.7 GB/s - i.e. 4 memory controllers per CPU, and a common L2 cache
>  across all cores is not possible]
>
> - provide a single memory controller with 2 dual memory channels [i.e.
>  10.8 GB/s] that's shared by 1-4 cores. With this - there can be a
>  single L2 cache for all 4 cores.
>
> Which of the above 2 is a good design? The first one provides scalable
> performance - but the second one doesn't. Also the first one limits
> the performance of sequential [np=1] applications. The second one
> provides all the bandwidth even to np=1 codes - so they might have better
> sequential performance. And then there are performance differences due to
> cache synchronization issues..
>
> Satish
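
The two design choices described above can be compared with a toy model. This is a rough sketch, not a description of the actual Opteron hardware: it assumes the 5.4 GB/s per dual channel and 2 dual channels per 4-core CPU quoted in the thread, and the function names are hypothetical.

```python
# Toy comparison of the two memory-controller designs discussed above.
# Figures come from the thread; the model itself is an illustrative assumption.
DUAL_CHANNEL_GB_S = 5.4     # peak of one dual channel, as quoted
DUAL_CHANNELS_PER_CPU = 2   # per 4-core chip, as quoted
CORES_PER_CPU = 4

CPU_PEAK_GB_S = DUAL_CHANNEL_GB_S * DUAL_CHANNELS_PER_CPU  # 10.8 GB/s


def private_controller_bw(active_cores):
    """Design 1: every core has its own controller with a fixed 2.7 GB/s share."""
    per_core = CPU_PEAK_GB_S / CORES_PER_CPU
    return active_cores * per_core


def shared_controller_bw(active_cores):
    """Design 2: one controller; the full 10.8 GB/s is reachable by any active core."""
    return CPU_PEAK_GB_S if active_cores > 0 else 0.0


if __name__ == "__main__":
    print("cores  private(GB/s)  shared(GB/s)")
    for n in range(1, CORES_PER_CPU + 1):
        print(f"{n:5d}  {private_controller_bw(n):13.1f}  {shared_controller_bw(n):12.1f}")
```

In this model the private design grows linearly with the number of active cores but caps a sequential run at 2.7 GB/s, while the shared design gives a sequential run the full 10.8 GB/s and gains nothing more as cores are added. Both reach the same 10.8 GB/s with all 4 cores busy.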
>
Thanks a lot, Satish. It is much clearer now. But dmidecode does not show
which of the two designs this machine uses. Do you know any way to find
that out?

>
>
> >
> >
> >
> > > I.e. the same 5.4*2 GB/s is available for a 1-core run as well as the
> > > 4-core run.
> > >
> > > So if the algorithm is able to use 5.4 GB/s [or more] for 1 thread,
> > > 10.8 [or more] for 2 threads - you would just see scalable performance
> > > from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> > > 2-core performance.
> > >
> > > Satish
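
The scaling argument above can be written as a simple saturation model. This is a sketch under strong assumptions: the run is purely bandwidth-bound, each thread can consume a fixed bandwidth on its own (5.4 GB/s here, as in the example above), and the 4 cores share a 10.8 GB/s cap. The function name and parameters are hypothetical.

```python
# Saturation model for a memory-bandwidth-bound run on one 4-core CPU.
# Assumption: runtime is inversely proportional to delivered bandwidth.
def modeled_speedup(threads, demand_per_thread_gb_s=5.4, shared_cap_gb_s=10.8):
    """Speedup over 1 thread when all threads share one bandwidth cap."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    one_thread = min(demand_per_thread_gb_s, shared_cap_gb_s)
    delivered = min(threads * demand_per_thread_gb_s, shared_cap_gb_s)
    return delivered / one_thread


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"threads={n}  modeled speedup={modeled_speedup(n):.2f}")
```

With these numbers the model doubles from 1 to 2 threads and then stays flat for 3 and 4. A real run may still gain a little beyond 2 threads from work that hits in cache, which the model ignores.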
> > >
> >
>
>