[petsc-users] Very poor speed up performance

Yongjun Chen yjxd.chen at gmail.com
Wed Dec 22 11:12:43 CST 2010


On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <balay at mcs.anl.gov> wrote:

> On Wed, 22 Dec 2010, Yongjun Chen wrote:
>
> > On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> > > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz
> > > >
> > > > Memories: 16 * 2 GB DDR2 333 MHz, dual channel, data width 64 bit, so the
> > > > memory bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.
> > >
> > >    Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not
> > > enough for iterative solvers, in fact this is absolutely terrible for
> > > iterative solvers. You really want 5.4 GB/s PER core! This machine is
> > > absolutely inappropriate for iterative solvers. No package can give you good
> > > speedups on this machine.
> >
> > Barry, there are 16 memories, every 2 memories make up one dual channel,
> > thus in this machine there are 8 dual channel, each dual channel has the
> > memory bandwidth 5.4GB/s.
>
> What hardware is this? [processor/chipset?]
>

By dmidecode, it shows the processor is

Handle 0x0010, DMI type 4, 40 bytes
Processor Information
        Socket Designation: CPU 4
        Type: Central Processor
        Family: Quad-Core Opteron
        Manufacturer: AMD
        ID: 06 05 F6 40 74 03 E8 3D
        Signature: Family 5, Model 0, Stepping 6
        Flags:
                DE (Debugging extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (Fast floating-point save and restore)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Hyper-threading technology)
                TM (Thermal monitor supported)
        Version: Quad-Core AMD Opteron(tm) Processor 8360 SE
        Voltage: 1.5 V
        External Clock: 200 MHz
        Max Speed: 4600 MHz
        Current Speed: 2500 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x0011
        L2 Cache Handle: 0x0012
        L3 Cache Handle: 0x0013
        Serial Number: N/A
        Asset Tag: N/A
        Part Number: N/A
        Core Count: 4
        Core Enabled: 4
        Characteristics:
                64-bit capable
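
As a sanity check on the bandwidth figures quoted earlier in this thread, here is a minimal sketch that reproduces the arithmetic 64/8*166*2*2 and splits the result per socket and per core. All numbers come from the posts above, not from a measurement. The even split of the 8 dual-channel pairs over the 4 sockets (2 per socket) is an assumption, and the constant and function names are illustrative only. With a 166 MHz clock the product is 5312 MB/s, which the thread rounds to 5.4 GB/s.

```python
# Sketch of the memory-bandwidth arithmetic quoted in this thread.
# Figures are taken from the posts, not measured on the machine.

BUS_WIDTH_BITS = 64      # data width per channel
BASE_CLOCK_MHZ = 166     # clock used in the quoted formula
DDR_FACTOR = 2           # two transfers per clock (double data rate)
CHANNELS_PER_PAIR = 2    # dual channel: two DIMMs operate together

DIMMS = 16
SOCKETS = 4
CORES_PER_SOCKET = 4


def dual_channel_bandwidth_mb_s():
    """Peak bandwidth of one dual-channel pair: 64/8 * 166 * 2 * 2 MB/s."""
    return BUS_WIDTH_BITS // 8 * BASE_CLOCK_MHZ * DDR_FACTOR * CHANNELS_PER_PAIR


pairs = DIMMS // 2                          # 8 dual-channel pairs
per_pair = dual_channel_bandwidth_mb_s()    # MB/s for one pair
# Assumption: the pairs are spread evenly, i.e. 2 pairs per socket.
per_socket = per_pair * pairs // SOCKETS
per_core = per_socket / CORES_PER_SOCKET    # if all 4 cores stream at once

print(f"per dual-channel pair: {per_pair} MB/s")
print(f"per socket:            {per_socket} MB/s")
print(f"per core (4 active):   {per_core:.0f} MB/s")
```

Under these assumptions each core gets roughly half of one pair's peak bandwidth when all four cores on a socket stream from memory at the same time.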





> From what you say - it looks like each chip has 4 cores, and 2
> dual-channel memory controllers for each of them.
>
> The question is - does the hardware provide scalable memory-bandwidth
> per core?  Most machines don't.
>

This point is not clear to me right now.



> I.e. the same 5.4*2 GB/s is available for the 1-core run as well as the
> 4-core run.
>
> So if the algorithm is able to use 5.4 GB/s [or more] for 1 thread,
> 10.8 [or more] for 2 threads - you would just see scalable performance
> from 1 to 2, and 3, 4 would perhaps be slightly incremental to the
> 2-core performance.
>
> Satish
>
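
To make the scaling argument above concrete, here is a minimal sketch of an idealized bandwidth-bound model. It assumes each thread can consume 5.4 GB/s and that one socket delivers at most 10.8 GB/s, the figures from the quoted text. The function name `modeled_speedup` is hypothetical, and the model ignores caches and the "slightly incremental" gains mentioned for 3 and 4 cores.

```python
# Idealized model: speedup is capped by the memory bandwidth of one socket.
# The 5.4 and 10.8 GB/s figures are the ones quoted in this thread.

def modeled_speedup(threads, per_thread_demand_gb_s, socket_bandwidth_gb_s):
    """Speedup over 1 thread when performance is purely bandwidth-bound."""
    # Bandwidth actually delivered: total demand, capped by the socket limit.
    delivered = min(threads * per_thread_demand_gb_s, socket_bandwidth_gb_s)
    return delivered / per_thread_demand_gb_s


for n in range(1, 5):
    print(n, "thread(s):", modeled_speedup(n, 5.4, 10.8))
```

In this model the speedup grows from 1 to 2 threads and then stays flat, because the socket's bandwidth is already saturated by 2 threads.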