<br><br><div class="gmail_quote">On Wed, Dec 22, 2010 at 5:54 PM, Satish Balay <span dir="ltr"><<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">On Wed, 22 Dec 2010, Yongjun Chen wrote:<br>
<br>
> On Wed, Dec 22, 2010 at 5:40 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
<br>
</div><div class="im">> > > Processors: 4 CPUS * 4Cores/CPU, with each core 2500MHz<br>
> > ><br>
> > > Memories: 16 *2 GB DDR2 333 MHz, dual channel, data width 64 bit, so the<br>
> > memory Bandwidth for 2 memories is 64/8*166*2*2=5.4GB/s.<br>
> ><br>
> > Wait a minute. You have 16 cores that share 5.4 GB/s???? This is not<br>
> > enough for iterative solvers, in fact this is absolutely terrible for<br>
> > iterative solvers. You really want 5.4 GB/s PER core! This machine is<br>
> > absolutely inappropriate for iterative solvers. No package can give you good<br>
> > speedups on this machine.<br>
><br>
> Barry, there are 16 memories, every 2 memories make up one dual channel,<br>
> thus in this machine there are 8 dual channel, each dual channel has the<br>
> memory bandwidth 5.4GB/s.<br>
<br>
</div>What hardware is this? [processor/chipset?]<br></blockquote><div><br>dmidecode reports the processor as:<br><br>Handle 0x0010, DMI type 4, 40 bytes<br>Processor Information<br>&nbsp;&nbsp;&nbsp;&nbsp;Socket Designation: CPU 4<br>
&nbsp;&nbsp;&nbsp;&nbsp;Type: Central Processor<br>&nbsp;&nbsp;&nbsp;&nbsp;Family: Quad-Core Opteron<br>&nbsp;&nbsp;&nbsp;&nbsp;Manufacturer: AMD<br>&nbsp;&nbsp;&nbsp;&nbsp;ID: 06 05 F6 40 74 03 E8 3D<br>&nbsp;&nbsp;&nbsp;&nbsp;Signature: Family 5, Model 0, Stepping 6<br>&nbsp;&nbsp;&nbsp;&nbsp;Flags:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DE (Debugging extension)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;TSC (Time stamp counter)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MSR (Model specific registers)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PAE (Physical address extension)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;CX8 (CMPXCHG8 instruction supported)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;APIC (On-chip APIC hardware supported)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;CLFSH (CLFLUSH instruction supported)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DS (Debug store)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACPI (ACPI supported)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MMX (MMX technology supported)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FXSR (Fast floating-point save and restore)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SSE2 (Streaming SIMD extensions 2)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SS (Self-snoop)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;HTT (Hyper-threading technology)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;TM (Thermal monitor supported)<br>
&nbsp;&nbsp;&nbsp;&nbsp;Version: Quad-Core AMD Opteron(tm) Processor 8360 SE<br>&nbsp;&nbsp;&nbsp;&nbsp;Voltage: 1.5 V<br>&nbsp;&nbsp;&nbsp;&nbsp;External Clock: 200 MHz<br>&nbsp;&nbsp;&nbsp;&nbsp;Max Speed: 4600 MHz<br>&nbsp;&nbsp;&nbsp;&nbsp;Current Speed: 2500 MHz<br>&nbsp;&nbsp;&nbsp;&nbsp;Status: Populated, Enabled<br>
&nbsp;&nbsp;&nbsp;&nbsp;Upgrade: Other<br>&nbsp;&nbsp;&nbsp;&nbsp;L1 Cache Handle: 0x0011<br>&nbsp;&nbsp;&nbsp;&nbsp;L2 Cache Handle: 0x0012<br>&nbsp;&nbsp;&nbsp;&nbsp;L3 Cache Handle: 0x0013<br>&nbsp;&nbsp;&nbsp;&nbsp;Serial Number: N/A<br>&nbsp;&nbsp;&nbsp;&nbsp;Asset Tag: N/A<br>&nbsp;&nbsp;&nbsp;&nbsp;Part Number: N/A<br>&nbsp;&nbsp;&nbsp;&nbsp;Core Count: 4<br>
&nbsp;&nbsp;&nbsp;&nbsp;Core Enabled: 4<br>&nbsp;&nbsp;&nbsp;&nbsp;Characteristics:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;64-bit capable<br><br clear="all"><br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
From what you say, it looks like each chip has 4 cores and 2<br>
dual-channel memory controllers for each of them.<br>
<br>
The question is - does the hardware provide scalable memory-bandwidth<br>
per core? Most machines don't.<br></blockquote><div><br>This point is not clear to me right now.<br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
I.e., the same 5.4*2 GB/s is available for a 1-core run as well as for a 4-core run.<br>
<br>
So if the algorithm is able to use 5.4 GB/s [or more] with 1 thread and<br>
10.8 GB/s [or more] with 2 threads, you would see scalable performance only<br>
from 1 to 2 cores; with 3 or 4 cores you would perhaps see a slight<br>
improvement over the 2-core performance.<br>
<font color="#888888"><br>
Satish<br>
</font></blockquote></div><br><br>
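<div>A rough Python sketch of the arithmetic discussed in this thread, offered only as an illustration. The helper names peak_bandwidth_gb_s and modeled_speedup are hypothetical. The figure of two dual-channel pairs per 4-core chip is Satish's reading of the setup, not confirmed hardware data, and the model assumes a purely bandwidth-bound kernel in which a single thread can already consume one pair's peak bandwidth. With the numbers exactly as quoted (64/8*166*2*2), the formula gives about 5.3 GB/s per pair, which the thread states as 5.4 GB/s.</div>

```python
# Sketch only: both helpers are hypothetical and exist purely for illustration.

def peak_bandwidth_gb_s(bus_bits=64, clock_mhz=166, transfers_per_clock=2, channels=2):
    """Theoretical peak in GB/s (1 GB = 1000 MB), following the formula quoted in the thread."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * channels / 1000.0


def modeled_speedup(threads, per_thread_demand_gb_s, available_gb_s):
    """Idealized speedup of a bandwidth-bound kernel: linear until the memory bus saturates."""
    saturation_point = available_gb_s / per_thread_demand_gb_s
    return min(float(threads), saturation_point)


pair = peak_bandwidth_gb_s()  # one dual-channel pair of DIMMs
chip = 2 * pair               # assumption: two dual-channel pairs per 4-core chip

for n in (1, 2, 3, 4):
    # Assumption: each thread can consume a full pair's bandwidth on its own.
    print(n, "threads: modeled speedup", modeled_speedup(n, pair, chip))
```

<div>Under these assumptions the modeled speedup grows from 1 to 2 threads and then stays flat for 3 and 4 threads, which is the idealized version of the behavior Satish describes. Real runs may show the slight further gain he mentions, since real kernels are rarely limited by bandwidth alone.</div>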