[mpich-discuss] Why is my quad core slower than cluster

Gus Correa gus at ldeo.columbia.edu
Tue Jul 15 12:34:18 CDT 2008


Hello Eric and list

Thank you Eric Borisch for the very informative and clear response,
in the best spirit of this list!
Thank you Brian Dobbins also, for an equally rich and generous answer.

Even with the "it depends" caveat that Eric wrote,
it is still a relief to know that some applications perform
decently on multi-core, as long as they are not strictly memory-bound.

Just a (bit pessimistic/realistic) note on Eric's suggestion to "invest
in a cluster".
Brian is already running his tests on a quad-core dual-processor cluster.
That is exactly what we plan to do as well, to replace our old
single-core dual-processor cluster,
where most things speed up near-linearly!
However, clusters these days sell mostly with quad-core dual-processor 
nodes.
It seems to be a marketing (or industrial?) policy that starts with 
processor manufacturers.
Dual-cores may be extinct soon, and price-wise they don't "scale" as
well as quad-cores anyway.
So, the same issues that haunt multicore standalone machines also haunt 
quad-core cluster nodes.

Well, not as much as they did before I read your nice messages ...

Thank you,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Eric A. Borisch wrote:

> Gus,
>
> Information sharing is truly the point of the mailing list. Useful 
> messages should ask questions or provide answers! :)
>
> Someone mentioned STREAM benchmarks (memory BW benchmarks) a little 
> while back. I did these when our new system came in a while ago, so I 
> dug them back out.
>
> This (STREAM) can be compiled to use MPI, but MPI serves only to
> synchronize the tasks; the benchmark is still a memory bus test (each
> task streams through its own memory; it is not an MPI communication
> test).
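>
> For reference, the heart of STREAM's Triad kernel looks roughly like
> this (a sketch from memory, not the actual benchmark source; the
> array size matches the runs below):
>
>     /* STREAM Triad (sketch): three streams through memory per
>        iteration and almost no arithmetic, so it measures the
>        memory bus rather than the cores. */
>     #define N 20000000
>     static double a[N], b[N], c[N];
>
>     void triad(double scalar)
>     {
>         long j;
>         for (j = 0; j < N; j++)
>             a[j] = b[j] + scalar * c[j];
>     }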
>
> My results on a dual E5472 machine (two quad-core 3 GHz packages;
> 1600 MHz bus; 8 cores total):
>
> Results (each set lists [1..8] processes in order); double-precision
> array size = 20,000,000, run through 10 times.
>
> Function     Rate (MB/s)  Avg time   Min time  Max time
> Copy:       2962.6937      0.1081      0.1080      0.1081
> Copy:       5685.3008      0.1126      0.1126      0.1128
> Copy:       5484.6846      0.1751      0.1750      0.1751
> Copy:       7085.7959      0.1809      0.1806      0.1817
> Copy:       5981.6033      0.2676      0.2675      0.2676
> Copy:       7071.2490      0.2718      0.2715      0.2722
> Copy:       6537.4934      0.3427      0.3426      0.3428
> Copy:       7423.4545      0.3451      0.3449      0.3455
>
> Scale:      3011.8445      0.1063      0.1062      0.1063
> Scale:      5675.8162      0.1128      0.1128      0.1129
> Scale:      5474.8854      0.1754      0.1753      0.1754
> Scale:      7068.6204      0.1814      0.1811      0.1819
> Scale:      5974.6112      0.2679      0.2678      0.2680
> Scale:      7063.8307      0.2721      0.2718      0.2725
> Scale:      6533.4473      0.3430      0.3429      0.3431
> Scale:      7418.6128      0.3453      0.3451      0.3456
>
> Add:        3184.3129      0.1508      0.1507      0.1508
> Add:        5892.1781      0.1631      0.1629      0.1633
> Add:        5588.0229      0.2577      0.2577      0.2578
> Add:        7275.0745      0.2642      0.2639      0.2646
> Add:        6175.7646      0.3887      0.3886      0.3889
> Add:        7262.7112      0.3970      0.3965      0.3976
> Add:        6687.7658      0.5025      0.5024      0.5026
> Add:        7599.2516      0.5057      0.5053      0.5062
>
> Triad:      3224.7856      0.1489      0.1488      0.1489
> Triad:      6021.2613      0.1596      0.1594      0.1598
> Triad:      5609.9260      0.2567      0.2567      0.2568
> Triad:      7293.2790      0.2637      0.2633      0.2641
> Triad:      6185.4376      0.3881      0.3880      0.3881
> Triad:      7279.1231      0.3958      0.3957      0.3961
> Triad:      6691.8560      0.5022      0.5021      0.5022
> Triad:      7604.1238      0.5052      0.5050      0.5057
>
> These work out to roughly 1x, 1.9x, 1.8x, 2.3x, 1.9x, 2.2x, 2.1x, and
> 2.4x for [1..8] cores.
>
> As you can see, it doesn't take eight cores to saturate the bus, even
> at 1600 MHz. Running four of the eight cores already does the trick.
>
> With all that said, there are still advantages to be had with the
> multicore chipsets, but only if you're not blowing full tilt through
> memory. If it fits your problem, do more inside one loop rather than
> running multiple loops over the same memory, as sketched below.
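>
> Something like this is what I mean (illustrative only; the names and
> sizes are made up):
>
>     /* Two passes stream the array over the bus twice; the fused
>        loop does the same work with half the memory traffic. */
>     #define N 20000000
>     static double x[N];
>
>     void two_passes(void)
>     {
>         long i;
>         for (i = 0; i < N; i++) x[i] += 1.0;
>         for (i = 0; i < N; i++) x[i] *= 2.0;
>     }
>
>     void one_pass(void)        /* same result, one pass */
>     {
>         long i;
>         for (i = 0; i < N; i++) x[i] = (x[i] + 1.0) * 2.0;
>     }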
>
> For reference, here's the performance that the osu_mbw_mr test (from
> MVAPICH2 1.0.2; I also have a cluster running nearby :), compiled
> against MPICH2 1.0.7rc1 with nemesis, delivers with one/two/four
> pairs (2/4/8 processes) of producers/consumers:
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 1 ] [ window size: 64 ]
>
> #  Size    MB/sec    Messages/sec
>       1      1.08   1076540.83
>       2      2.14   1068102.24
>       4      3.99    997382.24
>       8      7.97    996419.66
>      16     15.95    996567.63
>      32     31.67    989660.29
>      64     62.73    980084.91
>     128    124.12    969676.18
>     256    243.59    951527.62
>     512    445.52    870159.34
>    1024    810.28    791284.80
>    2048   1357.25    662721.78
>    4096   1935.08    472431.28
>    8192   2454.29    299596.49
>   16384   2717.61    165869.84
>   32768   2900.23     88507.85
>   65536   2279.71     34785.63
>  131072   2540.51     19382.53
>  262144   1335.16      5093.21
>  524288   1364.05      2601.72
> 1048576   1378.39      1314.53
> 2097152   1380.78       658.41
> 4194304   1343.48       320.31
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 2 ] [ window size: 64 ]
>
> #  Size    MB/sec    Messages/sec
>       1      2.15   2150580.48
>       2      4.22   2109761.12
>       4      7.84   1960742.53
>       8     15.80   1974733.92
>      16     31.38   1961100.64
>      32     62.32   1947654.32
>      64    123.39   1928000.11
>     128    243.19   1899957.22
>     256    475.32   1856721.12
>     512    856.90   1673642.10
>    1024   1513.19   1477721.26
>    2048   2312.91   1129351.07
>    4096   2891.21    705861.12
>    8192   3267.49    398863.98
>   16384   3400.64    207558.54
>   32768   3519.74    107413.93
>   65536   3141.80     47940.04
>  131072   3368.65     25700.76
>  262144   2211.53      8436.31
>  524288   2264.90      4319.95
> 1048576   2282.69      2176.94
> 2097152   2250.72      1073.23
> 4194304   2087.00       497.58
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 4 ] [ window size: 64 ]
>
> #  Size    MB/sec    Messages/sec
>       1      3.65   3651934.64
>       2      8.16   4080341.34
>       4     15.66   3914908.02
>       8     31.32   3915621.85
>      16     62.67   3916764.51
>      32    124.37   3886426.18
>      64    246.38   3849640.84
>     128    486.39   3799914.44
>     256    942.40   3681232.25
>     512   1664.21   3250414.19
>    1024   2756.50   2691891.86
>    2048   3829.45   1869848.54
>    4096   4465.25   1090148.56
>    8192   4777.45    583184.51
>   16384   4822.75    294357.30
>   32768   4829.77    147392.80
>   65536   4556.93     69533.18
>  131072   4789.32     36539.60
>  262144   3631.68     13853.75
>  524288   3679.31      7017.72
> 1048576   3553.61      3388.99
> 2097152   3113.12      1484.45
> 4194304   2452.69       584.77
>
> So from a messaging standpoint, you can see that you squeeze more data
> through with more processes. I'd guess that this is because there's
> processing to be done within MPI to move the data, and a lot of the
> bookkeeping steps probably cache well (updating the same status
> structure on a communication multiple times, or reusing the structure
> for subsequent transfers and finding it still in cache), so the
> performance scaling is not completely FSB-bound.
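>
> For context, the message-rate side of a test like osu_mbw_mr keeps a
> window of sends in flight before waiting, roughly as below (a sketch,
> not the actual OSU source; WINDOW matches the "window size: 64"
> reported above):
>
>     #include <mpi.h>
>
>     #define WINDOW 64
>
>     /* Sender side of one pair: post WINDOW nonblocking sends, then
>        wait for all of them, so the rate isn't latency-limited.  The
>        receiver posts matching MPI_Irecv calls (omitted here). */
>     double send_window(char *buf, int nbytes, int peer, int reps)
>     {
>         MPI_Request req[WINDOW];
>         double t0 = MPI_Wtime();
>         int r, w;
>
>         for (r = 0; r < reps; r++) {
>             for (w = 0; w < WINDOW; w++)
>                 MPI_Isend(buf, nbytes, MPI_CHAR, peer, 100,
>                           MPI_COMM_WORLD, &req[w]);
>             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>         }
>         return MPI_Wtime() - t0;   /* caller derives MB/s, msgs/s */
>     }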
>
> I'm sure there are plenty of additional things that could be done here
> to test different CPU-to-process layouts, etc., but in testing my own
> real-world code, I've found that, unfortunately, "it depends." I have
> some code that scales nearly linearly (multiple computationally
> expensive operations inside the innermost loop) and some that scales
> like the STREAM results above ("add one to the next 20 million
> points") ...
>
> As always, your mileage may vary. If your speedup looks like the
> STREAM numbers above, you're likely memory-bound. Try to reformulate
> your problem to go through memory more slowly but with more done on
> each pass, or invest in a cluster. At some point -- for some problems
> -- you can't beat more memory buses!
>
> Cheers,
>  Eric Borisch
>
> --
>  borisch.eric at mayo.edu
>  MRI Research
>  Mayo Clinic
>
> On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     Hello Sami and list
>
>     Oh, well, as you see, an expert who claims to know the answers to
>     these problems
>     seems unwilling to share them with less
>     knowledgeable MPI users like us.
>     So, maybe we can find the answers ourselves, not by individual
>     "homework" brainstorming,
>     but through community collaboration and generous information sharing,
>     which is the hallmark of this mailing list.
>
>     I Googled around today to find out how to assign MPI processes to
>     specific processors,
>     and I found some interesting information on how to do it.
>
>     Below is a link to a posting from the computational fluid dynamics
>     (CFD) community that may be of interest.
>     Not surprisingly, they are struggling with the same type of
>     problems all of us have,
>     including how to tie MPI processes to specific processors:
>
>     http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>
>     I would summarize these problems as related to three types of
>     bottleneck:
>
>     1) Multicore processor bottlenecks (standalone machines and clusters)
>     2) Network fabric bottlenecks (clusters)
>     3) File system bottlenecks (clusters)
>
>     All three types of problems are due to contention for some type of
>     system resource
>     by the MPI processes that take part in a computation/program.
>
>     Our focus in this thread, started by Zach, has been on problem 1),
>     although most of us may need to look into problems 2) and 3)
>     sooner or later.
>     (I already have all three of them!)
>
>     The CFD folks use MPI as we do.
>     They seem to use another MPI flavor, but the same problems are there.
>     The problems are not caused by MPI itself, but they become
>     apparent when you run MPI programs.
>     That has been my experience too.
>
>     As for how to map the MPI processes to specific processors (or
>     cores), the key command seems to be "taskset", as my afternoon of
>     googling showed.
>     Try "man taskset" for more info.
>
>     For a standalone machine like yours, something like the command
>     line below should work to
>     force execution on "processors" 0 and 2 (which in my case are two
>     different physical CPUs):
>
>     mpiexec -n 2 taskset -c 0,2  my_mpi_program
>
>     You need to check on your computer ("more /proc/cpuinfo")
>     which "processor" numbers correspond to separate
>     physical CPUs. Most likely they are the even-numbered processors
>     only, or the odd-numbered only,
>     since you have dual-core CPUs (integers modulo 2), with
>     "processors" 0,1 being the two
>     cores of the first physical CPU, "processors" 2,3 the cores of the
>     second physical CPU, and so on.
>     At least, this is what I see on my dual-core dual-processor machine.
>     I would say for quad-cores the separate physical CPUs would be
>     processors 0,4,8, etc.,
>     or 1,5,9, etc. (integers modulo 4), with "processors"
>     0,1,2,3 being the four cores
>     of the first physical CPU, and so on.
>     In /proc/cpuinfo look for the keyword "processor".
>     These are the numbers you need to use in "taskset -c".
>     However, other helpful information comes in the keywords "physical
>     id",
>     "core id", "siblings", and "cpu cores".
>     They will allow you to map cores and physical CPUs to
>     the "processor" number.
>
>     The "taskset"  command line above worked in one of my standalone
>     multicore machines,
>     and I hope a variant of it will work on your machine also.
>     It works with the "mpiexec" that comes with the MPICH
>     distribution, and also with
>     the "mpiexec" associated to the Torque/PBS batch system, which is
>     nice for clusters as well.
>
>     "Taskset" can change the default behavior of the Linux scheduler,
>     which is to allow processes to
>     be moved from one core/CPU to another during execution.
>     The scheduler does this to ensure optimal CPU use (i.e. load balance).
>     With taskset you can force execution to happen on the cores you
>     specify on the command line,
>     i.e. you can force the so called "CPU affinity" you wish.
>     Note that the "taskset" man page uses both the terms "CPU" and
>     "processor", and doesn't use the term "core",
>     which may be  a bit confusing. Make no mistake, "processor" and
>     "CPU" there stand for what we've been calling "core" here.
>
>     Other postings that you may find useful on closely related topics are:
>
>     http://www.ibm.com/developerworks/linux/library/l-scheduler/
>     http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>
>     I hope this helps,
>
>     Still, we have a long way to go to sort out how much of the
>     multicore bottleneck can
>     be ascribed to lack of memory bandwidth, how much may
>     perhaps be associated with how
>     memcpy is compiled by different compilers,
>     and whether there are other components of this problem that we
>     don't see now.
>
>     Maybe our community won't find a solution to Zach's problem: "Why
>     is my quad core slower than cluster?"
>     However, I hope that through collaboration, and by sharing
>     information,
>     we may be able to nail down the root of the problem,
>     and perhaps to find ways to improve the alarmingly bad performance
>     some of us have reported on multicore machines.
>
>
>     Gus Correa
>
>     -- 
>     ---------------------------------------------------------------------
>     Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>     Lamont-Doherty Earth Observatory - Columbia University
>     P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>     ---------------------------------------------------------------------
>
>
>


Brian Dobbins wrote:

> Hi everyone,
>
>   I'll echo the sentiments expressed by Tan and a few others that the
> culprit here, at least for Gaetano's code, is probably the memory
> bandwidth.  The FDTD applications I've seen tend to be
> bandwidth-hungry, and the current Intel quad-cores are not very good
> at this, especially those with the 1333 MHz FSB such as the E5345.
> Some of the newer models support 1600 MHz FSB speeds and tend to
> deliver better results.  For example, using the SPEC fp_rate
> benchmarks (specifically that of GemsFDTD) as a really rough
> approximation to parallel performance, an E5450 at 3.0 GHz scores 29.0
> vs. 36.7 for the E5472 processor.  Same chip, but faster memory,
> resulting in 26.5% better performance overall.  [Note: I used the
> Supermicro results for this, but you can find similar results from any
> vendor, I imagine.]
>
>   This isn't limited to FDTD codes, either... CFD-like applications
> are notoriously bandwidth-hungry, and using SPEC scores again, WRF
> gives us 51.9 vs. 70.2 for the same configurations as above.  Again,
> these are the same processors, same compilers, and same tests, done by
> guys who definitely like to eke the utmost performance out of their
> rigs, and this shows that simply adding faster memory improves things
> measurably... 35% in this case.  Since fp_rate isn't really the same
> as parallel performance, though, let's switch to some first-hand
> measurements - I was recently running some WRF models on a system here
> using the E5440 (2.83 GHz) processors, and here are the results:
>
>   Running on 128 cores as 16 nodes x 8 cores per node:  22.8 seconds / step
>              128 cores as 32 nodes x 4 cores per node:  11.9 seconds / step
>              128 cores as 64 nodes x 2 cores per node:  10.4 seconds / step
>
>   ... As you can see, the 'best' performance comes from using only 2
> cores per node!  In fact, I was able to run on 16 nodes and 4 cores
> per node, for a total of 64 cores, and it was only 2% slower than
> running on 128 cores (as 16 x 8).  I didn't fiddle with task
> placement, since MPI and the OS are /generally/ pretty smart about such
> things, and the data points towards memory bandwidth being a key issue
> anyway.  Hopefully having some of these 'hard numbers' can ease your
> burden so you don't go crazy trying to find a reason in MPICH, the OS,
> etc.  ;)
>
>   Put another way, if you're concerned that your quad-core box isn't
> working properly via MPI, you can write up a code (or download one)
> that does something that is not limited by bandwidth - generating
> random numbers, for example - and run it; you /should/ see a fully
> linear speedup (within the constraints of Amdahl's Law).  A minimal
> sketch of such a test follows the list below.  So the only
> recommendations I'd make are:
>
>   1) Use an up-to-date MPI implementation (such as MPICH2 1.0.7) and 
> OS since they'll probably be 'smarter' about task placement than older 
> versions
>   2) Try using the Intel compilers if you haven't done so already 
> since they tend to be superior to gfortran (and many times gcc as well)
>   3) If you're buying hardware soon, look at the (more expensive) 1600
> MHz FSB boards / chips from Intel.
>
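>   Here is the kind of bandwidth-free test I have in mind - a sketch
> under my own assumptions (the LCG constants and the iteration count
> are arbitrary choices; this is not from any particular package):
>
>     /* Compute-bound MPI test: each rank spins a 64-bit LCG in
>        registers, touching essentially no memory, so the speedup
>        across ranks should be near-linear on a healthy box. */
>     #include <stdio.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         unsigned long long x = 88172645463325252ULL, sum = 0;
>         long i, iters = 200000000L;
>         double t0, t1;
>         int rank;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         t0 = MPI_Wtime();
>         for (i = 0; i < iters; i++) {
>             x = x * 6364136223846793005ULL + 1442695040888963407ULL;
>             sum += x >> 33;          /* keep the compiler honest */
>         }
>         t1 = MPI_Wtime();
>
>         printf("rank %d: %.2f s (checksum %llu)\n", rank, t1 - t0, sum);
>         MPI_Finalize();
>         return 0;
>     }
>
>   Run it with -n 1, 2, 4, 8; the per-rank times should stay roughly
> flat if the cores themselves are healthy.
>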
>   Hope that's useful! 
>
>   Cheers,
>   - Brian
>
>
> Brian Dobbins
> Yale Engineering HPC



