[mpich-discuss] Why is my quad core slower than cluster
Robert Kubrick
robertkubrick at gmail.com
Tue Jul 15 18:06:18 CDT 2008
A recent (long) discussion about numactl and taskset on the beowulf
mailing list:
http://www.beowulf.org/archive/2008-June/021810.html
On Jul 15, 2008, at 1:35 PM, chong tan wrote:
> Eric,
>
> I know you are referring to me as the one not sharing. I am no expert
> on MP, but someone who has done his homework. I would like to share,
> but the NDAs and company policy say no.
>
> You make good points and did some good experiments. That is what I
> would expect most MP designers and users to have done in the first place.
>
> The answers to the original question are simple:
>
> - On the 2Xquad, you have one memory system, while on the cluster you
> have 8 memory systems; the total bandwidth favors the cluster considerably.
>
> - On the cluster, there is no way for the process to be context
> switched, while that can happen on the 2Xquad. When that happens, life
> is bad.
>
> - The only things that favor the SMP box are the cost of communication
> and shared memory.
>
>
> There are more factors; the art is balancing them in your favor.
> In a way, the x86 quads are not designed to let us load them up with
> fat and heavy processes. That is what I have been saying all
> along: know your HW first. Your MP solution should come second.
> Whatever utilities you can find will help put the solution together.
>
>
> So, the problem is not MPI in this case.
>
>
> tan
>
>
>
> --- On Mon, 7/14/08, Eric A. Borisch <eborisch at ieee.org> wrote:
>
> From: Eric A. Borisch <eborisch at ieee.org>
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> To: mpich-discuss at mcs.anl.gov
> Date: Monday, July 14, 2008, 9:36 PM
>
> Gus,
>
> Information sharing is truly the point of the mailing list. Useful
> messages should ask questions or provide answers! :)
>
> Someone mentioned STREAM benchmarks (memory BW benchmarks) a little
> while back. I did these when our new system came in a while ago, so
> I dug them back out.
>
> This (STREAM) can be compiled to use MPI, but MPI is only used for
> synchronization; the benchmark is still a memory bus test
> (each task is trying to run through memory, but this is not an MPI
> communication test).
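>
> For anyone who hasn't looked at it, the heart of the benchmark is just
> simple sweeps over large arrays. A minimal sketch of the Triad kernel
> (my own toy version, not the actual STREAM source; the array size
> matches the run below but the scalar is arbitrary):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> #define N 20000000            /* 20 million doubles per array */
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>
>     double *a = malloc(N * sizeof(double));
>     double *b = malloc(N * sizeof(double));
>     double *c = malloc(N * sizeof(double));
>     double s = 3.0;
>     for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
>
>     MPI_Barrier(MPI_COMM_WORLD);      /* MPI only lines the tasks up */
>     double t0 = MPI_Wtime();
>     for (long i = 0; i < N; i++)      /* Triad: a = b + s*c */
>         a[i] = b[i] + s * c[i];
>     double t1 = MPI_Wtime();
>
>     /* three arrays of 8-byte doubles are touched per pass */
>     printf("Triad: %.1f MB/s (a[0]=%g)\n",
>            3.0 * N * sizeof(double) / (t1 - t0) / 1e6, a[0]);
>
>     free(a); free(b); free(c);
>     MPI_Finalize();
>     return 0;
> }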
>
> My results on a dual E5472 machine (two quad-core 3 GHz packages;
> 1600 MHz bus; 8 total cores)
>
> Results (each set is [1..8] processes in order), double-precision
> array size = 20,000,000, run through 10 times.
>
> Function Rate (MB/s) Avg time Min time Max time
> Copy: 2962.6937 0.1081 0.1080 0.1081
> Copy: 5685.3008 0.1126 0.1126 0.1128
> Copy: 5484.6846 0.1751 0.1750 0.1751
> Copy: 7085.7959 0.1809 0.1806 0.1817
> Copy: 5981.6033 0.2676 0.2675 0.2676
> Copy: 7071.2490 0.2718 0.2715 0.2722
> Copy: 6537.4934 0.3427 0.3426 0.3428
> Copy: 7423.4545 0.3451 0.3449 0.3455
>
> Scale: 3011.8445 0.1063 0.1062 0.1063
> Scale: 5675.8162 0.1128 0.1128 0.1129
> Scale: 5474.8854 0.1754 0.1753 0.1754
> Scale: 7068.6204 0.1814 0.1811 0.1819
> Scale: 5974.6112 0.2679 0.2678 0.2680
> Scale: 7063.8307 0.2721 0.2718 0.2725
> Scale: 6533.4473 0.3430 0.3429 0.3431
> Scale: 7418.6128 0.3453 0.3451 0.3456
>
> Add: 3184.3129 0.1508 0.1507 0.1508
> Add: 5892.1781 0.1631 0.1629 0.1633
> Add: 5588.0229 0.2577 0.2577 0.2578
> Add: 7275.0745 0.2642 0.2639 0.2646
> Add: 6175.7646 0.3887 0.3886 0.3889
> Add: 7262.7112 0.3970 0.3965 0.3976
> Add: 6687.7658 0.5025 0.5024 0.5026
> Add: 7599.2516 0.5057 0.5053 0.5062
>
> Triad: 3224.7856 0.1489 0.1488 0.1489
> Triad: 6021.2613 0.1596 0.1594 0.1598
> Triad: 5609.9260 0.2567 0.2567 0.2568
> Triad: 7293.2790 0.2637 0.2633 0.2641
> Triad: 6185.4376 0.3881 0.3880 0.3881
> Triad: 7279.1231 0.3958 0.3957 0.3961
> Triad: 6691.8560 0.5022 0.5021 0.5022
> Triad: 7604.1238 0.5052 0.5050 0.5057
>
> These work out to (~):
> 1x
> 1.9x
> 1.8x
> 2.3x
> 1.9x
> 2.2x
> 2.1x
> 2.4x
>
> for [1..8] cores.
>
> As you can see, it doesn't take eight cores to saturate the bus,
> even with a 1600 MHz bus. Running four of the eight cores does the
> trick.
>
> With all that said, there are still advantages to be had with the
> multicore chipsets, but only if you're not blowing full tilt
> through memory. If the problem allows it, do more work inside a loop
> rather than running multiple loops over the same memory.
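>
> As a rough illustration (my own toy example, not code from anyone's
> application; it assumes 'a' already holds N initialized doubles),
> fusing sweeps raises the arithmetic done per byte pulled over the bus:
>
> #define N 20000000L
>
> /* Memory-bound: three separate trips through the same array */
> void separate_sweeps(double *a, double s, double *sum)
> {
>     for (long i = 0; i < N; i++) a[i] += 1.0;
>     for (long i = 0; i < N; i++) a[i] *= s;
>     for (long i = 0; i < N; i++) *sum += a[i];
> }
>
> /* Fused: the same arithmetic in one trip through memory */
> void fused_sweep(double *a, double s, double *sum)
> {
>     for (long i = 0; i < N; i++) {
>         a[i] = (a[i] + 1.0) * s;
>         *sum += a[i];
>     }
> }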
>
> For reference, here's the performance the osu_mbw_mr test (from
> MVAPICH2 1.0.2; I also have a cluster running nearby :) gives when
> compiled against MPICH2 (1.0.7rc1 with nemesis), run with one/two/four
> pairs (2/4/8 processes) of producers/consumers:
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 1 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 1.08 1076540.83
> 2 2.14 1068102.24
> 4 3.99 997382.24
> 8 7.97 996419.66
> 16 15.95 996567.63
> 32 31.67 989660.29
> 64 62.73 980084.91
> 128 124.12 969676.18
> 256 243.59 951527.62
> 512 445.52 870159.34
> 1024 810.28 791284.80
> 2048 1357.25 662721.78
> 4096 1935.08 472431.28
> 8192 2454.29 299596.49
> 16384 2717.61 165869.84
> 32768 2900.23 88507.85
> 65536 2279.71 34785.63
> 131072 2540.51 19382.53
> 262144 1335.16 5093.21
> 524288 1364.05 2601.72
> 1048576 1378.39 1314.53
> 2097152 1380.78 658.41
> 4194304 1343.48 320.31
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 2 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 2.15 2150580.48
> 2 4.22 2109761.12
> 4 7.84 1960742.53
> 8 15.80 1974733.92
> 16 31.38 1961100.64
> 32 62.32 1947654.32
> 64 123.39 1928000.11
> 128 243.19 1899957.22
> 256 475.32 1856721.12
> 512 856.90 1673642.10
> 1024 1513.19 1477721.26
> 2048 2312.91 1129351.07
> 4096 2891.21 705861.12
> 8192 3267.49 398863.98
> 16384 3400.64 207558.54
> 32768 3519.74 107413.93
> 65536 3141.80 47940.04
> 131072 3368.65 25700.76
> 262144 2211.53 8436.31
> 524288 2264.90 4319.95
> 1048576 2282.69 2176.94
> 2097152 2250.72 1073.23
> 4194304 2087.00 497.58
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 4 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 3.65 3651934.64
> 2 8.16 4080341.34
> 4 15.66 3914908.02
> 8 31.32 3915621.85
> 16 62.67 3916764.51
> 32 124.37 3886426.18
> 64 246.38 3849640.84
> 128 486.39 3799914.44
> 256 942.40 3681232.25
> 512 1664.21 3250414.19
> 1024 2756.50 2691891.86
> 2048 3829.45 1869848.54
> 4096 4465.25 1090148.56
> 8192 4777.45 583184.51
> 16384 4822.75 294357.30
> 32768 4829.77 147392.80
> 65536 4556.93 69533.18
> 131072 4789.32 36539.60
> 262144 3631.68 13853.75
> 524288 3679.31 7017.72
> 1048576 3553.61 3388.99
> 2097152 3113.12 1484.45
> 4194304 2452.69 584.77
>
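> The pattern each pair follows is, roughly, a producer posting a window
> of non-blocking sends and a consumer posting the matching receives,
> with a short acknowledgement to pace the iterations. A minimal sketch
> in that spirit (this is not the OSU source; the window, message size,
> and iteration count here are arbitrary, and it assumes an even number
> of ranks):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <mpi.h>
>
> #define WINDOW 64
> #define MSG    8192
> #define ITERS  100
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>     char ack = 0;
>     MPI_Request req[WINDOW];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (size % 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* need whole pairs */
>
>     char *buf = malloc((size_t)WINDOW * MSG);    /* one slot per message */
>     memset(buf, rank, (size_t)WINDOW * MSG);
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     double t0 = MPI_Wtime();
>
>     for (int it = 0; it < ITERS; it++) {
>         if (rank % 2 == 0) {                     /* producer */
>             for (int w = 0; w < WINDOW; w++)
>                 MPI_Isend(buf + (size_t)w * MSG, MSG, MPI_CHAR,
>                           rank + 1, 0, MPI_COMM_WORLD, &req[w]);
>             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>             MPI_Recv(&ack, 1, MPI_CHAR, rank + 1, 1,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         } else {                                 /* consumer */
>             for (int w = 0; w < WINDOW; w++)
>                 MPI_Irecv(buf + (size_t)w * MSG, MSG, MPI_CHAR,
>                           rank - 1, 0, MPI_COMM_WORLD, &req[w]);
>             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>             MPI_Send(&ack, 1, MPI_CHAR, rank - 1, 1, MPI_COMM_WORLD);
>         }
>     }
>
>     double t1 = MPI_Wtime();
>     if (rank % 2 == 0)
>         printf("pair %d: %.1f MB/s\n", rank / 2,
>                (double)ITERS * WINDOW * MSG / (t1 - t0) / 1e6);
>
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }
>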
> So from a messaging standpoint, you can see that you squeeze more
> data through with more processes. I'd guess that this is because
> there's processing to be done within MPI to move the data, and a
> lot of the bookkeeping steps probably cache well (updating the same
> status structure on a communication multiple times, perhaps reusing
> the structure for subsequent transfers and finding it still in
> cache), so the performance scaling is not completely FSB bound.
>
> I'm sure there are plenty of additional things that could be done
> here to test different CPU-to-process layouts, etc., but in testing
> my own real-world code, I've found that, unfortunately, "it
> depends." I have some code that nearly scales linearly (multiple
> computationally expensive operations inside the innermost loop) and
> some that scales like the STREAM results above ("add one to the
> next 20 million points") ...
>
> As always, your mileage may vary. If your speedup looks like the
> STREAM numbers above, you're likely memory bound. Try to
> reformulate your problem to go through memory slower but with more
> done each pass, or invest in a cluster. At some point -- for some
> problems -- you can't beat more memory busses!
>
> Cheers,
> Eric Borisch
>
> --
> borisch.eric at mayo.edu
> MRI Research
> Mayo Clinic
>
> On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa <gus at ldeo.columbia.edu>
> wrote:
> Hello Sami and list
>
> Oh, well, as you see, an expert who claims to know the answers to
> these problems
> seems not to be willing to share these answers with less
> knowledgeable MPI users like us.
> So, maybe we can find the answers ourselves, not by individual
> "homework" brainstorming,
> but through community collaboration and generous information sharing,
> which is the hallmark of this mailing list.
>
> I Googled around today to find out how to assign MPI processes to
> specific processors,
> and I found some interesting information on how to do it.
>
> Below is a link to a posting from the computational fluid dynamics
> (CFD) community that may be of interest.
> Not surprisingly, they are struggling with the same type of
> problems all of us have,
> including how to tie MPI processes to specific processors:
>
> http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>
> I would summarize these problems as related to three types of
> bottleneck:
>
> 1) Multicore processor bottlenecks (standalone machines and clusters)
> 2) Network fabric bottlenecks (clusters)
> 3) File system bottlenecks (clusters)
>
> All three types of problems are due to contention for some type of
> system resource
> by the MPI processes that take part in a computation/program.
>
> Our focus on this thread, started by Zach, has been on problem 1),
> although most of us may need to look into problems 2) and 3) sooner
> or later.
> (I have all three of them already!)
>
> The CFD folks use MPI as we do.
> They seem to use another MPI flavor, but the same problems are there.
> The problems are not caused by MPI itself, but they become apparent
> when you run MPI programs.
> That has been my experience too.
>
> As for how to map the MPI processes to specific processors (or cores),
> the key command seems to be "taskset", as my googling afternoon
> showed.
> Try "man taskset" for more info.
>
> For a standalone machine like yours, something like the command
> line below should work to
> force execution on "processors" 0 and 2 (which in my case are two
> different physical CPUs):
>
> mpiexec -n 2 taskset -c 0,2 my_mpi_program
>
> You need to check on your computer ("more /proc/cpuinfo")
> what the exact "processor" numbers are that correspond to separate
> physical CPUs. Most likely they are the even-numbered processors
> only, or the odd-numbered only,
> since you have dual-core CPUs (integers modulo 2), with
> "processors" 0,1 being the two
> cores of the first physical CPU, "processors" 2,3 the cores of the
> second physical CPU, and so on.
> At least, this is what I see on my dual-core dual-processor machine.
> I would say for quad-cores the separate physical CPUs would be
> processors 0,4,8, etc,
> or 1,5,9, etc, and so on (integers modulo 4), with "processors"
> 0,1,2,3 being the four cores
> in the first physical CPU, and so on.
> In /proc/cpuinfo look for the keyword "processor".
> These are the numbers you need to use in "taskset -c".
> However, other helpful information comes in the keywords "physical
> id",
> "core id", "siblings", and "cpu cores".
> They will allow you to map cores and physical CPUs to
> the "processor" number.
>
> The "taskset" command line above worked on one of my standalone
> multicore machines,
> and I hope a variant of it will work on your machine also.
> It works with the "mpiexec" that comes with the MPICH distribution,
> and also with
> the "mpiexec" associated with the Torque/PBS batch system, which is
> nice for clusters as well.
>
> "Taskset" can change the default behavior of the Linux scheduler,
> which is to allow processes to
> be moved from one core/CPU to another during execution.
> The scheduler does this to ensure optimal CPU use (i.e. load balance).
> With taskset you can force execution to happen on the cores you
> specify on the command line,
> i.e. you can enforce the so-called "CPU affinity" you wish.
> Note that the "taskset" man page uses both the terms "CPU" and
> "processor", and doesn't use the term "core",
> which may be a bit confusing. Make no mistake, "processor" and
> "CPU" there stand for what we've been calling "core" here.
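>
> If you would rather set the affinity from inside the program instead
> of on the command line, the Linux sched_setaffinity() call does the
> same thing per process. A minimal sketch (the rank-to-"processor"
> mapping below is only an assumption for illustration; substitute
> whatever your /proc/cpuinfo layout calls for):
>
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     cpu_set_t mask;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* Assumed mapping: rank r -> "processor" 2*r (the even-numbered
>        entries in /proc/cpuinfo); adjust to your physical id / core id
>        layout. */
>     CPU_ZERO(&mask);
>     CPU_SET(2 * rank, &mask);
>     if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = self */
>         perror("sched_setaffinity");
>
>     /* ... the rest of the MPI program now runs pinned to that core ... */
>
>     MPI_Finalize();
>     return 0;
> }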
>
> Other postings that you may find useful on closely related topics are:
>
> http://www.ibm.com/developerworks/linux/library/l-scheduler/
> http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>
> I hope this helps,
>
> Still, we have a long way to go to sort out how much of the
> multicore bottleneck can
> be ascribed to lack of memory bandwidth, how much may
> perhaps be associated with how
> memcpy is compiled by different compilers,
> or whether there are other components of this problem that we don't
> see now.
>
> Maybe our community won't find a solution to Zach's problem: "Why
> is my quad core slower than cluster?"
> However, I hope that through collaboration, and by sharing
> information,
> we may be able to nail down the root of the problem,
> and perhaps to find ways to improve the alarmingly bad performance
> some of us have reported on multicore machines.
>
>
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>