[mpich-discuss] Why is my quad core slower than cluster

Gib Bogle g.bogle at auckland.ac.nz
Wed Jul 16 20:31:18 CDT 2008


And are we to believe that your employer's NDA doesn't permit you to say WHICH answer is wrong?  Whew!

Gib

chong tan wrote:
> Just FYI,
> 
> from my knowledge, at least one answer to the question in that thread is 
> absolutely wrong, according to the HW information I have on hand.  Some of 
> the info in that thread is not applicable across the board, and the original 
> question, about a threaded application, is not answered.
> 
>  
> 
> whether to use numactl on a NUMA system is situation dependent.  In 
> general, numactl is bad if you oversubscribe the system.
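> 
> As an illustration only (a rough sketch, assuming libnuma is installed and 
> one MPI rank runs per NUMA node), the code below does roughly the kind of 
> binding that "numactl --cpunodebind=<node> --localalloc" applies from the 
> command line; if you then oversubscribe the machine, several ranks end up 
> pinned to the same node and fight over it, which is where the trouble starts:
> 
>     /* pin_local.c - illustrative sketch; build with: mpicc pin_local.c -lnuma */
>     #include <mpi.h>
>     #include <numa.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (numa_available() < 0) {
>             if (rank == 0) printf("no NUMA support on this system\n");
>         } else {
>             int nodes = numa_max_node() + 1;
>             numa_run_on_node(rank % nodes);  /* run this rank on one NUMA node */
>             numa_set_localalloc();           /* allocate its memory on that node */
>             printf("rank %d bound to NUMA node %d of %d\n",
>                    rank, rank % nodes, nodes);
>         }
>         MPI_Finalize();
>         return 0;
>     }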
> 
>  
> 
> tan
> 
> 
> 
> --- On Tue, 7/15/08, Robert Kubrick <robertkubrick at gmail.com> wrote:
> 
>     From: Robert Kubrick <robertkubrick at gmail.com>
>     Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>     To: mpich-discuss at mcs.anl.gov
>     Date: Tuesday, July 15, 2008, 4:06 PM
> 
>     A recent (long) discussion about numactl and taskset on the beowulf
>     mailing list:
>     http://www.beowulf.org/archive/2008-June/021810.html
> 
>     On Jul 15, 2008, at 1:35 PM, chong tan wrote:
> 
>>     Eric,
>>
>>     I know you are referring to me as the one not sharing.  I am no
>>     expert on MP, but someone who has done his homework.  I would like
>>     to share, but the NDAs and company policy say no.
>>
>>     You have good points and did some good experiments.  That is what
>>     I expect most MP designers and users to have done in the first place.
>>
>>     The answers to the original question are simple:
>>
>>     - On the 2x quad-core box, you have one memory system, while on the
>>     cluster you have 8 memory systems; the total bandwidth favors the
>>     cluster considerably.
>>
>>     - On the cluster, there is no way for a process to be context
>>     switched away, while that can happen on the 2x quad-core.  When that
>>     happens, life is bad.
>>
>>     - The only things that favor the SMP box are the lower cost of
>>     communication and shared memory.
>>
>>      
>>
>>     There are more factors; the art is balancing them in your favor.
>>     In a way, the x86 quad-cores are not designed to let us load them up
>>     with fat and heavy processes.  That is what I have been saying all
>>     along: know your HW first.  Your MP solution should come second.
>>     Whatever utilities you can find will help put the solution together.
>>
>>      
>>
>>     So, the problem is not MPI in this case.
>>
>>      
>>
>>     tan
>>
>>
>>
>>     --- On Mon, 7/14/08, Eric A. Borisch <eborisch at ieee.org> wrote:
>>
>>         From: Eric A. Borisch <eborisch at ieee.org>
>>         Subject: Re: [mpich-discuss] Why is my quad core slower than
>>         cluster
>>         To: mpich-discuss at mcs.anl.gov
>>         Date: Monday, July 14, 2008, 9:36 PM
>>
>>         Gus,
>>
>>         Information sharing is truly the point of the mailing list.
>>         Useful messages should ask questions or provide answers! :)
>>
>>         Someone mentioned STREAM benchmarks (memory BW benchmarks) a
>>         little while back. I did these when our new system came in a
>>         while ago, so I dug them back out.
>>
>>         This (STREAM) can be compiled to use MPI, but MPI is used only
>>         as a synchronization tool; the benchmark is still a memory bus
>>         test (each task is trying to run through memory; it is not an
>>         MPI communication test).
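>>
>>         For context, each STREAM test is just a simple loop over large
>>         arrays. A rough sketch of the Triad kernel (array names are
>>         illustrative, not the exact STREAM source) looks like this:
>>
>>             #define N 20000000               /* 20 million doubles, as above */
>>             static double a[N], b[N], c[N];
>>
>>             void triad(double scalar)
>>             {
>>                 /* one multiply and one add per element; nearly all the
>>                    time goes into streaming three arrays through the bus */
>>                 for (long j = 0; j < N; j++)
>>                     a[j] = b[j] + scalar * c[j];
>>             }
>>
>>         Each iteration does two flops but moves three doubles, so the
>>         loop hits the memory bandwidth limit long before it hits the
>>         cores' arithmetic limit.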
>>
>>         My results on a dual E5472 machine (two quad-core 3 GHz
>>         packages; 1600 MHz bus; 8 cores total):
>>
>>         Results (each set lists the [1..8] process counts in order);
>>         double-precision array size = 20,000,000, run through 10 times.
>>
>>         Function     Rate (MB/s)  Avg time   Min time  Max time
>>         Copy:       2962.6937      0.1081      0.1080      0.1081
>>         Copy:       5685.3008      0.1126      0.1126      0.1128
>>         Copy:       5484.6846      0.1751      0.1750      0.1751
>>         Copy:       7085.7959      0.1809      0.1806      0.1817
>>         Copy:       5981.6033      0.2676      0.2675      0.2676
>>         Copy:       7071.2490      0.2718      0.2715      0.2722
>>         Copy:       6537.4934      0.3427      0.3426      0.3428
>>         Copy:       7423.4545      0.3451      0.3449      0.3455
>>
>>         Scale:      3011.8445      0.1063      0.1062      0.1063
>>         Scale:      5675.8162      0.1128      0.1128      0.1129
>>         Scale:      5474.8854      0.1754      0.1753      0.1754
>>         Scale:      7068.6204      0.1814      0.1811      0.1819
>>         Scale:      5974.6112      0.2679      0.2678      0.2680
>>         Scale:      7063.8307      0.2721      0.2718      0.2725
>>         Scale:      6533.4473      0.3430      0.3429      0.3431
>>         Scale:      7418.6128      0.3453      0.3451      0.3456
>>
>>         Add:        3184.3129      0.1508      0.1507      0.1508
>>         Add:        5892.1781      0.1631      0.1629      0.1633
>>         Add:        5588.0229      0.2577      0.2577      0.2578
>>         Add:        7275.0745      0.2642      0.2639      0.2646
>>         Add:        6175.7646      0.3887      0.3886      0.3889
>>         Add:        7262.7112      0.3970      0.3965      0.3976
>>         Add:        6687.7658      0.5025      0.5024      0.5026
>>         Add:        7599.2516      0.5057      0.5053      0.5062
>>
>>         Triad:      3224.7856      0.1489      0.1488      0.1489
>>         Triad:      6021.2613      0.1596      0.1594      0.1598
>>         Triad:      5609.9260      0.2567      0.2567      0.2568
>>         Triad:      7293.2790      0.2637      0.2633      0.2641
>>         Triad:      6185.4376      0.3881      0.3880      0.3881
>>         Triad:      7279.1231      0.3958      0.3957      0.3961
>>         Triad:      6691.8560      0.5022      0.5021      0.5022
>>         Triad:      7604.1238      0.5052      0.5050      0.5057
>>
>>         These work out to (~):
>>         1x
>>         1.9x
>>         1.8x
>>         2.3x
>>         1.9x
>>         2.2x
>>         2.1x
>>         2.4x
>>          
>>         for [1..8] cores.
>>
>>         As you can see, it doesn't take eight cores to saturate the
>>         bus, even with a 1600 MHz bus. Four of the eight cores running
>>         do the trick.
>>
>>         With all that said, there are still advantages to be had with
>>         the multicore chipsets, but only if you're not blowing full
>>         tilt through memory. If it fits your problem, do more inside
>>         one loop rather than running multiple loops over the same
>>         memory, as in the sketch below.
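>>
>>         As an illustration (function and variable names are made up,
>>         not from any real code), the second version below does the
>>         same work as the first but streams the array through memory
>>         only once:
>>
>>             /* Memory-bound: three separate passes over x[]. */
>>             void three_passes(double *x, long n)
>>             {
>>                 for (long i = 0; i < n; i++) x[i] += 1.0;
>>                 for (long i = 0; i < n; i++) x[i] *= 2.0;
>>                 for (long i = 0; i < n; i++) x[i] = x[i] * x[i];
>>             }
>>
>>             /* Same result in one pass: more work per byte moved. */
>>             void one_pass(double *x, long n)
>>             {
>>                 for (long i = 0; i < n; i++) {
>>                     double v = (x[i] + 1.0) * 2.0;
>>                     x[i] = v * v;
>>                 }
>>             }
>>
>>         Fusing the passes raises the arithmetic done per byte of memory
>>         traffic, which is exactly what helps once the bus is saturated.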
>>
>>         For reference, here's what the osu_mbw_mr test (from
>>         MVAPICH2 1.0.2; I also have a cluster running nearby :)
>>         compiled against MPICH2 1.0.7rc1 with nemesis provides: the
>>         performance for one/two/four pairs (2/4/8 processes) of
>>         producers/consumers:
>>
>>         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>>         # [ pairs: 1 ] [ window size: 64 ]
>>
>>         #  Size    MB/sec    Messages/sec
>>               1      1.08   1076540.83
>>               2      2.14   1068102.24
>>               4      3.99    997382.24
>>               8      7.97    996419.66
>>              16     15.95    996567.63
>>              32     31.67    989660.29
>>              64     62.73    980084.91
>>             128    124.12    969676.18
>>             256    243.59    951527.62
>>             512    445.52    870159.34
>>            1024    810.28    791284.80
>>            2048   1357.25    662721.78
>>            4096   1935.08    472431.28
>>            8192   2454.29    299596.49
>>           16384   2717.61    165869.84
>>           32768   2900.23     88507.85
>>           65536   2279.71     34785.63
>>          131072   2540.51     19382.53
>>          262144   1335.16      5093.21
>>          524288   1364.05      2601.72
>>         1048576   1378.39      1314.53
>>         2097152   1380.78       658.41
>>         4194304   1343.48       320.31
>>
>>         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>>         # [ pairs: 2 ] [ window size: 64 ]
>>
>>         #  Size    MB/sec    Messages/sec
>>               1      2.15   2150580.48
>>               2      4.22   2109761.12
>>               4      7.84   1960742.53
>>               8     15.80   1974733.92
>>              16     31.38   1961100.64
>>              32     62.32   1947654.32
>>              64    123.39   1928000.11
>>             128    243.19   1899957.22
>>             256    475.32   1856721.12
>>             512    856.90   1673642.10
>>            1024   1513.19   1477721.26
>>            2048   2312.91   1129351.07
>>            4096   2891.21    705861.12
>>            8192   3267.49    398863.98
>>           16384   3400.64    207558.54
>>           32768   3519.74    107413.93
>>           65536   3141.80     47940.04
>>          131072   3368.65     25700.76
>>          262144   2211.53      8436.31
>>          524288   2264.90      4319.95
>>         1048576   2282.69      2176.94
>>         2097152   2250.72      1073.23
>>         4194304   2087.00       497.58
>>
>>         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>>         # [ pairs: 4 ] [ window size: 64 ]
>>
>>         #  Size    MB/sec    Messages/sec
>>               1      3.65   3651934.64
>>               2      8.16   4080341.34
>>               4     15.66   3914908.02
>>               8     31.32   3915621.85
>>              16     62.67   3916764.51
>>              32    124.37   3886426.18
>>              64    246.38   3849640.84
>>             128    486.39   3799914.44
>>             256    942.40   3681232.25
>>             512   1664.21   3250414.19
>>            1024   2756.50   2691891.86
>>            2048   3829.45   1869848.54
>>            4096   4465.25   1090148.56
>>            8192   4777.45    583184.51
>>           16384   4822.75    294357.30
>>           32768   4829.77    147392.80
>>           65536   4556.93     69533.18
>>          131072   4789.32     36539.60
>>          262144   3631.68     13853.75
>>          524288   3679.31      7017.72
>>         1048576   3553.61      3388.99
>>         2097152   3113.12      1484.45
>>         4194304   2452.69       584.77
>>
>>         So from a messaging standpoint, you can see that you squeeze
>>         more data through with more processes. I'd guess that this is
>>         because there's processing to be done within MPI to move the
>>         data, and a lot of the bookkeeping steps probably cache well
>>         (updating the same status structure on a communication
>>         multiple times, perhaps reusing the structure for subsequent
>>         transfers and finding it still in cache), so the performance
>>         scaling is not completely FSB bound.
>>
>>         I'm sure there are plenty of additional things that could be
>>         done here to test different CPU-to-process layouts, etc., but
>>         in testing my own real-world code, I've found that,
>>         unfortunately, "it depends." I have some code that scales
>>         nearly linearly (multiple computationally expensive operations
>>         inside the innermost loop) and some that scales like the
>>         STREAM results above ("add one to the next 20 million points") ...
>>
>>         As always, your mileage may vary. If your speedup looks like
>>         the STREAM numbers above, you're likely memory bound. Try to
>>         reformulate your problem to go through memory slower but with
>>         more done each pass, or invest in a cluster. At some point --
>>         for some problems -- you can't beat more memory busses!
>>
>>         Cheers,
>>          Eric Borisch
>>
>>         --
>>          borisch.eric at mayo.edu
>>          MRI Research
>>          Mayo Clinic
>>
>>         On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa
>>         <gus at ldeo.columbia.edu> wrote:
>>
>>             Hello Sami and list
>>
>>             Oh, well, as you see, an expert who claims to know the
>>             answers to these problems
>>             seems not to be willing to share these answers with less
>>             knowledgeable MPI users like us.
>>             So, maybe we can find the answers ourselves, not by
>>             individual "homework" brainstorming,
>>             but through community collaboration and generous
>>             information sharing,
>>             which is the hallmark of this mailing list.
>>
>>             I Googled around today to find out how to assign MPI
>>             processes to specific processors,
>>             and I found some interesting information on how to do it.
>>
>>             Below is a link to a posting from the computational fluid
>>             dynamics (CFD) community that may be of interest.
>>             Not surprisingly, they are struggling with the same type
>>             of problems all of us have,
>>             including how to tie MPI processes to specific processors:
>>
>>             http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>>
>>             I would summarize these problems as related to three types
>>             of bottleneck:
>>
>>             1) Multicore processor bottlenecks (standalone machines
>>             and clusters)
>>             2) Network fabric bottlenecks (clusters)
>>             3) File system bottlenecks (clusters)
>>
>>             All three types of problems are due to contention for some
>>             type of system resource
>>             by the MPI processes that take part in a computation/program.
>>
>>             Our focus in this thread, started by Zach, has been on
>>             problem 1),
>>             although most of us may need to look into problems 2) and
>>             3) sooner or later.
>>             (I have all three of them already!)
>>
>>             The CFD folks use MPI as we do.
>>             They seem to use another MPI flavor, but the same problems
>>             are there.
>>             The problems are not caused by MPI itself, but they become
>>             apparent when you run MPI programs.
>>             That has been my experience too.
>>
>>             As for how to map the MPI processes to specific processors
>>             (or cores),
>>             the key command seems to be "taskset", as my afternoon of
>>             googling showed.
>>             Try "man taskset" for more info.
>>
>>             For a standalone machine like yours, something like the
>>             command line below should work to
>>             force execution on "processors" 0 and 2 (which in my case
>>             are two different physical CPUs):
>>
>>             mpiexec -n 2 taskset -c 0,2  my_mpi_program
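>>
>>             To double-check that the binding took effect, a small MPI
>>             test of my own sketching (not a standard tool) can have
>>             each rank report which "processor" it is currently running
>>             on, using the GNU sched_getcpu() call:
>>
>>                 /* whereami.c - compile with mpicc and run it under the
>>                    taskset command line above */
>>                 #define _GNU_SOURCE
>>                 #include <sched.h>
>>                 #include <stdio.h>
>>                 #include <mpi.h>
>>
>>                 int main(int argc, char **argv)
>>                 {
>>                     int rank;
>>                     MPI_Init(&argc, &argv);
>>                     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>                     printf("rank %d is running on processor %d\n",
>>                            rank, sched_getcpu());
>>                     MPI_Finalize();
>>                     return 0;
>>                 }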
>>
>>             You need to check on your computer ("more /proc/cpuinfo")
>>             which "processor" numbers correspond to separate physical
>>             CPUs. Most likely they are the even numbered processors
>>             only, or the odd numbered only,
>>             since you have dual-core CPUs (integers modulo 2), with
>>             "processors" 0,1 being the two
>>             cores of the first physical CPU, "processors" 2,3 the
>>             cores of the second physical CPU, and so on.
>>             At least, this is what I see on my dual-core
>>             dual-processor machine.
>>             I would say for quad-cores the separate physical CPUs
>>             would be processors 0,4,8, etc.,
>>             or 1,5,9, etc., and so on (integers modulo 4), with
>>             "processors" 0,1,2,3 being the four cores
>>             of the first physical CPU, and so on.
>>             In /proc/cpuinfo look for the keyword "processor".
>>             These are the numbers you need to use in "taskset -c".
>>             However, other helpful information comes in the keywords
>>             "physical id", "core id", "siblings", and "cpu cores".
>>             They will allow you to map cores and physical CPUs to
>>             the "processor" numbers, as sketched below.
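>>
>>             For example, a rough C sketch along these lines (nothing
>>             official, just an illustration) prints those fields side
>>             by side so you can build that map:
>>
>>                 /* cpumap.c - print processor / physical id / core id
>>                    triplets from /proc/cpuinfo (Linux only) */
>>                 #include <stdio.h>
>>
>>                 int main(void)
>>                 {
>>                     char line[256];
>>                     int proc = -1, phys = -1, core = -1;
>>                     FILE *f = fopen("/proc/cpuinfo", "r");
>>                     if (!f) { perror("/proc/cpuinfo"); return 1; }
>>                     while (fgets(line, sizeof line, f)) {
>>                         sscanf(line, "processor : %d", &proc);
>>                         sscanf(line, "physical id : %d", &phys);
>>                         if (sscanf(line, "core id : %d", &core) == 1)
>>                             printf("processor %d -> physical id %d, core id %d\n",
>>                                    proc, phys, core);
>>                     }
>>                     fclose(f);
>>                     return 0;
>>                 }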
>>
>>             The "taskset"  command line above worked in one of my
>>             standalone multicore machines,
>>             and I hope a variant of it will work on your machine also.
>>             It works with the "mpiexec" that comes with the MPICH
>>             distribution, and also with
>>             the "mpiexec" associated with the Torque/PBS batch system,
>>             which is nice for clusters as well.
>>
>>             "Taskset" can change the default behavior of the Linux
>>             scheduler, which is to allow processes to
>>             be moved from one core/CPU to another during execution.
>>             The scheduler does this to ensure optimal CPU use (i.e.
>>             load balance).
>>             With taskset you can force execution to happen on the
>>             cores you specify on the command line,
>>             i.e. you can force the so-called "CPU affinity" you wish.
>>             Note that the "taskset" man page uses both the terms "CPU"
>>             and "processor", and doesn't use the term "core",
>>             which may be a bit confusing. Make no mistake,
>>             "processor" and "CPU" there stand for what we've been
>>             calling "core" here.
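>>
>>             Under the hood, taskset uses the Linux sched_setaffinity()
>>             system call, and a program can also pin itself directly.
>>             A rough sketch (the choice of "processor" 2 is just an
>>             example) looks like:
>>
>>                 /* pinself.c - restrict the calling process to
>>                    "processor" 2, roughly what "taskset -c 2" does */
>>                 #define _GNU_SOURCE
>>                 #include <sched.h>
>>                 #include <stdio.h>
>>
>>                 int main(void)
>>                 {
>>                     cpu_set_t mask;
>>                     CPU_ZERO(&mask);
>>                     CPU_SET(2, &mask);            /* allow core 2 only */
>>                     if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
>>                         perror("sched_setaffinity");
>>                         return 1;
>>                     }
>>                     printf("now restricted to processor 2\n");
>>                     return 0;
>>                 }
>>
>>             Running "taskset -p <pid>" on the process afterwards should
>>             show the restricted affinity mask.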
>>
>>             Other postings that you may find useful on closely related
>>             topics are:
>>
>>             http://www.ibm.com/developerworks/linux/library/l-scheduler/
>>             http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>>
>>             I hope this helps,
>>
>>             Still, we have a long way to go to sort out how much of
>>             the multicore bottleneck can
>>             be ascribed to lack of memory bandwidth, and how much may
>>             perhaps be associated with how
>>             memcpy is compiled by different compilers,
>>             or if there are other components of this problem that we
>>             don't see now.
>>
>>             Maybe our community won't find a solution to Zach's
>>             problem: "Why is my quad core slower than cluster?"
>>             However, I hope that through collaboration, and by sharing
>>             information,
>>             we may be able to nail down the root of the problem,
>>             and perhaps to find ways to improve the alarmingly bad
>>             performance
>>             some of us have reported on multicore machines.
>>
>>
>>             Gus Correa
>>
>>             -- 
>>             ---------------------------------------------------------------------
>>             Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>             Lamont-Doherty Earth Observatory - Columbia University
>>             P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>             ---------------------------------------------------------------------
>>
>>
>>
>>
> 
> 



