[mpich-discuss] Why is my quad core slower than cluster
Robert Kubrick
robertkubrick at gmail.com
Tue Jul 15 18:06:18 CDT 2008
A recent (long) discussion about numactl and taskset on the beowulf
mailing list:
http://www.beowulf.org/archive/2008-June/021810.html
On Jul 15, 2008, at 1:35 PM, chong tan wrote:
> Eric,
>
> I know you are referring to me as the one not sharing. I am no expert
> on MP, but someone who has done his homework. I would like to share,
> but the NDAs and company policy say no.
>
> You make good points and did some good experiments. That is what I
> would expect most MP designers and users to have done in the first place.
>
> The answers to the original question are simple:
>
> - On the 2Xquad, you have one memory system, while on the cluster you
> have 8 memory systems; the total bandwidth favors the cluster considerably.
>
> - On the cluster, there is no way for the process to be context
> switched, while that can happen on the 2Xquad. When that happens, life
> is bad.
>
> - The only things that favor the SMP box are the cost of communication
> and shared memory.
>
>
> There are more factors; the art is balancing them in your favor.
> In a way, the x86 quads are not designed to let us load them up with
> fat and heavy processes. That is what I have been saying all
> along: know your HW first. Your MP solution should come second.
> Whatever utilities you can find will help put the solution together.
>
>
> So, the problem is not MPI in this case.
>
>
> tan
>
>
>
> --- On Mon, 7/14/08, Eric A. Borisch <eborisch at ieee.org> wrote:
>
> From: Eric A. Borisch <eborisch at ieee.org>
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> To: mpich-discuss at mcs.anl.gov
> Date: Monday, July 14, 2008, 9:36 PM
>
> Gus,
>
> Information sharing is truly the point of the mailing list. Useful
> messages should ask questions or provide answers! :)
>
> Someone mentioned STREAM benchmarks (memory BW benchmarks) a little
> while back. I did these when our new system came in a while ago, so
> I dug them back out.
>
> This (STREAM) can be compiled to use MPI, but MPI is only used for
> synchronization; the benchmark is still a memory bus test
> (each task is trying to run through memory, but this is not an MPI
> communication test).
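>
> For anyone who hasn't looked at it, the heart of the benchmark is just
> simple sweeps over large arrays. A minimal sketch of the Triad kernel
> (my own toy version, not the actual STREAM source; the array size
> matches the run below but the scalar is arbitrary):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> #define N 20000000            /* 20 million doubles per array */
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>
>     double *a = malloc(N * sizeof(double));
>     double *b = malloc(N * sizeof(double));
>     double *c = malloc(N * sizeof(double));
>     double s = 3.0;
>     for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
>
>     MPI_Barrier(MPI_COMM_WORLD);      /* MPI only lines the tasks up */
>     double t0 = MPI_Wtime();
>     for (long i = 0; i < N; i++)      /* Triad: a = b + s*c */
>         a[i] = b[i] + s * c[i];
>     double t1 = MPI_Wtime();
>
>     /* three arrays of 8-byte doubles are touched per pass */
>     printf("Triad: %.1f MB/s (a[0]=%g)\n",
>            3.0 * N * sizeof(double) / (t1 - t0) / 1e6, a[0]);
>
>     free(a); free(b); free(c);
>     MPI_Finalize();
>     return 0;
> }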
>
> My results on a dual E5472 machine (two quad-core 3 GHz packages;
> 1600 MHz bus; 8 total cores)
>
> Results (each set is [1..8] processes in order), double-precision
> array size = 20,000,000, run through 10 times.
>
> Function Rate (MB/s) Avg time Min time Max time
> Copy: 2962.6937 0.1081 0.1080 0.1081
> Copy: 5685.3008 0.1126 0.1126 0.1128
> Copy: 5484.6846 0.1751 0.1750 0.1751
> Copy: 7085.7959 0.1809 0.1806 0.1817
> Copy: 5981.6033 0.2676 0.2675 0.2676
> Copy: 7071.2490 0.2718 0.2715 0.2722
> Copy: 6537.4934 0.3427 0.3426 0.3428
> Copy: 7423.4545 0.3451 0.3449 0.3455
>
> Scale: 3011.8445 0.1063 0.1062 0.1063
> Scale: 5675.8162 0.1128 0.1128 0.1129
> Scale: 5474.8854 0.1754 0.1753 0.1754
> Scale: 7068.6204 0.1814 0.1811 0.1819
> Scale: 5974.6112 0.2679 0.2678 0.2680
> Scale: 7063.8307 0.2721 0.2718 0.2725
> Scale: 6533.4473 0.3430 0.3429 0.3431
> Scale: 7418.6128 0.3453 0.3451 0.3456
>
> Add: 3184.3129 0.1508 0.1507 0.1508
> Add: 5892.1781 0.1631 0.1629 0.1633
> Add: 5588.0229 0.2577 0.2577 0.2578
> Add: 7275.0745 0.2642 0.2639 0.2646
> Add: 6175.7646 0.3887 0.3886 0.3889
> Add: 7262.7112 0.3970 0.3965 0.3976
> Add: 6687.7658 0.5025 0.5024 0.5026
> Add: 7599.2516 0.5057 0.5053 0.5062
>
> Triad: 3224.7856 0.1489 0.1488 0.1489
> Triad: 6021.2613 0.1596 0.1594 0.1598
> Triad: 5609.9260 0.2567 0.2567 0.2568
> Triad: 7293.2790 0.2637 0.2633 0.2641
> Triad: 6185.4376 0.3881 0.3880 0.3881
> Triad: 7279.1231 0.3958 0.3957 0.3961
> Triad: 6691.8560 0.5022 0.5021 0.5022
> Triad: 7604.1238 0.5052 0.5050 0.5057
>
> These work out to (~):
> 1x
> 1.9x
> 1.8x
> 2.3x
> 1.9x
> 2.2x
> 2.1x
> 2.4x
>
> for [1..8] cores.
>
> As you can see, it doesn't take eight cores to saturate the bus,
> even with a 1600 MHz bus. Running four of the eight cores does the
> trick.
>
> With all that said, there are still advantages to be had with the
> multicore chipsets, but only if you're not blowing full tilt
> through memory. If the problem allows it, do more work inside a loop
> rather than running multiple loops over the same memory.
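>
> As a rough illustration (my own toy example, not code from anyone's
> application; it assumes 'a' already holds N initialized doubles),
> fusing sweeps raises the arithmetic done per byte pulled over the bus:
>
> #define N 20000000L
>
> /* Memory-bound: three separate trips through the same array */
> void separate_sweeps(double *a, double s, double *sum)
> {
>     for (long i = 0; i < N; i++) a[i] += 1.0;
>     for (long i = 0; i < N; i++) a[i] *= s;
>     for (long i = 0; i < N; i++) *sum += a[i];
> }
>
> /* Fused: the same arithmetic in one trip through memory */
> void fused_sweep(double *a, double s, double *sum)
> {
>     for (long i = 0; i < N; i++) {
>         a[i] = (a[i] + 1.0) * s;
>         *sum += a[i];
>     }
> }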
>
> For reference, here's the performance the osu_mbw_mr test (from
> MVAPICH2 1.0.2; I also have a cluster running nearby :) gives when
> compiled against MPICH2 (1.0.7rc1 with nemesis), run with one/two/four
> pairs (2/4/8 processes) of producers/consumers:
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 1 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 1.08 1076540.83
> 2 2.14 1068102.24
> 4 3.99 997382.24
> 8 7.97 996419.66
> 16 15.95 996567.63
> 32 31.67 989660.29
> 64 62.73 980084.91
> 128 124.12 969676.18
> 256 243.59 951527.62
> 512 445.52 870159.34
> 1024 810.28 791284.80
> 2048 1357.25 662721.78
> 4096 1935.08 472431.28
> 8192 2454.29 299596.49
> 16384 2717.61 165869.84
> 32768 2900.23 88507.85
> 65536 2279.71 34785.63
> 131072 2540.51 19382.53
> 262144 1335.16 5093.21
> 524288 1364.05 2601.72
> 1048576 1378.39 1314.53
> 2097152 1380.78 658.41
> 4194304 1343.48 320.31
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 2 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 2.15 2150580.48
> 2 4.22 2109761.12
> 4 7.84 1960742.53
> 8 15.80 1974733.92
> 16 31.38 1961100.64
> 32 62.32 1947654.32
> 64 123.39 1928000.11
> 128 243.19 1899957.22
> 256 475.32 1856721.12
> 512 856.90 1673642.10
> 1024 1513.19 1477721.26
> 2048 2312.91 1129351.07
> 4096 2891.21 705861.12
> 8192 3267.49 398863.98
> 16384 3400.64 207558.54
> 32768 3519.74 107413.93
> 65536 3141.80 47940.04
> 131072 3368.65 25700.76
> 262144 2211.53 8436.31
> 524288 2264.90 4319.95
> 1048576 2282.69 2176.94
> 2097152 2250.72 1073.23
> 4194304 2087.00 497.58
>
> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
> # [ pairs: 4 ] [ window size: 64 ]
>
> # Size MB/sec Messages/sec
> 1 3.65 3651934.64
> 2 8.16 4080341.34
> 4 15.66 3914908.02
> 8 31.32 3915621.85
> 16 62.67 3916764.51
> 32 124.37 3886426.18
> 64 246.38 3849640.84
> 128 486.39 3799914.44
> 256 942.40 3681232.25
> 512 1664.21 3250414.19
> 1024 2756.50 2691891.86
> 2048 3829.45 1869848.54
> 4096 4465.25 1090148.56
> 8192 4777.45 583184.51
> 16384 4822.75 294357.30
> 32768 4829.77 147392.80
> 65536 4556.93 69533.18
> 131072 4789.32 36539.60
> 262144 3631.68 13853.75
> 524288 3679.31 7017.72
> 1048576 3553.61 3388.99
> 2097152 3113.12 1484.45
> 4194304 2452.69 584.77
>
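> The pattern each pair follows is, roughly, a producer posting a window
> of non-blocking sends and a consumer posting the matching receives,
> with a short acknowledgement to pace the iterations. A minimal sketch
> in that spirit (this is not the OSU source; the window, message size,
> and iteration count here are arbitrary, and it assumes an even number
> of ranks):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <mpi.h>
>
> #define WINDOW 64
> #define MSG    8192
> #define ITERS  100
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>     char ack = 0;
>     MPI_Request req[WINDOW];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (size % 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* need whole pairs */
>
>     char *buf = malloc((size_t)WINDOW * MSG);    /* one slot per message */
>     memset(buf, rank, (size_t)WINDOW * MSG);
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     double t0 = MPI_Wtime();
>
>     for (int it = 0; it < ITERS; it++) {
>         if (rank % 2 == 0) {                     /* producer */
>             for (int w = 0; w < WINDOW; w++)
>                 MPI_Isend(buf + (size_t)w * MSG, MSG, MPI_CHAR,
>                           rank + 1, 0, MPI_COMM_WORLD, &req[w]);
>             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>             MPI_Recv(&ack, 1, MPI_CHAR, rank + 1, 1,
>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         } else {                                 /* consumer */
>             for (int w = 0; w < WINDOW; w++)
>                 MPI_Irecv(buf + (size_t)w * MSG, MSG, MPI_CHAR,
>                           rank - 1, 0, MPI_COMM_WORLD, &req[w]);
>             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>             MPI_Send(&ack, 1, MPI_CHAR, rank - 1, 1, MPI_COMM_WORLD);
>         }
>     }
>
>     double t1 = MPI_Wtime();
>     if (rank % 2 == 0)
>         printf("pair %d: %.1f MB/s\n", rank / 2,
>                (double)ITERS * WINDOW * MSG / (t1 - t0) / 1e6);
>
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }
>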
> So from a messaging standpoint, you can see that you squeeze more
> data through with more processes. I'd guess that this is because
> there's processing to be done within MPI to move the data, and a
> lot of the bookkeeping steps probably cache well (updating the same
> status structure on a communication multiple times, perhaps reusing
> the structure for subsequent transfers and finding it still in
> cache), so the performance scaling is not completely FSB bound.
>
> I'm sure there are plenty of additional things that could be done
> here to test different CPU-to-process layouts, etc., but in testing
> my own real-world code, I've found that, unfortunately, "it
> depends." I have some code that nearly scales linearly (multiple
> computationally expensive operations inside the innermost loop) and
> some that scales like the STREAM results above ("add one to the
> next 20 million points") ...
>
> As always, your mileage may vary. If your speedup looks like the
> STREAM numbers above, you're likely memory bound. Try to
> reformulate your problem to go through memory slower but with more
> done each pass, or invest in a cluster. At some point -- for some
> problems -- you can't beat more memory busses!
>
> Cheers,
> Eric Borisch
>
> --
> borisch.eric at mayo.edu
> MRI Research
> Mayo Clinic
>
> On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa <gus at ldeo.columbia.edu>
> wrote:
> Hello Sami and list
>
> Oh, well, as you see, an expert who claims to know the answers to
> these problems
> seems not to be willing to share these answers with less
> knowledgeable MPI users like us.
> So, maybe we can find the answers ourselves, not by individual
> "homework" brainstorming,
> but through community collaboration and generous information sharing,
> which is the hallmark of this mailing list.
>
> I Googled around today to find out how to assign MPI processes to
> specific processors,
> and I found some interesting information on how to do it.
>
> Below is a link to a posting from the computational fluid dynamics
> (CFD) community that may be of interest.
> Not surprisingly, they are struggling with the same type of
> problems all of us have,
> including how to tie MPI processes to specific processors:
>
> http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>
> I would summarize these problems as related to three types of
> bottleneck:
>
> 1) Multicore processor bottlenecks (standalone machines and clusters)
> 2) Network fabric bottlenecks (clusters)
> 3) File system bottlenecks (clusters)
>
> All three types of problems are due to contention for some type of
> system resource
> by the MPI processes that take part in a computation/program.
>
> Our focus on this thread, started by Zach, has been on problem 1),
> although most of us may need to look into problems 2) and 3) sooner
> or later.
> (I have all three of them already!)
>
> The CFD folks use MPI as we do.
> They seem to use another MPI flavor, but the same problems are there.
> The problems are not caused by MPI itself, but they become apparent
> when you run MPI programs.
> That has been my experience too.
>
> As for how to map the MPI processes to specific processors (or cores),
> the key command seems to be "taskset", as my googling afternoon
> showed.
> Try "man taskset" for more info.
>
> For a standalone machine like yours, something like the command
> line below should work to
> force execution on "processors" 0 and 2 (which in my case are two
> different physical CPUs):
>
> mpiexec -n 2 taskset -c 0,2 my_mpi_program
>
> You need to check on your computer ("more /proc/cpuinfo")
> what the exact "processor" numbers are that correspond to separate
> physical CPUs. Most likely they are the even-numbered processors
> only, or the odd-numbered only,
> since you have dual-core CPUs (integers modulo 2), with
> "processors" 0,1 being the two
> cores of the first physical CPU, "processors" 2,3 the cores of the
> second physical CPU, and so on.
> At least, this is what I see on my dual-core dual-processor machine.
> I would say for quad-cores the separate physical CPUs would be
> processors 0,4,8, etc,
> or 1,5,9, etc, and so on (integers modulo 4), with "processors"
> 0,1,2,3 being the four cores
> in the first physical CPU, and so on.
> In /proc/cpuinfo look for the keyword "processor".
> These are the numbers you need to use in "taskset -c".
> However, other helpful information comes in the keywords "physical
> id",
> "core id", "siblings", and "cpu cores".
> They will allow you to map cores and physical CPUs to
> the "processor" number.
>
> The "taskset" command line above worked on one of my standalone
> multicore machines,
> and I hope a variant of it will work on your machine also.
> It works with the "mpiexec" that comes with the MPICH distribution,
> and also with
> the "mpiexec" associated with the Torque/PBS batch system, which is
> nice for clusters as well.
>
> "Taskset" can change the default behavior of the Linux scheduler,
> which is to allow processes to
> be moved from one core/CPU to another during execution.
> The scheduler does this to ensure optimal CPU use (i.e. load balance).
> With taskset you can force execution to happen on the cores you
> specify on the command line,
> i.e. you can enforce the so-called "CPU affinity" you wish.
> Note that the "taskset" man page uses both the terms "CPU" and
> "processor", and doesn't use the term "core",
> which may be a bit confusing. Make no mistake, "processor" and
> "CPU" there stand for what we've been calling "core" here.
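>
> If you would rather set the affinity from inside the program instead
> of on the command line, the Linux sched_setaffinity() call does the
> same thing per process. A minimal sketch (the rank-to-"processor"
> mapping below is only an assumption for illustration; substitute
> whatever your /proc/cpuinfo layout calls for):
>
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     cpu_set_t mask;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* Assumed mapping: rank r -> "processor" 2*r (the even-numbered
>        entries in /proc/cpuinfo); adjust to your physical id / core id
>        layout. */
>     CPU_ZERO(&mask);
>     CPU_SET(2 * rank, &mask);
>     if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = self */
>         perror("sched_setaffinity");
>
>     /* ... the rest of the MPI program now runs pinned to that core ... */
>
>     MPI_Finalize();
>     return 0;
> }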
>
> Other postings that you may find useful on closely related topics are:
>
> http://www.ibm.com/developerworks/linux/library/l-scheduler/
> http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>
> I hope this helps,
>
> Still, we have a long way to go to sort out how much of the
> multicore bottleneck can
> be ascribed to lack of memory bandwidth, how much may
> perhaps be associated with how
> memcpy is compiled by different compilers,
> or whether there are other components of this problem that we don't
> see now.
>
> Maybe our community won't find a solution to Zach's problem: "Why
> is my quad core slower than cluster?"
> However, I hope that through collaboration, and by sharing
> information,
> we may be able to nail down the root of the problem,
> and perhaps to find ways to improve the alarmingly bad performance
> some of us have reported on multicore machines.
>
>
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>