[mpich-discuss] Why is my quad core slower than cluster
Eric A. Borisch
eborisch at ieee.org
Mon Jul 14 23:36:34 CDT 2008
Gus,
Information sharing is truly the point of the mailing list. Useful messages
should ask questions or provide answers! :)
Someone mentioned STREAM benchmarks (memory BW benchmarks) a little while
back. I did these when our new system came in a while ago, so I dug them
back out.
STREAM can be compiled to use MPI, but there MPI serves only as a
synchronization tool; the benchmark is still a memory bus test (each task is
trying to run through memory; this is not an MPI communication test).
My results on a dual E5472 machine (two quad-core 3 GHz packages; 1600 MHz
bus; 8 total cores).
Results (each set lists [1..8] processes in order), double-precision array
size = 20,000,000, run through 10 times:
Function Rate (MB/s) Avg time Min time Max time
Copy: 2962.6937 0.1081 0.1080 0.1081
Copy: 5685.3008 0.1126 0.1126 0.1128
Copy: 5484.6846 0.1751 0.1750 0.1751
Copy: 7085.7959 0.1809 0.1806 0.1817
Copy: 5981.6033 0.2676 0.2675 0.2676
Copy: 7071.2490 0.2718 0.2715 0.2722
Copy: 6537.4934 0.3427 0.3426 0.3428
Copy: 7423.4545 0.3451 0.3449 0.3455
Scale: 3011.8445 0.1063 0.1062 0.1063
Scale: 5675.8162 0.1128 0.1128 0.1129
Scale: 5474.8854 0.1754 0.1753 0.1754
Scale: 7068.6204 0.1814 0.1811 0.1819
Scale: 5974.6112 0.2679 0.2678 0.2680
Scale: 7063.8307 0.2721 0.2718 0.2725
Scale: 6533.4473 0.3430 0.3429 0.3431
Scale: 7418.6128 0.3453 0.3451 0.3456
Add: 3184.3129 0.1508 0.1507 0.1508
Add: 5892.1781 0.1631 0.1629 0.1633
Add: 5588.0229 0.2577 0.2577 0.2578
Add: 7275.0745 0.2642 0.2639 0.2646
Add: 6175.7646 0.3887 0.3886 0.3889
Add: 7262.7112 0.3970 0.3965 0.3976
Add: 6687.7658 0.5025 0.5024 0.5026
Add: 7599.2516 0.5057 0.5053 0.5062
Triad: 3224.7856 0.1489 0.1488 0.1489
Triad: 6021.2613 0.1596 0.1594 0.1598
Triad: 5609.9260 0.2567 0.2567 0.2568
Triad: 7293.2790 0.2637 0.2633 0.2641
Triad: 6185.4376 0.3881 0.3880 0.3881
Triad: 7279.1231 0.3958 0.3957 0.3961
Triad: 6691.8560 0.5022 0.5021 0.5022
Triad: 7604.1238 0.5052 0.5050 0.5057
These work out to (~):
1x
1.9x
1.8x
2.3x
1.9x
2.2x
2.1x
2.4x
for [1..8] cores.
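The ratios can be reproduced from the table. A minimal sketch, assuming the
Triad rows are used as the reference (the rounded figures above sit close to
the Triad and Add ratios; Copy and Scale come out slightly higher). The names
triad_mb_s and speedups are illustrative only:

```python
# Triad rates (MB/s) for 1..8 processes, copied from the table above.
triad_mb_s = [3224.7856, 6021.2613, 5609.9260, 7293.2790,
              6185.4376, 7279.1231, 6691.8560, 7604.1238]


def speedups(rates):
    """Return each rate divided by the single-process rate."""
    base = rates[0]
    return [r / base for r in rates]


triad_speedups = speedups(triad_mb_s)
for nprocs, s in enumerate(triad_speedups, start=1):
    print(f"{nprocs} process(es): {s:.2f}x")
```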
As you can see, it doesn't take eight cores to saturate the bus, even with a
1600 MHz bus. Running four of the eight cores does the trick.
With all that said, there are still advantages to be had with the multicore
chipsets, but only if you're not blowing full tilt through memory. If the
problem allows it, do more inside a loop rather than running multiple loops
over the same memory.
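To illustrate "do more inside a loop", here is a minimal sketch of the same
arithmetic written as two passes and as one fused pass. The function names
and the scale-then-offset operation are made up for the example. In Python
the interpreter overhead hides the memory effect, so this shows only the
loop structure; any bandwidth saving would show up in compiled code working
on arrays much larger than cache.

```python
def two_passes(data):
    """Scale, then offset: each loop walks the whole array once."""
    out = [0.0] * len(data)
    for i in range(len(data)):      # pass 1 over memory
        out[i] = data[i] * 2.0
    for i in range(len(out)):       # pass 2 over the same memory
        out[i] = out[i] + 1.0
    return out


def one_pass(data):
    """Same arithmetic, fused: each element is read and written once."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * 2.0 + 1.0
    return out


sample = [float(i) for i in range(8)]
assert two_passes(sample) == one_pass(sample)
```

The fused version reads and writes each element once instead of twice, which
is the kind of reformulation that helps when the loop is limited by the bus
rather than by arithmetic.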
For reference, here's what the osu_mbw_mr test (from MVAPICH2 1.0.2; I
also have a cluster running nearby :) ), compiled against MPICH2 (1.0.7rc1
with nemesis), reports for one/two/four pairs (2/4/8 processes) of
producers/consumers:
# OSU MPI Multi BW / Message Rate Test (Version 1.0)
# [ pairs: 1 ] [ window size: 64 ]
# Size MB/sec Messages/sec
1 1.08 1076540.83
2 2.14 1068102.24
4 3.99 997382.24
8 7.97 996419.66
16 15.95 996567.63
32 31.67 989660.29
64 62.73 980084.91
128 124.12 969676.18
256 243.59 951527.62
512 445.52 870159.34
1024 810.28 791284.80
2048 1357.25 662721.78
4096 1935.08 472431.28
8192 2454.29 299596.49
16384 2717.61 165869.84
32768 2900.23 88507.85
65536 2279.71 34785.63
131072 2540.51 19382.53
262144 1335.16 5093.21
524288 1364.05 2601.72
1048576 1378.39 1314.53
2097152 1380.78 658.41
4194304 1343.48 320.31
# OSU MPI Multi BW / Message Rate Test (Version 1.0)
# [ pairs: 2 ] [ window size: 64 ]
# Size MB/sec Messages/sec
1 2.15 2150580.48
2 4.22 2109761.12
4 7.84 1960742.53
8 15.80 1974733.92
16 31.38 1961100.64
32 62.32 1947654.32
64 123.39 1928000.11
128 243.19 1899957.22
256 475.32 1856721.12
512 856.90 1673642.10
1024 1513.19 1477721.26
2048 2312.91 1129351.07
4096 2891.21 705861.12
8192 3267.49 398863.98
16384 3400.64 207558.54
32768 3519.74 107413.93
65536 3141.80 47940.04
131072 3368.65 25700.76
262144 2211.53 8436.31
524288 2264.90 4319.95
1048576 2282.69 2176.94
2097152 2250.72 1073.23
4194304 2087.00 497.58
# OSU MPI Multi BW / Message Rate Test (Version 1.0)
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/sec Messages/sec
1 3.65 3651934.64
2 8.16 4080341.34
4 15.66 3914908.02
8 31.32 3915621.85
16 62.67 3916764.51
32 124.37 3886426.18
64 246.38 3849640.84
128 486.39 3799914.44
256 942.40 3681232.25
512 1664.21 3250414.19
1024 2756.50 2691891.86
2048 3829.45 1869848.54
4096 4465.25 1090148.56
8192 4777.45 583184.51
16384 4822.75 294357.30
32768 4829.77 147392.80
65536 4556.93 69533.18
131072 4789.32 36539.60
262144 3631.68 13853.75
524288 3679.31 7017.72
1048576 3553.61 3388.99
2097152 3113.12 1484.45
4194304 2452.69 584.77
So from a messaging standpoint, you can see that you squeeze more data
through with more processes. I'd guess that this is because there's
processing to be done within MPI to move the data, and a lot of the
bookkeeping steps probably cache well (updating the same status structure
multiple times during a communication; perhaps reusing the structure for
subsequent transfers and finding it still in cache), so the performance
scaling is not completely FSB bound.
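A quick arithmetic check on the tables above, as a sketch: the MB/sec column
appears to be message size times messages/sec, with 1 MB taken as 10**6
bytes (this is inferred from the numbers, not from the benchmark source).
The second part compares the peak MB/sec for 1, 2 and 4 pairs. The names
rows, bandwidth_mb_s, peak and scaling are illustrative only.

```python
# (message size in bytes, MB/sec, messages/sec) rows copied from the tables.
rows = [
    (1,       1.08,    1076540.83),   # 1 pair
    (4194304, 1343.48, 320.31),       # 1 pair
    (32768,   4829.77, 147392.80),    # 4 pairs
]


def bandwidth_mb_s(size_bytes, msgs_per_s):
    """Bandwidth implied by the message rate, with 1 MB = 10**6 bytes."""
    return size_bytes * msgs_per_s / 1e6


for size, reported, rate in rows:
    print(size, reported, round(bandwidth_mb_s(size, rate), 2))

# Peak MB/sec for 1, 2 and 4 pairs (each occurs at 32768-byte messages).
peak = {1: 2900.23, 2: 3519.74, 4: 4829.77}
scaling = {pairs: bw / peak[1] for pairs, bw in peak.items()}
print(scaling)
```

Peak messaging bandwidth keeps growing from one to four pairs, whereas the
STREAM rates above flatten out at around four processes.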
I'm sure there are plenty of additional things that could be done here to
test different CPU-to-process layouts, etc., but in testing my own real-world
code, I've found that, unfortunately, "it depends." I have some code that
scales nearly linearly (multiple computationally expensive operations inside
the innermost loop) and some that scales like the STREAM results above ("add
one to the next 20 million points") ...
As always, your mileage may vary. If your speedup looks like the STREAM
numbers above, you're likely memory bound. Try to reformulate your problem
to go through memory slower but with more done each pass, or invest in a
cluster. At some point -- for some problems -- you can't beat more memory
busses!
Cheers,
Eric Borisch
--
borisch.eric at mayo.edu
MRI Research
Mayo Clinic
On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hello Sami and list
>
> Oh, well, as you see, an expert who claims to know the answers to these
> problems
> seems not to be willing to share these answers with less knowledgeable MPI
> users like us.
> So, maybe we can find the answers ourselves, not by individual "homework"
> brainstorming,
> but through community collaboration and generous information sharing,
> which is the hallmark of this mailing list.
>
> I Googled around today to find out how to assign MPI processes to specific
> processors,
> and I found some interesting information on how to do it.
>
> Below is a link to a posting from the computational fluid dynamics (CFD)
> community that may be of interest.
> Not surprisingly, they are struggling with the same type of problems all of
> us have,
> including how to tie MPI processes to specific processors:
>
> http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>
> I would summarize these problems as related to three types of bottleneck:
>
> 1) Multicore processor bottlenecks (standalone machines and clusters)
> 2) Network fabric bottlenecks (clusters)
> 3) File system bottlenecks (clusters)
>
> All three types of problems are due to contention for some type of system
> resource
> by the MPI processes that take part in a computation/program.
>
> Our focus on this thread, started by Zach, has been on problem 1),
> although most of us may need to look into problems 2) and 3) sooner or
> later.
> (I have all the three of them already!)
>
> The CFD folks use MPI as we do.
> They seem to use another MPI flavor, but the same problems are there.
> The problems are not caused by MPI itself, but they become apparent when
> you run MPI programs.
> That has been my experience too.
>
> As for how to map the MPI processes to specific processors (or cores),
> the key command seems to be "taskset", as my googling afternoon showed.
> Try "man taskset" for more info.
>
> For a standalone machine like yours, something like the command line below
> should work to
> force execution on "processors" 0 and 2 (which in my case are two different
> physical CPUs):
>
> mpiexec -n 2 taskset -c 0,2 my_mpi_program
>
> You need to check on your computer ("more /proc/cpuinfo")
> what are the exact "processor" numbers that correspond to separate physical
> CPUs. Most likely they are the even numbered processors only, or the odd
> numbered only,
> since you have dual-core CPUs (integers modulo 2), with "processors" 0,1
> being the two
> cores of the first physical CPU, "processors" 2,3 the cores of the second
> physical CPU, and so on.
> At least, this is what I see on my dual-core dual-processor machine.
> I would say for quad-cores the separate physical CPUs would be processors
> 0,4,8, etc,
> or 1,5,9, etc, and so on (integers modulo 4), with "processors" 0,1,2,3
> being the four cores
> in the first physical CPU, and so on.
> In /proc/cpuinfo look for the keyword "processor".
> These are the numbers you need to use in "taskset -c".
> However, other helpful information comes in the keywords "physical id",
> "core id", "siblings", and "cpu cores".
> They will allow you to map cores and physical CPUs to
> the "processor" number.
>
> The "taskset" command line above worked in one of my standalone multicore
> machines,
> and I hope a variant of it will work on your machine also.
> It works with the "mpiexec" that comes with the MPICH distribution, and
> also with
> the "mpiexec" associated to the Torque/PBS batch system, which is nice for
> clusters as well.
>
> "Taskset" can change the default behavior of the Linux scheduler, which is
> to allow processes to
> be moved from one core/CPU to another during execution.
> The scheduler does this to ensure optimal CPU use (i.e. load balance).
> With taskset you can force execution to happen on the cores you specify on
> the command line,
> i.e. you can force the so called "CPU affinity" you wish.
> Note that the "taskset" man page uses both the terms "CPU" and "processor",
> and doesn't use the term "core",
> which may be a bit confusing. Make no mistake, "processor" and "CPU" there
> stand for what we've been calling "core" here.
>
> Other postings that you may find useful on closely related topics are:
>
> http://www.ibm.com/developerworks/linux/library/l-scheduler/
> http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>
> I hope this helps,
>
> Still, we have a long way to go to sort out how much of the multicore
> bottleneck can
> be ascribed to lack of memory bandwidth, and how much may be perhaps
> associated to how
> memcpy is compiled by different compilers,
> or if there are other components of this problem that we don't see now.
>
> Maybe our community won't find a solution to Zach's problem: "Why is my
> quad core slower than cluster?"
> However, I hope that through collaboration, and by sharing information,
> we may be able to nail down the root of the problem,
> and perhaps to find ways to improve the alarmingly bad performance
> some of us have reported on multicore machines.
>
> Gus Correa
>
> --
> ------------------------------ ---------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
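On the /proc/cpuinfo mapping and the taskset command quoted above, a minimal
sketch under stated assumptions. It groups "processor" numbers by "physical
id" (both are keywords named in the quoted message), then pins the calling
process to a single core with os.sched_setaffinity, which is Linux-only.
Note that "mpiexec -n 2 taskset -c 0,2 ..." gives every rank the same
two-core mask, so the scheduler may still move ranks between those two
cores; pinning one rank per core needs a per-process mask. The helper names
and the local_rank argument are hypothetical; how a process learns its rank
depends on the MPI launcher and is not shown.

```python
import os
from collections import defaultdict


def processors_by_package(cpuinfo_text):
    """Map each "physical id" to the sorted "processor" numbers in it."""
    packages = defaultdict(list)
    processor = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # blank line between entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id" and processor is not None:
            packages[int(value)].append(processor)
    return {pkg: sorted(procs) for pkg, procs in packages.items()}


def one_core_per_package(packages):
    """Pick the lowest-numbered processor from each physical package."""
    return [procs[0] for _, procs in sorted(packages.items())]


def pin_self(local_rank, cores):
    """Pin the calling process to one core chosen by its local rank."""
    core = cores[local_rank % len(cores)]
    os.sched_setaffinity(0, {core})       # Linux-only; 0 = this process
    return core


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(processors_by_package(f.read()))
```

The numbering differs between machines, so the mapping printed on your own
system is what should feed "taskset -c" or the pin_self sketch.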