[mpich-discuss] Why is my quad core slower than cluster
chong tan
chong_guan_tan at yahoo.com
Thu Jul 17 12:20:20 CDT 2008
Grow up, Gib. There is no NDA between employer and employee in the USA. Maybe there is in New Zealand, but I certainly don't know anything about New Zealand. I only know that when a person buys a computer in NZ, the computer user count in NZ just doubles.
tan
--- On Wed, 7/16/08, Gib Bogle <g.bogle at auckland.ac.nz> wrote:
From: Gib Bogle <g.bogle at auckland.ac.nz>
Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
To: mpich-discuss at mcs.anl.gov
Date: Wednesday, July 16, 2008, 6:31 PM
And are we to believe that your employer's NDA doesn't permit you to say
WHICH answer is wrong? Whew!
Gib
chong tan wrote:
> Just FYI,
>
> from my knowledge, at least 1 answer to the question in that thread is
> absolutely wrong, according to HW information on hand. Some of the info
> in that thread is not applicable across the board, and the original
> question (threaded application) is not answered.
>
>
>
> Whether to use numactl on a NUMA system is situation dependent. In
> general, numactl is bad if you oversubscribe the system.
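For illustration only (commands not from the original thread; the program name is hypothetical): numactl can pin a process's CPUs and memory to one NUMA node. That helps while each node still has spare cores, and hurts once the bound node has more runnable processes than cores.

    numactl --hardware                                  # list NUMA nodes with their CPUs and memory
    numactl --cpunodebind=0 --membind=0 ./my_mpi_rank   # run pinned to node 0's cores and memory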
>
>
>
> tan
>
>
>
> --- On *Tue, 7/15/08, Robert Kubrick /<robertkubrick at gmail.com>/* wrote:
>
> From: Robert Kubrick <robertkubrick at gmail.com>
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> To: mpich-discuss at mcs.anl.gov
> Date: Tuesday, July 15, 2008, 4:06 PM
>
> A recent (long) discussion about numactl and taskset on the beowulf
> mailing list:
> http://www.beowulf.org/archive/2008-June/021810.html
>
> On Jul 15, 2008, at 1:35 PM, chong tan wrote:
>
>> Eric,
>>
>> I know you are referring to me as the one not sharing. I am no
>> expert on MP, but someone who has done his homework. I would like to
>> share, but the NDAs and company policy say no.
>>
>> You have good points and did some good experiments. That is what
>> I expect most MP designers and users to have done in the first place.
>>
>> The answers to the original question are simple:
>>
>> - On the 2Xquad, you have one memory system, while on the cluster you
>> have 8 memory systems; the total bandwidth favors the cluster
>> considerably.
>>
>> - On the cluster, there is no way for the process to be context
>> switched, while that can happen on the 2XQuad. When this happens,
>> life is bad.
>>
>> - The only things that favor the SMP are the cost of communication
>> and shared memory.
>>
>>
>>
>> There are more factors; the art is balancing them in your favor.
>> In a way, the X86 quads are not designed to let us load them up with
>> fat and heavy processes. That is what I have been saying all
>> along: know your HW first. Your MP solution should come second.
>> Whatever utilities you can find will help put the solution together.
>>
>>
>>
>> So, the problem is not MPI in this case.
>>
>>
>>
>> tan
>>
>>
>>
>> --- On *Mon, 7/14/08, Eric A. Borisch /<eborisch at ieee.org>/* wrote:
>>
>> From: Eric A. Borisch <eborisch at ieee.org>
>> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>> To: mpich-discuss at mcs.anl.gov
>> Date: Monday, July 14, 2008, 9:36 PM
>>
>> Gus,
>>
>> Information sharing is truly the point of the mailing list.
>> Useful messages should ask questions or provide answers! :)
>>
>> Someone mentioned STREAM benchmarks (memory BW benchmarks) a
>> little while back. I did these when our new system came in a
>> while ago, so I dug them back out.
>>
>> This (STREAM) can be compiled to use MPI, but MPI is only a
>> synchronization tool there; the benchmark is still a memory bus test
>> (each task is trying to run through memory; this is not an
>> MPI communication test).
>>
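For readers who haven't seen STREAM: the kernels it times are plain sweeps over large arrays, so each extra process adds demand on the same memory bus. A minimal sketch of the Triad kernel (illustrative only, not the actual benchmark source; the timing and array size are simplified):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 20000000L   /* 20 million doubles per array, as in the results that follow */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double scalar = 3.0;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)          /* Triad: a = b + scalar*c            */
            a[i] = b[i] + scalar * c[i];      /* two loads + one store per element, */
        clock_t t1 = clock();                 /* so memory bandwidth dominates      */

        printf("Triad ~%.0f MB/s (check: %g)\n",
               3.0 * N * sizeof(double) / 1e6 / ((double)(t1 - t0) / CLOCKS_PER_SEC),
               a[N - 1]);    /* use a[] so the loop is not optimized away */
        free(a); free(b); free(c);
        return 0;
    }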
>> My results on a dual E5472 machine (two quad-core 3 GHz
>> packages; 1600 MHz bus; 8 total cores):
>>
>> Results (each set is [1..8] processes in order),
>> double-precision array size = 20,000,000, run through 10 times.
>>
>> Function Rate (MB/s) Avg time Min time Max time
>> Copy: 2962.6937 0.1081 0.1080 0.1081
>> Copy: 5685.3008 0.1126 0.1126 0.1128
>> Copy: 5484.6846 0.1751 0.1750 0.1751
>> Copy: 7085.7959 0.1809 0.1806 0.1817
>> Copy: 5981.6033 0.2676 0.2675 0.2676
>> Copy: 7071.2490 0.2718 0.2715 0.2722
>> Copy: 6537.4934 0.3427 0.3426 0.3428
>> Copy: 7423.4545 0.3451 0.3449 0.3455
>>
>> Scale: 3011.8445 0.1063 0.1062 0.1063
>> Scale: 5675.8162 0.1128 0.1128 0.1129
>> Scale: 5474.8854 0.1754 0.1753 0.1754
>> Scale: 7068.6204 0.1814 0.1811 0.1819
>> Scale: 5974.6112 0.2679 0.2678 0.2680
>> Scale: 7063.8307 0.2721 0.2718 0.2725
>> Scale: 6533.4473 0.3430 0.3429 0.3431
>> Scale: 7418.6128 0.3453 0.3451 0.3456
>>
>> Add: 3184.3129 0.1508 0.1507 0.1508
>> Add: 5892.1781 0.1631 0.1629 0.1633
>> Add: 5588.0229 0.2577 0.2577 0.2578
>> Add: 7275.0745 0.2642 0.2639 0.2646
>> Add: 6175.7646 0.3887 0.3886 0.3889
>> Add: 7262.7112 0.3970 0.3965 0.3976
>> Add: 6687.7658 0.5025 0.5024 0.5026
>> Add: 7599.2516 0.5057 0.5053 0.5062
>>
>> Triad: 3224.7856 0.1489 0.1488 0.1489
>> Triad: 6021.2613 0.1596 0.1594 0.1598
>> Triad: 5609.9260 0.2567 0.2567 0.2568
>> Triad: 7293.2790 0.2637 0.2633 0.2641
>> Triad: 6185.4376 0.3881 0.3880 0.3881
>> Triad: 7279.1231 0.3958 0.3957 0.3961
>> Triad: 6691.8560 0.5022 0.5021 0.5022
>> Triad: 7604.1238 0.5052 0.5050 0.5057
>>
>> These work out to (~):
>> 1x
>> 1.9x
>> 1.8x
>> 2.3x
>> 1.9x
>> 2.2x
>> 2.1x
>> 2.4x
>>
>> for [1..8] cores.
>>
>> As you can see, it doesn't take eight cores to saturate the
>> bus, even with a 1600 MHz bus. Four of the eight cores running
>> already does the trick.
>>
>> With all that said, there are still advantages to be had with
>> the multicore chipsets, but only if you're not blowing full
>> tilt through memory. If it can fit the problem, do more inside
>> a loop rather than running multiple loops over the same memory.
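To make that advice concrete, a hypothetical example (not from the thread): two separate passes stream x[] through memory twice, while the fused loop loads each element once and does both pieces of work on it.

    #include <stddef.h>

    /* Two passes: x[] is streamed through memory twice, so a memory-bound
       code pays the bus cost twice for the same data. */
    void two_passes(const double *x, size_t n, double *sum, double *sumsq)
    {
        double s = 0.0, sq = 0.0;
        for (size_t i = 0; i < n; i++) s  += x[i];
        for (size_t i = 0; i < n; i++) sq += x[i] * x[i];
        *sum = s; *sumsq = sq;
    }

    /* Fused: one pass, twice the arithmetic per element loaded from memory. */
    void one_pass(const double *x, size_t n, double *sum, double *sumsq)
    {
        double s = 0.0, sq = 0.0;
        for (size_t i = 0; i < n; i++) { s += x[i]; sq += x[i] * x[i]; }
        *sum = s; *sumsq = sq;
    }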
>>
>> For reference, here's what the osu_mbw_mr test (from
>> MVAPICH2 1.0.2; I also have a cluster running nearby :),
>> compiled against MPICH2 1.0.7rc1 with nemesis, delivers
>> from one/two/four pairs (2/4/8 processes) of
>> producers/consumers:
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 1 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 1.08 1076540.83
>> 2 2.14 1068102.24
>> 4 3.99 997382.24
>> 8 7.97 996419.66
>> 16 15.95 996567.63
>> 32 31.67 989660.29
>> 64 62.73 980084.91
>> 128 124.12 969676.18
>> 256 243.59 951527.62
>> 512 445.52 870159.34
>> 1024 810.28 791284.80
>> 2048 1357.25 662721.78
>> 4096 1935.08 472431.28
>> 8192 2454.29 299596.49
>> 16384 2717.61 165869.84
>> 32768 2900.23 88507.85
>> 65536 2279.71 34785.63
>> 131072 2540.51 19382.53
>> 262144 1335.16 5093.21
>> 524288 1364.05 2601.72
>> 1048576 1378.39 1314.53
>> 2097152 1380.78 658.41
>> 4194304 1343.48 320.31
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 2 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 2.15 2150580.48
>> 2 4.22 2109761.12
>> 4 7.84 1960742.53
>> 8 15.80 1974733.92
>> 16 31.38 1961100.64
>> 32 62.32 1947654.32
>> 64 123.39 1928000.11
>> 128 243.19 1899957.22
>> 256 475.32 1856721.12
>> 512 856.90 1673642.10
>> 1024 1513.19 1477721.26
>> 2048 2312.91 1129351.07
>> 4096 2891.21 705861.12
>> 8192 3267.49 398863.98
>> 16384 3400.64 207558.54
>> 32768 3519.74 107413.93
>> 65536 3141.80 47940.04
>> 131072 3368.65 25700.76
>> 262144 2211.53 8436.31
>> 524288 2264.90 4319.95
>> 1048576 2282.69 2176.94
>> 2097152 2250.72 1073.23
>> 4194304 2087.00 497.58
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 4 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 3.65 3651934.64
>> 2 8.16 4080341.34
>> 4 15.66 3914908.02
>> 8 31.32 3915621.85
>> 16 62.67 3916764.51
>> 32 124.37 3886426.18
>> 64 246.38 3849640.84
>> 128 486.39 3799914.44
>> 256 942.40 3681232.25
>> 512 1664.21 3250414.19
>> 1024 2756.50 2691891.86
>> 2048 3829.45 1869848.54
>> 4096 4465.25 1090148.56
>> 8192 4777.45 583184.51
>> 16384 4822.75 294357.30
>> 32768 4829.77 147392.80
>> 65536 4556.93 69533.18
>> 131072 4789.32 36539.60
>> 262144 3631.68 13853.75
>> 524288 3679.31 7017.72
>> 1048576 3553.61 3388.99
>> 2097152 3113.12 1484.45
>> 4194304 2452.69 584.77
>>
>> So from a messaging standpoint, you can see that you squeeze
>> more data through with more processes; I'd guess that this is
>> because there's processing to be done within MPI to move the
>> data, and a lot of the bookkeeping steps probably cache well
>> (updating the same status structure on a communication
>> multiple times; perhaps reusing the structure for subsequent
>> transfers and finding it still in cache), so the performance
>> scaling is not completely FSB bound.
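For context, the pattern behind a multi-pair bandwidth/message-rate number like the ones above is roughly a windowed non-blocking exchange per producer/consumer pair. A simplified sketch (not the actual osu_mbw_mr source; the function name, window size, and the omitted acknowledgement step are illustrative simplifications):

    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW 64   /* messages in flight per iteration, as in the runs above */

    /* One producer/consumer pair: the sender posts WINDOW non-blocking sends,
       the receiver posts WINDOW non-blocking receives, both wait, repeat.
       Returns the pair's bandwidth in MB/s. */
    double pair_bandwidth(int peer, int am_sender, int msg_size, int iters)
    {
        char *buf = malloc((size_t)msg_size * WINDOW);
        MPI_Request req[WINDOW];
        double t0 = MPI_Wtime();

        for (int it = 0; it < iters; it++) {
            for (int w = 0; w < WINDOW; w++) {
                if (am_sender)
                    MPI_Isend(buf + (size_t)w * msg_size, msg_size, MPI_CHAR,
                              peer, 0, MPI_COMM_WORLD, &req[w]);
                else
                    MPI_Irecv(buf + (size_t)w * msg_size, msg_size, MPI_CHAR,
                              peer, 0, MPI_COMM_WORLD, &req[w]);
            }
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        }

        double elapsed = MPI_Wtime() - t0;
        free(buf);
        return (double)msg_size * WINDOW * iters / elapsed / 1e6;
    }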
>>
>> I'm sure there are plenty of additional things that could be
>> done here to test different CPU to process layouts, etc, but
>> in testing my own real-world code, I've found that,
>> unfortunately, "it depends." I have some code that nearly
>> scales linearly (multiple computationally expensive operations
>> inside the innermost loop) and some that scales like the
>> STREAM results above ("add one to the next 20 million points") ...
>>
>> As always, your mileage may vary. If your speedup looks like
>> the STREAM numbers above, you're likely memory bound. Try to
>> reformulate your problem to go through memory slower but with
>> more done each pass, or invest in a cluster. At some point --
>> for some problems -- you can't beat more memory busses!
>>
>> Cheers,
>> Eric Borisch
>>
>> --
>> borisch.eric at mayo.edu
>> MRI Research
>> Mayo Clinic
>>
>> On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa
>> <gus at ldeo.columbia.edu> wrote:
>>
>> Hello Sami and list
>>
>> Oh, well, as you see, an expert who claims to know the
>> answers to these problems
>> seems not to be willing to share these answers with less
>> knowledgeable MPI users like us.
>> So, maybe we can find the answers ourselves, not by
>> individual "homework" brainstorming,
>> but through community collaboration and generous
>> information sharing,
>> which is the hallmark of this mailing list.
>>
>> I Googled around today to find out how to assign MPI
>> processes to specific processors,
>> and I found some interesting information on how to do it.
>>
>> Below is a link to a posting from the computational fluid
>> dynamics (CFD) community that may be of interest.
>> Not surprisingly, they are struggling with the same type
>> of problems all of us have,
>> including how to tie MPI processes to specific processors:
>>
>>
>> http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006
>>
>> I would summarize these problems as related to three types
>> of bottleneck:
>>
>> 1) Multicore processor bottlenecks (standalone machines
>> and clusters)
>> 2) Network fabric bottlenecks (clusters)
>> 3) File system bottlenecks (clusters)
>>
>> All three types of problems are due to contention for some
>> type of system resource by the MPI processes that take part
>> in a computation/program.
>>
>> Our focus on this thread, started by Zach, has been on
>> problem 1),
>> although most of us may need to look into problems 2) and
>> 3) sooner or later.
>> (I have all the three of them already!)
>>
>> The CFD folks use MPI as we do.
>> They seem to use another MPI flavor, but the same problems
>> are there.
>> The problems are not caused by MPI itself, but they become
>> apparent when you run MPI programs.
>> That has been my experience too.
>>
>> As for how to map the MPI processes to specific processors
>> (or cores), the key command seems to be "taskset", as my
>> googling afternoon showed.
>> Try "man taskset" for more info.
>>
>> For a standalone machine like yours, something like the
>> command line below should work to force execution on
>> "processors" 0 and 2 (which in my case are two different
>> physical CPUs):
>>
>> mpiexec -n 2 taskset -c 0,2 my_mpi_program
>>
>> You need to check on your computer ("more /proc/cpuinfo")
>> what are the exact "processor" numbers that correspond to
>> separate physical CPUs. Most likely they are the even
>> numbered processors only, or the odd numbered only,
>> since you have dual-core CPUs (integers modulo 2), with
>> "processors" 0,1 being the two cores of the first physical
>> CPU, "processors" 2,3 the cores of the second physical CPU,
>> and so on.
>> At least, this is what I see on my dual-core
>> dual-processor machine.
>> I would say for quad-cores the separate physical CPUs
>> would be processors 0,4,8, etc, or 1,5,9, etc, and so on
>> (integers modulo 4), with "processors" 0,1,2,3 being the
>> four cores in the first physical CPU, and so on.
>> In /proc/cpuinfo look for the keyword "processor".
>> These are the numbers you need to use in "taskset -c".
>> However, other helpful information comes in the keywords
>> "physical id", "core id", "siblings", and "cpu cores".
>> They will allow you to map cores and physical CPUs to
>> the "processor" number.
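A quick way to see that mapping in one shot (a hypothetical one-liner, not from the original message):

    grep -E "^processor|^physical id|^core id" /proc/cpuinfo

Entries sharing a "physical id" are cores of the same package; picking one "processor" number per "physical id" gives taskset -c one core on each physical CPU.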
>>
>> The "taskset" command line above worked on one of my
>> standalone multicore machines, and I hope a variant of it
>> will work on your machine also.
>> It works with the "mpiexec" that comes with the MPICH
>> distribution, and also with the "mpiexec" associated with
>> the Torque/PBS batch system, which is nice for clusters as well.
>>
>> "Taskset" can change the default behavior of the Linux
>> scheduler, which is to allow processes to be moved from one
>> core/CPU to another during execution.
>> The scheduler does this to ensure optimal CPU use (i.e.
>> load balance).
>> With taskset you can force execution to happen on the
>> cores you specify on the command line,
>> i.e. you can force the so called "CPU affinity" you wish.
>> Note that the "taskset" man page uses both the terms "CPU"
>> and "processor", and doesn't use the term "core",
>> which may be a bit confusing. Make no mistake,
>> "processor" and "CPU" there stand for what we've been
>> calling "core" here.
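The same CPU affinity can also be requested from inside a program; a minimal sketch using the Linux sched_setaffinity() call (illustrative only; the message above only discusses the taskset command, and the helper name is hypothetical):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to logical CPU 'cpu' (same effect as taskset -c cpu). */
    int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }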
>>
>> Other postings that you may find useful on closely related
>> topics are:
>>
>> http://www.ibm.com/developerworks/linux/library/l-scheduler/
>> http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>>
>> I hope this helps,
>>
>> Still, we have a long way to go to sort out how much of
>> the multicore bottleneck can
>> be ascribed to lack of memory bandwidth, and how much may
>> be perhaps associated to how
>> memcpy is compiled by different compilers,
>> or if there are other components of this problem that we
>> don't see now.
>>
>> Maybe our community won't find a solution to Zach's
>> problem: "Why is my quad core slower than cluster?"
>> However, I hope that through collaboration, and by sharing
>> information,
>> we may be able to nail down the root of the problem,
>> and perhaps to find ways to improve the alarmingly bad
>> performance
>> some of us have reported on multicore machines.
>>
>>
>> Gus Correa
>>
>> --
>> ---------------------------------------------------------------------
>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>> Lamont-Doherty Earth Observatory - Columbia University
>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>>
>>
>
>