<div dir="ltr">If you people are going to fight like children and clog up my email with garbage like this, at least learn how to spell.<br><br><div class="gmail_quote">On Thu, Jul 17, 2008 at 10:20 AM, chong tan &lt;<a href="mailto:chong_guan_tan@yahoo.com">chong_guan_tan@yahoo.com</a>&gt; wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-size: inherit; line-height: inherit; font-size-adjust: inherit; font-stretch: inherit;" valign="top">
<p>grow up Gib.&nbsp; there is no NDA between empoer and employee in the USA.&nbsp; maybe there is in New Zealand, but I certainly don&#39;t know anything about New Zealand.&nbsp; I only know taht when a person buy a computer in NZ, the computer user count in NZ just doubled.</p>

<p>tan</p>
<p><br><br>--- On <b>Wed, 7/16/08, Gib Bogle <i>&lt;<a href="mailto:g.bogle@auckland.ac.nz" target="_blank">g.bogle@auckland.ac.nz</a>&gt;</i></b> wrote:<br></p>
<blockquote style="border-left: 2px solid rgb(16, 16, 255); padding-left: 5px; margin-left: 5px;">From: Gib Bogle &lt;<a href="mailto:g.bogle@auckland.ac.nz" target="_blank">g.bogle@auckland.ac.nz</a>&gt;<div class="Ih2E3d">
<br>Subject: Re: [mpich-discuss] Why is my quad core slower than cluster<br>To: <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br></div>Date: Wednesday, July 16, 2008, 6:31 PM<div>
<div></div><div class="Wj3C7c"><br><br><pre>And are we to believe that your employer&#39;s NDA doesn&#39;t permit you to say
WHICH answer is wrong?  Whew!

Gib

chong tan wrote:
&gt; Just FYI,
&gt; 
&gt; from my knowledge, at least 1 answer to the question in that thread is 
&gt; absolutely wrong, according to HW information on hand.  Some of the info 
&gt; in that thread are not applicable across the board, and the original 
&gt; question : threaded application, is not answered.
&gt; 
&gt;  
&gt; 
&gt; whether to use numactl on NUMA system is situation dependent.  In 
&gt; general, numactl is bad if you over subscribe the system.
&gt; 
&gt;  
&gt; 
&gt; tan
&gt; 
&gt; 
&gt; 
&gt; --- On *Tue, 7/15/08, Robert Kubrick /&lt;<a href="mailto:robertkubrick@gmail.com" target="_blank">robertkubrick@gmail.com</a>&gt;/*
wrote:
&gt; 
&gt;     From: Robert Kubrick &lt;<a href="mailto:robertkubrick@gmail.com" target="_blank">robertkubrick@gmail.com</a>&gt;
&gt;     Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
&gt;     To: <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a>
&gt;     Date: Tuesday, July 15, 2008, 4:06 PM
&gt; 
&gt;     A recent (long) discussion about numactl and taskset on the beowulf
&gt;     mailing list:
&gt;     <a href="http://www.beowulf.org/archive/2008-June/021810.html" target="_blank">http://www.beowulf.org/archive/2008-June/021810.html</a>
&gt; 
&gt;     On Jul 15, 2008, at 1:35 PM, chong tan wrote:
&gt; 
&gt;&gt;     Eric,
&gt;&gt;
&gt;&gt;     I know you are referring me as the one not sharing.  I am no
&gt;&gt;     expert on MP, but someone who have done his homeworks.  I like to
&gt;&gt;     share, but the NDAs and company policy say no.   
&gt;&gt;
&gt;&gt;     You have good points and did some good experiements.  That is what
&gt;&gt;     I expect most MP designers and users to have done at the first
place.
&gt;&gt;
&gt;&gt;     There answers to the original question are simple :
&gt;&gt;
&gt;&gt;     - on 2Xquad, you have one memory system, while on cluster, you
&gt;&gt;     have 8 memory systems, the total bandwidth favor the cluster
&gt;&gt;     considerably.
&gt;&gt;
&gt;&gt;     - on cluster, there is not way for the process to be context
&gt;&gt;     switched, while that can happen on 2XQuad.  When this happens,
&gt;&gt;     live is bad.
&gt;&gt;
&gt;&gt;     - The only thing that favor the SMP is the cost of communication
&gt;&gt;     and shared memory.
&gt;&gt;
&gt;&gt;      
&gt;&gt;
&gt;&gt;     There are more factors, Thea rt is balancing them to your favor. 
&gt;&gt;     In a way, the X86 Quad are not designed to let us load it up with
&gt;&gt;     fat adnd heavy processes.  That is what I have been saying all
&gt;&gt;     along: know your HW first.  Your MP solution should come second. 
&gt;&gt;     Whatever utilities you can find will help put the solution
together.
&gt;&gt;
&gt;&gt;      
&gt;&gt;
&gt;&gt;     So, the problem is not MPI in this case.
&gt;&gt;
&gt;&gt;      
&gt;&gt;
&gt;&gt;     tan
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;     --- On *Mon, 7/14/08, Eric A. Borisch /&lt;<a href="mailto:eborisch@ieee.org" target="_blank">eborisch@ieee.org</a>
&gt;&gt;     &lt;mailto:<a href="mailto:eborisch@ieee.org" target="_blank">eborisch@ieee.org</a>&gt;&gt;/* wrote:
&gt;&gt;
&gt;&gt;         From: Eric A. Borisch &lt;<a href="mailto:eborisch@ieee.org" target="_blank">eborisch@ieee.org</a>
&gt;&gt;         &lt;mailto:<a href="mailto:eborisch@ieee.org" target="_blank">eborisch@ieee.org</a>&gt;&gt;
&gt;&gt;         Subject: Re: [mpich-discuss] Why is my quad core slower than
&gt;&gt;         cluster
&gt;&gt;         To: <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a>
&lt;mailto:<a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a>&gt;
&gt;&gt;         Date: Monday, July 14, 2008, 9:36 PM
&gt;&gt;
&gt;&gt;         Gus,
&gt;&gt;
&gt;&gt;         Information sharing is truly the point of the mailing list.
&gt;&gt;         Useful messages should ask questions or provide answers! :)
&gt;&gt;
&gt;&gt;         Someone mentioned STREAM benchmarks (memory BW benchmarks) a
&gt;&gt;         little while back. I did these when our new system came in a
&gt;&gt;         while ago, so I dug them back out.
&gt;&gt;
&gt;&gt;         This (STREAM) can be compiled to use MPI, but it is only a
&gt;&gt;         synchronization tool, the benchmark is still a memory bus test
&gt;&gt;         (each task is trying to run through memory, but this is not an
&gt;&gt;         MPI communication test.)
&gt;&gt;
&gt;&gt;         My results on a dual E5472 machine (Two Quad-core 3Ghz
&gt;&gt;         packages; 1600MHz bus; 8 total cores)
&gt;&gt;
&gt;&gt;         Results (each set are [1..8] processes in order),
&gt;&gt;         double-precision array size = 20,000,000, run through 10
times.
&gt;&gt;
&gt;&gt;         Function     Rate (MB/s)  Avg time   Min time  Max time
&gt;&gt;         Copy:       2962.6937      0.1081      0.1080      0.1081
&gt;&gt;         Copy:       5685.3008      0.1126      0.1126      0.1128
&gt;&gt;         Copy:       5484.6846      0.1751      0.1750      0.1751
&gt;&gt;         Copy:       7085.7959      0.1809      0.1806      0.1817
&gt;&gt;         Copy:       5981.6033      0.2676      0.2675      0.2676
&gt;&gt;         Copy:       7071.2490      0.2718      0.2715      0.2722
&gt;&gt;         Copy:       6537.4934      0.3427      0.3426      0.3428
&gt;&gt;         Copy:       7423.4545      0.3451      0.3449      0.3455
&gt;&gt;
&gt;&gt;         Scale:      3011.8445      0.1063      0.1062      0.1063
&gt;&gt;         Scale:      5675.8162      0.1128      0.1128      0.1129
&gt;&gt;         Scale:      5474.8854      0.1754      0.1753      0.1754
&gt;&gt;         Scale:      7068.6204      0.1814      0.1811      0.1819
&gt;&gt;         Scale:      5974.6112      0.2679      0.2678      0.2680
&gt;&gt;         Scale:      7063.8307      0.2721      0.2718      0.2725
&gt;&gt;         Scale:      6533.4473      0.3430      0.3429      0.3431
&gt;&gt;         Scale:      7418.6128      0.3453      0.3451      0.3456
&gt;&gt;
&gt;&gt;         Add:        3184.3129      0.1508      0.1507      0.1508
&gt;&gt;         Add:        5892.1781      0.1631      0.1629      0.1633
&gt;&gt;         Add:        5588.0229      0.2577      0.2577      0.2578
&gt;&gt;         Add:        7275.0745      0.2642      0.2639      0.2646
&gt;&gt;         Add:        6175.7646      0.3887      0.3886      0.3889
&gt;&gt;         Add:        7262.7112      0.3970      0.3965      0.3976
&gt;&gt;         Add:        6687.7658      0.5025      0.5024      0.5026
&gt;&gt;         Add:        7599.2516      0.5057      0.5053      0.5062
&gt;&gt;
&gt;&gt;         Triad:      3224.7856      0.1489      0.1488      0.1489
&gt;&gt;         Triad:      6021.2613      0.1596      0.1594      0.1598
&gt;&gt;         Triad:      5609.9260      0.2567      0.2567      0.2568
&gt;&gt;         Triad:      7293.2790      0.2637      0.2633      0.2641
&gt;&gt;         Triad:      6185.4376      0.3881      0.3880      0.3881
&gt;&gt;         Triad:      7279.1231      0.3958      0.3957      0.3961
&gt;&gt;         Triad:      6691.8560      0.5022      0.5021      0.5022
&gt;&gt;         Triad:      7604.1238      0.5052      0.5050      0.5057
&gt;&gt;
&gt;&gt;         These work out to (~):
&gt;&gt;         1x
&gt;&gt;         1.9x
&gt;&gt;         1.8x
&gt;&gt;         2.3x
&gt;&gt;         1.9x
&gt;&gt;         2.2x
&gt;&gt;         2.1x
&gt;&gt;         2.4x
&gt;&gt;          
&gt;&gt;         for [1..8] cores.
&gt;&gt;
&gt;&gt;         As you can see, it doesn&#39;t take eight cores to saturate
the
&gt;&gt;         bus, even with a 1600MHz bus. Four of the eight cores running
&gt;&gt;         does this trick.
&gt;&gt;
&gt;&gt;         With all that said, there are still advantages to be had with
&gt;&gt;         the multicore chipsets, but only if you&#39;re not blowing
full
&gt;&gt;         tilt through memory. If it can fit the problem, do more inside
&gt;&gt;         a loop rather than running multiple loops over the same
memory. 
&gt;&gt;
&gt;&gt;         For reference, here&#39;s what using the osu_mbw_mr test (from
&gt;&gt;         MVAPICH2 1.0.2; I also have a cluster running nearby :)
&gt;&gt;         compiled on MPICH2 (1.0.7rc1 with nemesis provides this
&gt;&gt;         performance from one/two/four pairs (2/4/8 processes) of
&gt;&gt;         producer/consumers:
&gt;&gt;
&gt;&gt;         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
&gt;&gt;         # [ pairs: 1 ] [ window size: 64 ]
&gt;&gt;
&gt;&gt;         #  Size    MB/sec    Messages/sec
&gt;&gt;               1      1.08   1076540.83
&gt;&gt;               2      2.14   1068102.24
&gt;&gt;               4      3.99    997382.24
&gt;&gt;               8      7.97    996419.66
&gt;&gt;              16     15.95    996567.63
&gt;&gt;              32     31.67    989660.29
&gt;&gt;              64     62.73    980084.91
&gt;&gt;             128    124.12    969676.18
&gt;&gt;             256    243.59    951527.62
&gt;&gt;             512    445.52    870159.34
&gt;&gt;            1024    810.28    791284.80
&gt;&gt;            2048   1357.25    662721.78
&gt;&gt;            4096   1935.08    472431.28
&gt;&gt;            8192   2454.29    299596.49
&gt;&gt;           16384   2717.61    165869.84
&gt;&gt;           32768   2900.23     88507.85
&gt;&gt;           65536   2279.71     34785.63
&gt;&gt;          131072   2540.51     19382.53
&gt;&gt;          262144   1335.16      5093.21
&gt;&gt;          524288   1364.05      2601.72
&gt;&gt;         1048576   1378.39      1314.53
&gt;&gt;         2097152   1380.78       658.41
&gt;&gt;         4194304   1343.48       320.31
&gt;&gt;
&gt;&gt;         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
&gt;&gt;         # [ pairs: 2 ] [ window size: 64 ]
&gt;&gt;
&gt;&gt;         #  Size    MB/sec    Messages/sec
&gt;&gt;               1      2.15   2150580.48
&gt;&gt;               2      4.22   2109761.12
&gt;&gt;               4      7.84   1960742.53
&gt;&gt;               8     15.80   1974733.92
&gt;&gt;              16     31.38   1961100.64
&gt;&gt;              32     62.32   1947654.32
&gt;&gt;              64    123.39   1928000.11
&gt;&gt;             128    243.19   1899957.22
&gt;&gt;             256    475.32   1856721.12
&gt;&gt;             512    856.90   1673642.10
&gt;&gt;            1024   1513.19   1477721.26
&gt;&gt;            2048   2312.91   1129351.07
&gt;&gt;            4096   2891.21    705861.12
&gt;&gt;            8192   3267.49    398863.98
&gt;&gt;           16384   3400.64    207558.54
&gt;&gt;           32768   3519.74    107413.93
&gt;&gt;           65536   3141.80     47940.04
&gt;&gt;          131072   3368.65     25700.76
&gt;&gt;          262144   2211.53      8436.31
&gt;&gt;          524288   2264.90      4319.95
&gt;&gt;         1048576   2282.69      2176.94
&gt;&gt;         2097152   2250.72      1073.23
&gt;&gt;         4194304   2087.00       497.58
&gt;&gt;
&gt;&gt;         # OSU MPI Multi BW / Message Rate Test (Version 1.0)
&gt;&gt;         # [ pairs: 4 ] [ window size: 64 ]
&gt;&gt;
&gt;&gt;         #  Size    MB/sec    Messages/sec
&gt;&gt;               1      3.65   3651934.64
&gt;&gt;               2      8.16   4080341.34
&gt;&gt;               4     15.66   3914908.02
&gt;&gt;               8     31.32   3915621.85
&gt;&gt;              16     62.67   3916764.51
&gt;&gt;              32    124.37   3886426.18
&gt;&gt;              64    246.38   3849640.84
&gt;&gt;             128    486.39   3799914.44
&gt;&gt;             256    942.40   3681232.25
&gt;&gt;             512   1664.21   3250414.19
&gt;&gt;            1024   2756.50   2691891.86
&gt;&gt;            2048   3829.45   1869848.54
&gt;&gt;            4096   4465.25   1090148.56
&gt;&gt;            8192   4777.45    583184.51
&gt;&gt;           16384   4822.75    294357.30
&gt;&gt;           32768   4829.77    147392.80
&gt;&gt;           65536   4556.93     69533.18
&gt;&gt;          131072   4789.32     36539.60
&gt;&gt;          262144   3631.68     13853.75
&gt;&gt;          524288   3679.31      7017.72
&gt;&gt;         1048576   3553.61      3388.99
&gt;&gt;         2097152   3113.12      1484.45
&gt;&gt;         4194304   2452.69       584.77
&gt;&gt;
&gt;&gt;         So from a messaging standpoint, you can see that you squeeze
&gt;&gt;         more data through with more processes; I&#39;d guess that this
is
&gt;&gt;         because there&#39;s processing to be done within MPI to move
the
&gt;&gt;         data, and a lot of the bookkeeping steps probably cache well
&gt;&gt;         (updating the same status structure on a communication
&gt;&gt;         multiple times; perhaps reusing the structure for subsequent
&gt;&gt;         transfers and finding it still in cache) so the performance
&gt;&gt;         scaling is not completely FSB bound.
&gt;&gt;
&gt;&gt;         I&#39;m sure there&#39;s plenty of additional things that
could be
&gt;&gt;         done here to test different CPU to process layouts, etc, but
&gt;&gt;         in testing my own real-world code, I&#39;ve found that,
&gt;&gt;         unfortunately, &quot;it depends.&quot; I have some code that
nearly
&gt;&gt;         scales linearly (multiple computationally expensive operations
&gt;&gt;         inside the innermost loop) and some that scales like the
&gt;&gt;         STREAM results above (&quot;add one to the next 20 million
points&quot;) ...
&gt;&gt;
&gt;&gt;         As always, your mileage may vary. If your speedup looks like
&gt;&gt;         the STREAM numbers above, you&#39;re likely memory bound. Try
to
&gt;&gt;         reformulate your problem to go through memory slower but with
&gt;&gt;         more done each pass, or invest in a cluster. At some point --
&gt;&gt;         for some problems -- you can&#39;t beat more memory busses!
&gt;&gt;
&gt;&gt;         Cheers,
&gt;&gt;          Eric Borisch
&gt;&gt;
&gt;&gt;         --
&gt;&gt;          <a href="mailto:borisch.eric@mayo.edu" target="_blank">borisch.eric@mayo.edu</a> &lt;mailto:<a href="mailto:borisch.eric@mayo.edu" target="_blank">borisch.eric@mayo.edu</a>&gt;
&gt;&gt;          MRI Research
&gt;&gt;          Mayo Clinic
&gt;&gt;
&gt;&gt;         On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa
&gt;&gt;         &lt;<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>
&lt;mailto:<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>&gt;&gt; wrote:
&gt;&gt;
&gt;&gt;             Hello Sami and list
&gt;&gt;
&gt;&gt;             Oh, well, as you see, an expert who claims to know the
&gt;&gt;             answers to these problems
&gt;&gt;             seems not to be willing to share these answers with less
&gt;&gt;             knowledgeable MPI users like us.
&gt;&gt;             So, maybe we can find the answers ourselves, not by
&gt;&gt;             individual &quot;homework&quot; brainstorming,
&gt;&gt;             but through community collaboration and generous
&gt;&gt;             information sharing,
&gt;&gt;             which is the hallmark of this mailing list.
&gt;&gt;
&gt;&gt;             I Googled around today to find out how to assign MPI
&gt;&gt;             processes to specific processors,
&gt;&gt;             and I found some interesting information on how to do it.
&gt;&gt;
&gt;&gt;             Below is a link to a posting from the computational fluid
&gt;&gt;             dynamics (CFD) community that may be of interest.
&gt;&gt;             Not surprisingly, they are struggling with the same type
&gt;&gt;             of problems all of us have,
&gt;&gt;             including how to tie MPI processes to specific processors:
&gt;&gt;
&gt;&gt;            
<a href="http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006" target="_blank">http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006</a>
&gt;&gt;            
&lt;<a href="http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006" target="_blank">http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006</a>&gt;
&gt;&gt;
&gt;&gt;             I would summarize these problems as related to three types
&gt;&gt;             of bottleneck:
&gt;&gt;
&gt;&gt;             1) Multicore processor bottlenecks (standalone machines
&gt;&gt;             and clusters)
&gt;&gt;             2) Network fabric bottlenecks (clusters)
&gt;&gt;             3) File system bottlenecks (clusters)
&gt;&gt;
&gt;&gt;             All three types of problems are due to contention for some
&gt;&gt;             type of system resource
&gt;&gt;             by the MPI processes that take part in a
computation/program.
&gt;&gt;
&gt;&gt;             Our focus on this thread, started by Zach, has been on
&gt;&gt;             problem 1),
&gt;&gt;             although most of us may need to look into problems 2) and
&gt;&gt;             3) sooner or later.
&gt;&gt;             (I have all the three of them already!)
&gt;&gt;
&gt;&gt;             The CFD folks use MPI as we do.
&gt;&gt;             They seem to use another MPI flavor, but the same problems
&gt;&gt;             are there.
&gt;&gt;             The problems are not caused by MPI itself, but they become
&gt;&gt;             apparent when you run MPI programs.
&gt;&gt;             That has been my experience too.
&gt;&gt;
&gt;&gt;             As for how to map the MPI processes to specific processors
&gt;&gt;             (or cores),
&gt;&gt;             the key command seems to be &quot;taskset&quot;, as my
googling
&gt;&gt;             afternoon showed.
&gt;&gt;             Try &quot;man taskset&quot; for more info.
&gt;&gt;
&gt;&gt;             For a standalone machine like yours, something like the
&gt;&gt;             command line below should work to
&gt;&gt;             force execution on &quot;processors&quot; 0 and 2 (which
in my case
&gt;&gt;             are two different physical CPUs):
&gt;&gt;
&gt;&gt;             mpiexec -n 2 taskset -c 0,2  my_mpi_program
&gt;&gt;
&gt;&gt;             You need to check on your computer (&quot;more
/proc/cpuinfo&quot;)
&gt;&gt;             what are the exact &quot;processor&quot; numbers that
correspond to
&gt;&gt;             separate physical CPUs. Most likely they are the even
&gt;&gt;             numbered processors only, or the odd numbered only,
&gt;&gt;             since you have dual-core CPUs (integers module 2), with
&gt;&gt;             &quot;processors&quot; 0,1 being the four
&gt;&gt;             cores of the first physical CPU, &quot;processors&quot;
2,3 the
&gt;&gt;             cores of the second physical CPU, and so on.
&gt;&gt;             At least, this is what I see on my dual-core
&gt;&gt;             dual-processor machine.
&gt;&gt;             I would say for quad-cores the separate physical CPUs
&gt;&gt;             would be processors 0,4,8, etc,
&gt;&gt;             or 1,5,7, etc, and so on (integers module 4), with
&gt;&gt;             &quot;processors&quot; 0,1,2,3 being the four cores
&gt;&gt;             in the first physical CPU, and so on.
&gt;&gt;             In /proc/cpuinfo look for the keyword
&quot;processor&quot;.
&gt;&gt;             These are the numbers you need to use in &quot;taskset
-c&quot;.
&gt;&gt;             However, other helpful information comes in the keywords
&gt;&gt;             &quot;physical id&quot;,
&gt;&gt;             &quot;core id&quot;, &quot;siblings&quot;, and &quot;cpu
cores&quot;.
&gt;&gt;             They will allow you to map cores and physical CPUs to
&gt;&gt;             the &quot;processor&quot; number.
&gt;&gt;
&gt;&gt;             The &quot;taskset&quot;  command line above worked in one
of my
&gt;&gt;             standalone multicore machines,
&gt;&gt;             and I hope a variant of it will work on your machine also.
&gt;&gt;             It works with the &quot;mpiexec&quot; that comes with the
MPICH
&gt;&gt;             distribution, and also with
&gt;&gt;             the &quot;mpiexec&quot; associated to the Torque/PBS batch
system,
&gt;&gt;             which is nice for clusters as well.
&gt;&gt;
&gt;&gt;             &quot;Taskset&quot; can change the default behavior of the
Linux
&gt;&gt;             scheduler, which is to allow processes to
&gt;&gt;             be moved from one core/CPU to another during execution.
&gt;&gt;             The scheduler does this to ensure optimal CPU use (i.e.
&gt;&gt;             load balance).
&gt;&gt;             With taskset you can force execution to happen on the
&gt;&gt;             cores you specify on the command line,
&gt;&gt;             i.e. you can force the so called &quot;CPU affinity&quot;
you wish.
&gt;&gt;             Note that the &quot;taskset&quot; man page uses both the
terms &quot;CPU&quot;
&gt;&gt;             and &quot;processor&quot;, and doesn&#39;t use the term
&quot;core&quot;,
&gt;&gt;             which may be  a bit confusing. Make no mistake,
&gt;&gt;             &quot;processor&quot; and &quot;CPU&quot; there stand for
what we&#39;ve been
&gt;&gt;             calling &quot;core&quot; here.
&gt;&gt;
&gt;&gt;             Other postings that you may find useful on closely related
&gt;&gt;             topics are:
&gt;&gt;
&gt;&gt;            
<a href="http://www.ibm.com/developerworks/linux/library/l-scheduler/" target="_blank">http://www.ibm.com/developerworks/linux/library/l-scheduler/</a>
&gt;&gt;            
&lt;<a href="http://www.ibm.com/developerworks/linux/library/l-scheduler/" target="_blank">http://www.ibm.com/developerworks/linux/library/l-scheduler/</a>&gt;
&gt;&gt;            
<a href="http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html" target="_blank">http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html</a>
&gt;&gt;            
&lt;<a href="http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html" target="_blank">http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html</a>&gt;
&gt;&gt;
&gt;&gt;             I hope this helps,
&gt;&gt;
&gt;&gt;             Still, we have a long way to go to sort out how much of
&gt;&gt;             the multicore bottleneck can
&gt;&gt;             be ascribed to lack of memory bandwidth, and how much may
&gt;&gt;             be  perhaps associated to how
&gt;&gt;             memcpy is compiled by different compilers,
&gt;&gt;             or if there are other components of this problem that we
&gt;&gt;             don&#39;t see now.
&gt;&gt;
&gt;&gt;             Maybe our community won&#39;t find a solution to
Zach&#39;s
&gt;&gt;             problem: &quot;Why is my quad core slower than
cluster?&quot;
&gt;&gt;             However, I hope that through collaboration, and by sharing
&gt;&gt;             information,
&gt;&gt;             we may be able to nail down the root of the problem,
&gt;&gt;             and perhaps to find ways to improve the alarmingly bad
&gt;&gt;             performance
&gt;&gt;             some of us have reported on multicore machines.
&gt;&gt;
&gt;&gt;
&gt;&gt;             Gus Correa
&gt;&gt;
&gt;&gt;             -- 
&gt;&gt;             ------------------------------
&gt;&gt;             ---------------------------------------
&gt;&gt;             Gustavo J. Ponce Correa, PhD - Email:
&gt;&gt;             <a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a> &lt;mailto:<a href="mailto:gus@ldeo.columbia.edu" target="_blank">gus@ldeo.columbia.edu</a>&gt;
&gt;&gt;             Lamont-Doherty Earth Observatory - Columbia University
&gt;&gt;             P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 -
USA
&gt;&gt;            
---------------------------------------------------------------------
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt;&gt;
&gt; 
&gt;</pre></div></div></blockquote></td></tr></tbody></table><br>

      </blockquote></div><br></div>