<div dir="ltr">If you people are going to fight like children and clog up my email with garbage like this, at least learn how to spell.<br><br><div class="gmail_quote">On Thu, Jul 17, 2008 at 10:20 AM, chong tan <<a href="mailto:chong_guan_tan@yahoo.com">chong_guan_tan@yahoo.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-size: inherit; line-height: inherit; font-size-adjust: inherit; font-stretch: inherit;" valign="top">
Grow up, Gib. There is no NDA between employer and employee in the USA. Maybe there is in New Zealand, but I certainly don't know anything about New Zealand. I only know that when a person buys a computer in NZ, the computer user count in NZ just doubles.

tan
--- On Wed, 7/16/08, Gib Bogle <g.bogle@auckland.ac.nz> wrote:

From: Gib Bogle <g.bogle@auckland.ac.nz>
Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
To: mpich-discuss@mcs.anl.gov
Date: Wednesday, July 16, 2008, 6:31 PM

And are we to believe that your employer's NDA doesn't permit you to say
WHICH answer is wrong? Whew!
Gib
chong tan wrote:
> Just FYI,
>
> From my knowledge, at least one answer to the question in that thread is
> absolutely wrong, according to HW information on hand. Some of the info
> in that thread is not applicable across the board, and the original
> question, about a threaded application, is not answered.
>
>
>
> Whether to use numactl on a NUMA system is situation dependent. In
> general, numactl is bad if you oversubscribe the system.
>
>
>
> tan
>
>
>
> --- On Tue, 7/15/08, Robert Kubrick <robertkubrick@gmail.com> wrote:
>
> From: Robert Kubrick <robertkubrick@gmail.com>
> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
> To: mpich-discuss@mcs.anl.gov
> Date: Tuesday, July 15, 2008, 4:06 PM
>
> A recent (long) discussion about numactl and taskset on the beowulf
> mailing list:
> <a href="http://www.beowulf.org/archive/2008-June/021810.html" target="_blank">http://www.beowulf.org/archive/2008-June/021810.html</a>
>
> On Jul 15, 2008, at 1:35 PM, chong tan wrote:
>
>> Eric,
>>
>> I know you are referring to me as the one not sharing. I am no
>> expert on MP, but someone who has done his homework. I would like to
>> share, but the NDAs and company policy say no.
>>
>> You have good points and did some good experiments. That is what
>> I expect most MP designers and users to have done in the first place.
>>
>> The answers to the original question are simple:
>>
>> - on a 2Xquad, you have one memory system, while on the cluster you
>> have 8 memory systems; the total bandwidth favors the cluster
>> considerably.
>>
>> - on the cluster, there is no way for the process to be context
>> switched, while that can happen on the 2XQuad. When this happens,
>> life is bad.
>>
>> - The only thing that favors the SMP is the cost of communication
>> and shared memory.
>>
>>
>>
>> There are more factors. The art is balancing them in your favor.
>> In a way, the X86 quads are not designed to let us load them up with
>> fat and heavy processes. That is what I have been saying all
>> along: know your HW first. Your MP solution should come second.
>> Whatever utilities you can find will help put the solution together.
>>
>>
>>
>> So, the problem is not MPI in this case.
>>
>>
>>
>> tan
>>
>>
>>
>> --- On Mon, 7/14/08, Eric A. Borisch <eborisch@ieee.org> wrote:
>>
>> From: Eric A. Borisch <eborisch@ieee.org>
>> Subject: Re: [mpich-discuss] Why is my quad core slower than cluster
>> To: mpich-discuss@mcs.anl.gov
>> Date: Monday, July 14, 2008, 9:36 PM
>>
>> Gus,
>>
>> Information sharing is truly the point of the mailing list.
>> Useful messages should ask questions or provide answers! :)
>>
>> Someone mentioned STREAM benchmarks (memory BW benchmarks) a
>> little while back. I did these when our new system came in a
>> while ago, so I dug them back out.
>>
>> This (STREAM) can be compiled to use MPI, but MPI is only a
>> synchronization tool; the benchmark is still a memory bus test
>> (each task is trying to run through memory, but this is not an
>> MPI communication test).
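>>
>> (For reference, the heart of each STREAM kernel is just a simple array
>> sweep; a minimal sketch of the Triad loop, assuming the usual stream.c
>> layout with three double arrays of the size quoted below:)
>>
>>     /* STREAM Triad kernel, sketched: each iteration reads two doubles
>>      * and writes one, so the loop is limited by memory bandwidth,
>>      * not by arithmetic. */
>>     #define N 20000000
>>     static double a[N], b[N], c[N];
>>
>>     void triad(double scalar)
>>     {
>>         for (long j = 0; j < N; j++)
>>             a[j] = b[j] + scalar * c[j];
>>     }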
>>
>> My results on a dual E5472 machine (two quad-core 3GHz
>> packages; 1600MHz bus; 8 total cores):
>>
>> Results (each set are [1..8] processes in order),
>> double-precision array size = 20,000,000, run through 10 times.
>>
>> Function Rate (MB/s) Avg time Min time Max time
>> Copy: 2962.6937 0.1081 0.1080 0.1081
>> Copy: 5685.3008 0.1126 0.1126 0.1128
>> Copy: 5484.6846 0.1751 0.1750 0.1751
>> Copy: 7085.7959 0.1809 0.1806 0.1817
>> Copy: 5981.6033 0.2676 0.2675 0.2676
>> Copy: 7071.2490 0.2718 0.2715 0.2722
>> Copy: 6537.4934 0.3427 0.3426 0.3428
>> Copy: 7423.4545 0.3451 0.3449 0.3455
>>
>> Scale: 3011.8445 0.1063 0.1062 0.1063
>> Scale: 5675.8162 0.1128 0.1128 0.1129
>> Scale: 5474.8854 0.1754 0.1753 0.1754
>> Scale: 7068.6204 0.1814 0.1811 0.1819
>> Scale: 5974.6112 0.2679 0.2678 0.2680
>> Scale: 7063.8307 0.2721 0.2718 0.2725
>> Scale: 6533.4473 0.3430 0.3429 0.3431
>> Scale: 7418.6128 0.3453 0.3451 0.3456
>>
>> Add: 3184.3129 0.1508 0.1507 0.1508
>> Add: 5892.1781 0.1631 0.1629 0.1633
>> Add: 5588.0229 0.2577 0.2577 0.2578
>> Add: 7275.0745 0.2642 0.2639 0.2646
>> Add: 6175.7646 0.3887 0.3886 0.3889
>> Add: 7262.7112 0.3970 0.3965 0.3976
>> Add: 6687.7658 0.5025 0.5024 0.5026
>> Add: 7599.2516 0.5057 0.5053 0.5062
>>
>> Triad: 3224.7856 0.1489 0.1488 0.1489
>> Triad: 6021.2613 0.1596 0.1594 0.1598
>> Triad: 5609.9260 0.2567 0.2567 0.2568
>> Triad: 7293.2790 0.2637 0.2633 0.2641
>> Triad: 6185.4376 0.3881 0.3880 0.3881
>> Triad: 7279.1231 0.3958 0.3957 0.3961
>> Triad: 6691.8560 0.5022 0.5021 0.5022
>> Triad: 7604.1238 0.5052 0.5050 0.5057
>>
>> These work out to (~):
>> 1x
>> 1.9x
>> 1.8x
>> 2.3x
>> 1.9x
>> 2.2x
>> 2.1x
>> 2.4x
>>
>> for [1..8] cores.
>>
>> As you can see, it doesn't take eight cores to saturate the
>> bus, even with a 1600MHz bus. Four of the eight cores running
>> does the trick.
>>
>> With all that said, there are still advantages to be had with
>> the multicore chipsets, but only if you're not blowing full
>> tilt through memory. If it can fit the problem, do more inside
>> a loop rather than running multiple loops over the same memory.
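>>
>> (As a toy illustration of that last point, in C, fusing two passes over
>> the same hypothetical array x of length n into one:)
>>
>>     /* Two separate sweeps: x crosses the memory bus twice. */
>>     void two_sweeps(long n, double *x)
>>     {
>>         for (long i = 0; i < n; i++) x[i] += 1.0;   /* pass 1 over x */
>>         for (long i = 0; i < n; i++) x[i] *= 2.0;   /* pass 2 over x */
>>     }
>>
>>     /* Fused sweep: the same arithmetic, but x crosses the bus once. */
>>     void one_sweep(long n, double *x)
>>     {
>>         for (long i = 0; i < n; i++) x[i] = (x[i] + 1.0) * 2.0;
>>     }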
>>
>> For reference, here's what the osu_mbw_mr test (from
>> MVAPICH2 1.0.2; I also have a cluster running nearby :)
>> compiled on MPICH2 (1.0.7rc1 with nemesis) provides: this is
>> the performance from one/two/four pairs (2/4/8 processes) of
>> producers/consumers:
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 1 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 1.08 1076540.83
>> 2 2.14 1068102.24
>> 4 3.99 997382.24
>> 8 7.97 996419.66
>> 16 15.95 996567.63
>> 32 31.67 989660.29
>> 64 62.73 980084.91
>> 128 124.12 969676.18
>> 256 243.59 951527.62
>> 512 445.52 870159.34
>> 1024 810.28 791284.80
>> 2048 1357.25 662721.78
>> 4096 1935.08 472431.28
>> 8192 2454.29 299596.49
>> 16384 2717.61 165869.84
>> 32768 2900.23 88507.85
>> 65536 2279.71 34785.63
>> 131072 2540.51 19382.53
>> 262144 1335.16 5093.21
>> 524288 1364.05 2601.72
>> 1048576 1378.39 1314.53
>> 2097152 1380.78 658.41
>> 4194304 1343.48 320.31
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 2 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 2.15 2150580.48
>> 2 4.22 2109761.12
>> 4 7.84 1960742.53
>> 8 15.80 1974733.92
>> 16 31.38 1961100.64
>> 32 62.32 1947654.32
>> 64 123.39 1928000.11
>> 128 243.19 1899957.22
>> 256 475.32 1856721.12
>> 512 856.90 1673642.10
>> 1024 1513.19 1477721.26
>> 2048 2312.91 1129351.07
>> 4096 2891.21 705861.12
>> 8192 3267.49 398863.98
>> 16384 3400.64 207558.54
>> 32768 3519.74 107413.93
>> 65536 3141.80 47940.04
>> 131072 3368.65 25700.76
>> 262144 2211.53 8436.31
>> 524288 2264.90 4319.95
>> 1048576 2282.69 2176.94
>> 2097152 2250.72 1073.23
>> 4194304 2087.00 497.58
>>
>> # OSU MPI Multi BW / Message Rate Test (Version 1.0)
>> # [ pairs: 4 ] [ window size: 64 ]
>>
>> # Size MB/sec Messages/sec
>> 1 3.65 3651934.64
>> 2 8.16 4080341.34
>> 4 15.66 3914908.02
>> 8 31.32 3915621.85
>> 16 62.67 3916764.51
>> 32 124.37 3886426.18
>> 64 246.38 3849640.84
>> 128 486.39 3799914.44
>> 256 942.40 3681232.25
>> 512 1664.21 3250414.19
>> 1024 2756.50 2691891.86
>> 2048 3829.45 1869848.54
>> 4096 4465.25 1090148.56
>> 8192 4777.45 583184.51
>> 16384 4822.75 294357.30
>> 32768 4829.77 147392.80
>> 65536 4556.93 69533.18
>> 131072 4789.32 36539.60
>> 262144 3631.68 13853.75
>> 524288 3679.31 7017.72
>> 1048576 3553.61 3388.99
>> 2097152 3113.12 1484.45
>> 4194304 2452.69 584.77
>>
>> So from a messaging standpoint, you can see that you squeeze
>> more data through with more processes; I'd guess that this is
>> because there's processing to be done within MPI to move the
>> data, and a lot of the bookkeeping steps probably cache well
>> (updating the same status structure on a communication
>> multiple times; perhaps reusing the structure for subsequent
>> transfers and finding it still in cache), so the performance
>> scaling is not completely FSB bound.
>>
>> I'm sure there are plenty of additional things that could be
>> done here to test different CPU-to-process layouts, etc., but
>> in testing my own real-world code, I've found that,
>> unfortunately, "it depends." I have some code that nearly
>> scales linearly (multiple computationally expensive operations
>> inside the innermost loop) and some that scales like the
>> STREAM results above ("add one to the next 20 million points") ...
>>
>> As always, your mileage may vary. If your speedup looks like
>> the STREAM numbers above, you're likely memory bound. Try to
>> reformulate your problem to go through memory more slowly but
>> with more done each pass, or invest in a cluster. At some point --
>> for some problems -- you can't beat more memory buses!
>>
>> Cheers,
>> Eric Borisch
>>
>> --
>> <a href="mailto:borisch.eric@mayo.edu" target="_blank">borisch.eric@mayo.edu</a> <mailto:<a href="mailto:borisch.eric@mayo.edu" target="_blank">borisch.eric@mayo.edu</a>>
>> MRI Research
>> Mayo Clinic
>>
>> On Mon, Jul 14, 2008 at 9:48 PM, Gus Correa
>> <gus@ldeo.columbia.edu> wrote:
>>
>> Hello Sami and list
>>
>> Oh, well, as you see, an expert who claims to know the
>> answers to these problems
>> seems not to be willing to share these answers with less
>> knowledgeable MPI users like us.
>> So, maybe we can find the answers ourselves, not by
>> individual "homework" brainstorming,
>> but through community collaboration and generous
>> information sharing,
>> which is the hallmark of this mailing list.
>>
>> I Googled around today to find out how to assign MPI
>> processes to specific processors,
>> and I found some interesting information on how to do it.
>>
>> Below is a link to a posting from the computational fluid
>> dynamics (CFD) community that may be of interest.
>> Not surprisingly, they are struggling with the same type
>> of problems all of us have,
>> including how to tie MPI processes to specific processors:
>>
>>
<a href="http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006" target="_blank">http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006</a>
>>
<<a href="http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006" target="_blank">http://openfoam.cfd-online.com/cgi-bin/forum/board-auth.cgi?file=/1/5949.html#POST18006</a>>
>>
>> I would summarize these problems as related to three types
>> of bottleneck:
>>
>> 1) Multicore processor bottlenecks (standalone machines
>> and clusters)
>> 2) Network fabric bottlenecks (clusters)
>> 3) File system bottlenecks (clusters)
>>
>> All three types of problems are due to contention for some
>> type of system resource by the MPI processes that take part
>> in a computation/program.
>>
>> Our focus in this thread, started by Zach, has been on
>> problem 1),
>> although most of us may need to look into problems 2) and
>> 3) sooner or later.
>> (I have all three of them already!)
>>
>> The CFD folks use MPI as we do.
>> They seem to use another MPI flavor, but the same problems
>> are there.
>> The problems are not caused by MPI itself, but they become
>> apparent when you run MPI programs.
>> That has been my experience too.
>>
>> As for how to map the MPI processes to specific processors
>> (or cores), the key command seems to be "taskset", as my
>> googling afternoon showed.
>> Try "man taskset" for more info.
>>
>> For a standalone machine like yours, something like the
>> command line below should work to force execution on
>> "processors" 0 and 2 (which in my case are two different
>> physical CPUs):
>>
>> mpiexec -n 2 taskset -c 0,2 my_mpi_program
>>
>> You need to check on your computer ("more /proc/cpuinfo")
>> what the exact "processor" numbers are that correspond to
>> separate physical CPUs. Most likely they are the even
>> numbered processors only, or the odd numbered only,
>> since you have dual-core CPUs (integers modulo 2), with
>> "processors" 0,1 being the two cores of the first physical
>> CPU, "processors" 2,3 the cores of the second physical CPU,
>> and so on.
>> At least, this is what I see on my dual-core
>> dual-processor machine.
>> I would say for quad-cores the separate physical CPUs
>> would be processors 0,4,8, etc,
>> or 1,5,9, etc, and so on (integers modulo 4), with
>> "processors" 0,1,2,3 being the four cores
>> in the first physical CPU, and so on.
>> In /proc/cpuinfo look for the keyword "processor".
>> These are the numbers you need to use in "taskset -c".
>> However, other helpful information comes in the keywords
>> "physical id", "core id", "siblings", and "cpu cores".
>> They will allow you to map cores and physical CPUs to
>> the "processor" number.
>>
>> The "taskset" command line above worked in one
of my
>> standalone multicore machines,
>> and I hope a variant of it will work on your machine also.
>> It works with the "mpiexec" that comes with the
MPICH
>> distribution, and also with
>> the "mpiexec" associated to the Torque/PBS batch
system,
>> which is nice for clusters as well.
>>
>> "Taskset" can change the default behavior of the
Linux
>> scheduler, which is to allow processes to
>> be moved from one core/CPU to another during execution.
>> The scheduler does this to ensure optimal CPU use (i.e.
>> load balance).
>> With taskset you can force execution to happen on the
>> cores you specify on the command line,
>> i.e. you can force the so called "CPU affinity"
you wish.
>> Note that the "taskset" man page uses both the
terms "CPU"
>> and "processor", and doesn't use the term
"core",
>> which may be a bit confusing. Make no mistake,
>> "processor" and "CPU" there stand for
what we've been
>> calling "core" here.
>>
>> Other postings that you may find useful on closely related
>> topics are:
>>
>> http://www.ibm.com/developerworks/linux/library/l-scheduler/
>>
>> http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
>>
>> I hope this helps,
>>
>> Still, we have a long way to go to sort out how much of
>> the multicore bottleneck can
>> be ascribed to lack of memory bandwidth, how much may
>> perhaps be associated with how
>> memcpy is compiled by different compilers,
>> or whether there are other components of this problem that we
>> don't see now.
>>
>> Maybe our community won't find a solution to Zach's
>> problem: "Why is my quad core slower than cluster?"
>> However, I hope that through collaboration, and by sharing
>> information,
>> we may be able to nail down the root of the problem,
>> and perhaps to find ways to improve the alarmingly bad
>> performance some of us have reported on multicore machines.
>>
>>
>> Gus Correa
>>
>> --
>> ---------------------------------------------------------------------
>> Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu
>> Lamont-Doherty Earth Observatory - Columbia University
>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>>
>>
>
>