Slow speed after changing from serial to parallel (with ex2f.F)

Tue Apr 15 23:35:24 CDT 2008

Hi Satish, thank you very much for helping me run the ex2f.F code.

I think I have a clearer picture now. I believe I'm running on the 
Dual-Core Intel Xeon 5160 nodes. The quad-core nodes are only atlas3-01 
to 04, so there are only 4 of them. I guess the lower peak is because 
I'm using the Xeon 5160, while you are using the Xeon X5355.

You mentioned the speedups for MatMult and also compared KSPSolve. 
Are these the only things we have to look at? I see that some other 
events, such as VecMAXPY, also take up a sizable % of the time. To 
get an accurate speedup, do I just compare the time taken by KSPSolve 
for different numbers of processors, or do I have to look at other 
events such as MatMult as well?
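
To make the two ways of comparing runs concrete, here is a minimal Python sketch that reads the grep output from Satish's runs (quoted further down) and computes speedups both from the time column and from the trailing Mflop/s column. The helper names `parse_line` and `speedups` are made up for illustration, and the column positions follow my understanding of the log layout (count, count ratio, max time, time ratio, ..., total Mflop/s), so treat it as a sketch rather than a reference parser.

```python
# Grep output copied verbatim from Satish's runs quoted in this thread.
GREP_OUTPUT = """\
ex2f-600-1p.log:MatMult             1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
ex2f-600-2p.log:MatMult             1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   632
ex2f-600-4p.log:MatMult              969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100  0  15 11100100  0   724
ex2f-600-8p.log:MatMult             1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100  0  16 11100100  0   749
ex2f-600-1p.log:KSPSolve               1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   513
ex2f-600-2p.log:KSPSolve               1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100   824
ex2f-600-4p.log:KSPSolve               1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99  1024
ex2f-600-8p.log:KSPSolve               1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100  1081
"""


def parse_line(line):
    """Extract the leading columns and the trailing Mflop/s value of one row."""
    log_name, row = line.split(":", 1)
    tokens = row.split()
    return {
        "log": log_name,
        "event": tokens[0],
        "count": int(tokens[1]),
        "time": float(tokens[3]),        # max time over processes, in seconds
        "time_ratio": float(tokens[4]),  # assumed: max/min time over processes
        "mflops": float(tokens[-1]),     # last column: Mflop/s
    }


def speedups(event, key):
    """Speedups relative to the first (1-process) row for the given event."""
    rows = [r for r in map(parse_line, GREP_OUTPUT.splitlines())
            if r["event"] == event]
    base = rows[0]
    if key == "time":
        return [base["time"] / r["time"] for r in rows]
    return [r["mflops"] / base["mflops"] for r in rows]


for event in ("MatMult", "KSPSolve"):
    print(event, "by time:", [round(s, 2) for s in speedups(event, "time")])
    print(event, "by rate:", [round(s, 2) for s in speedups(event, "mflops")])
```

The MatMult speedups quoted below (1.59, 1.82, 1.88) correspond to ratios of the Mflop/s column, e.g. 632/397. The MatMult counts differ between runs (1192, 1217, 969, 1318), so the amount of work is not identical from run to run, and a ratio of times mixes that difference in with the parallel speedup.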

In summary, due to load imbalance, my speedup is quite bad. So maybe 
I'll just send your results to my school's engineer and see if they 
could do anything. For my part, I guess I'll just have to wait?

Thanks a lot!

Satish Balay wrote:
> On Wed, 16 Apr 2008, Ben Tay wrote:
>
>> I think you may be right. My school uses:
>>
>>   No. of nodes  Processors                  Qty per node  Total cores per node  Memory per node
>>   4             Quad-Core Intel Xeon X5355  2             8                     16 GB
>>   60            Dual-Core Intel Xeon 5160   2             4                     8 GB
>
>
> I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355
> machine [with gcc/ latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following:
>
> << Logs for my run are attached >>
>
> asterix:/home/balay/download-pine>grep MatMult *
> ex2f-600-1p.log:MatMult             1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
> ex2f-600-2p.log:MatMult             1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   632
> ex2f-600-4p.log:MatMult              969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100  0  15 11100100  0   724
> ex2f-600-8p.log:MatMult             1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100  0  16 11100100  0   749
> asterix:/home/balay/download-pine>grep KSPSolve *
> ex2f-600-1p.log:KSPSolve               1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   513
> ex2f-600-2p.log:KSPSolve               1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100   824
> ex2f-600-4p.log:KSPSolve               1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99  1024
> ex2f-600-8p.log:KSPSolve               1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100  1081
> asterix:/home/balay/download-pine>
>
>
> You get the following [with intel compilers?]:
>
> asterix:/home/balay/download-pine/x>grep MatMult *
> log.1:MatMult             1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11  0  0  0  13 11  0  0  0   239
> log.2:MatMult             1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100  0  11 11100100  0   315
> log.4:MatMult              969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00  8 11100100  0   8 11100100  0   321
> asterix:/home/balay/download-pine/x>grep KSPSolve *
> log.1:KSPSolve               1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   292
> log.2:KSPSolve               1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03 99100100100100  99100100100100   352
> log.4:KSPSolve               1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03 98100100100 99  98100100100 99   461
> asterix:/home/balay/download-pine/x>
>
> What exact CPU was this run on?
>
> A couple of comments:
> - my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 3.6 for 2,4 proc runs [so higher
>   load imbalance on your machine]
> - The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 for you
> - Speedups I see for MatMult are:
>
> np   me   you
>
> 2   1.59   1.32
> 4   1.82   1.34
> 8   1.88
>
> --------------------------
>
> The primary issue is - expecting speedup of 4, from 4-cores and 8 from 8-cores.
>
> As Matt indicated perhaps in "Subject: general question on speed using quad core Xeons" thread,
> for sparse linear algebra - the performance is limited by memory bandwidth - not CPU
>
> So one has to look at the hardware memory architecture of the machine
> if you expect scalability.
>
> The 2x quad-core has a memory architecture that gives 11GB/s if one
> CPU socket is used, but 22GB/s when both CPU sockets are used
> [irrespective of the number of cores in each CPU socket]. One
> inference is that a max speedup of 2 can be obtained from such a
> machine [due to its 2 memory bank architecture].
>
> So if you have 2 such machines [i.e. 4 memory banks] - then you can
> expect a theoretical max speedup of 4.
>
> We are generally used to evaluating performance/cpu [or core]. Here
> the scalability numbers suck.
>
> However if you do performance/number-of-memory-banks - then things look better.
>
> It's just that we are used to always expecting scalability per node and
> assume it translates to scalability per core. [however the scalability
> per node - was more about scalability per memory bank - before
> multicore cpus took over]
>
>
> There is also another measure - performance/dollar spent. Generally
> the extra cores are practically free - so here this measure also holds
> up ok.
>
> Satish
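
The per-core versus per-memory-bank argument above can be put into numbers with a small Python sketch. It uses the MatMult rates from Satish's 1-process and 8-process logs (397 and 749 Mflop/s). The function names are hypothetical, and the model assumes that a single process already saturates one memory bank and that the 8-process run uses both sockets of the 2x quad-core machine; both are simplifications.

```python
def max_speedup(nodes, banks_per_node=2):
    """Upper bound on speedup if throughput scales only with memory banks.

    Assumption: one process is enough to saturate a single memory bank.
    """
    return nodes * banks_per_node


def efficiency(rate, base_rate, units):
    """Achieved rate as a fraction of ideal scaling over `units` resources."""
    return rate / (base_rate * units)


MATMULT_1P = 397.0  # Mflop/s, 1 process (one memory bank in use)
MATMULT_8P = 749.0  # Mflop/s, 8 processes (all cores, so both banks in use)

per_core = efficiency(MATMULT_8P, MATMULT_1P, units=8)
per_bank = efficiency(MATMULT_8P, MATMULT_1P, units=2)

print("bound for 1 machine :", max_speedup(1))
print("bound for 2 machines:", max_speedup(2))
print("efficiency per core : %.2f" % per_core)
print("efficiency per bank : %.2f" % per_bank)
```

Under these assumptions the 8-process MatMult reaches roughly a quarter of ideal scaling when counted per core, but well over 90% when counted per memory bank, which is the point made above.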