[petsc-users] testing scalability for ksp/ex22.c (now ex45.c)

Fri Feb 24 21:36:46 CST 2012

   You need to understand, FOR THIS SPECIFIC MACHINE ARCHITECTURE, what happens when it goes from using one processor to two. Is the second processor another core that shares common memory with the first processor? Or is it a completely separate core on a different node that has its own memory? Consider

MatSOR                48 1.0 3.8607e+01 1.0 3.27e+09 1.0 0.0e+00 0.0e+00 0.0e+00 24 45  0  0  0  24 45  0  0  0         85
MatSOR                48 1.0 2.5600e+01 1.1 1.64e+09 1.0 4.8e+01 1.2e+05 4.8e+01 22 45 19 30  8  22 45 19 30  8   128

this computation across the two cores is embarrassingly parallel, hence the flop rate for two processes should be 170 (twice the one-process rate of 85), not 128 (the ratio is .75, very close to the .74 efficiency you get on two processes). So why is it not 170? The most likely answer is that the two cores share a common memory and that memory is not fast enough (does not have enough memory bandwidth) to serve both cores at the individual speed (85) that each of them can run.
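
As a rough check of the arithmetic above, the following Python sketch recomputes the ideal two-process flop rate, the ratio of measured to ideal rate from the two MatSOR lines, and the time-based efficiency from the one- and two-process timings (162 and 110) reported in the quoted message below. The function names are made up for illustration; only the numbers come from this thread.

```python
def flop_rate_ratio(rate_one_proc, rate_n_procs, n):
    """Measured aggregate flop rate divided by the ideal n * single-process rate."""
    ideal = n * rate_one_proc
    return rate_n_procs / ideal


def parallel_efficiency(t1, tn, n):
    """Strong-scaling efficiency: T1 / (n * Tn)."""
    return t1 / (n * tn)


if __name__ == "__main__":
    # Flop rates from the last column of the two MatSOR lines: 85 and 128.
    print(f"ideal two-process rate: {2 * 85}")
    print(f"flop rate ratio:        {flop_rate_ratio(85, 128, 2):.2f}")
    # Run times for n = 1 and n = 2 from the table in the quoted message.
    print(f"time-based efficiency:  {parallel_efficiency(162, 110, 2):.2f}")
```

The two ratios come out at about .75 and .74, which is why the MatSOR flop rate alone already accounts for most of the drop from one to two processes.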

This is the curse of virtually any shared memory system (that most people like to gloss over). Unless the memory bandwidth grows linearly with the number of cores you use, the performance cannot grow linearly with the number of cores. On almost no system with shared memory does the memory bandwidth grow that way.
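
To make the argument concrete, here is a toy model (an assumption for illustration, not a measurement): each core can sustain some rate on its own, but all cores sharing one memory together cannot exceed a fixed cap set by that memory's bandwidth. The cap of 128 used below is simply the two-process MatSOR rate from above used as a stand-in; the real limit on this machine is unknown, and the model says nothing about cores that have their own memory.

```python
def aggregate_rate(n_cores, per_core_rate, bandwidth_cap):
    """Toy model: the total rate is the ideal n * per_core_rate,
    clipped by a fixed cap imposed by the shared memory."""
    return min(n_cores * per_core_rate, bandwidth_cap)


def model_efficiency(n_cores, per_core_rate, bandwidth_cap):
    """Fraction of the ideal (linear) rate that the toy model delivers."""
    ideal = n_cores * per_core_rate
    return aggregate_rate(n_cores, per_core_rate, bandwidth_cap) / ideal


if __name__ == "__main__":
    # per_core_rate=85 is from the one-process MatSOR line;
    # bandwidth_cap=128 is a hypothetical stand-in (see the note above).
    for n in (1, 2, 4, 8):
        eff = model_efficiency(n, per_core_rate=85, bandwidth_cap=128)
        print(f"{n} cores sharing one memory: efficiency {eff:.2f}")
```

In this model the efficiency keeps falling as more cores are attached to the same memory, which is the point: with a fixed bandwidth, adding cores cannot give linear speedup.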

You need to find out, for this machine, how to direct the executable to be run so that each of the two processes runs on a different NODE of the system so they don't share memory; then you will see a number better than the .74 for the two processes.  But note that once you want to use all the cores on the system you will have cores sharing memory and your parallel efficiency will go down.  This is all material that should be presented in the first week of a parallel computing class and is crucial to understand if one plans to do "parallel computing".

   Barry

On Feb 24, 2012, at 9:20 PM, Francis Poulin wrote:

> Hello Barry,
> 
> Thanks for offering to look at this.
> 
> I configured a different version that does not have the debugger and the results are a lot faster but the efficiency is still similar.  Below you'll see the results.  
> 
> n		Tp 		Efficiency
> -------------------------------------
> 1		162		1
> 2		110		0.74
> 4		47		0.86
> 8		24		0.84
> 16		12		0.84
> 32		6		0.84
> 
> I'm also including the output of the log_summary for n =1 and 2.  
> 
> As I said before, this is an SMP machine so I would expect it to be better than the typical cluster.  That being said I am still learning how this works.
> 
> I am very happy that I managed to get some encouraging results in a day.  
> 
> Cheers, Francis
> 
> 
> <output_n1.txt>
> <output_n2.txt>
> 
> On 2012-02-24, at 9:08 PM, Jed Brown wrote:
> 
>> On Fri, Feb 24, 2012 at 14:43, Francis Poulin <fpoulin at uwaterloo.ca> wrote:
>> It seems to scale beautifully starting at 2 but there is a big drop from 1 to 2.  I suspect there's a very good reason for this, I just don't know what.
>> 
>> Can you send the full output? The smoother is slightly different, so the number of iterations could be different by 1 (for example). The -log_summary part of the output will show us where the time is being spent, so we'll be able to say what did not scale well between 1 and 2 procs.
>