general question on speed using quad core Xeons

Randall Mackie rlmackie862 at gmail.com
Tue Apr 15 21:03:15 CDT 2008


Okay, but if I'm stuck with a big 3D finite-difference code, written in PETSc
using Distributed Arrays with 3 dof per node, then you're saying there is
really nothing I can do, except use blocking, to improve things on quad-core
CPUs? The paper talks about blocking using the BAIJ format; is that the same
thing as creating MPIBAIJ matrices in PETSc? And will creating MPIBAIJ
matrices in PETSc make a substantial difference in speed?
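Just so I'm sure I understand, is something like the following what you mean
by blocking? This is only a rough sketch on my part (I may have the calling
sequence slightly wrong for the PETSc version I'm running): "da" stands for
my existing distributed array created with dof = 3, and AssembleBlocked is
just a placeholder name, not a routine from my actual program.

#include "petscda.h"
#include "petscmat.h"

/* Sketch only: ask the DA for a blocked (MPIBAIJ) matrix; with dof = 3
   this should give 3x3 blocks instead of a plain scalar AIJ matrix.    */
PetscErrorCode AssembleBlocked(DA da, Mat *A)
{
  PetscErrorCode ierr;

  ierr = DAGetMatrix(da, MATMPIBAIJ, A); CHKERRQ(ierr);

  /* ... then fill one 3x3 block per node pair with MatSetValuesBlocked(),
     indexing by node (block row/column) rather than by unknown, e.g.
     MatSetValuesBlocked(*A, 1, &brow, 1, &bcol, vals, INSERT_VALUES);    */

  ierr = MatAssemblyBegin(*A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  return 0;
}

If that is roughly right, I can try it and compare timings.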

I'm sorry if I'm being dense; I'm just trying to understand whether there is
some simple way I can utilize those extra cores on each CPU. Since I'm not a
computer scientist, some of these concepts are difficult.

Thanks, Randy

Matthew Knepley wrote:
> On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie <rlmackie862 at gmail.com> wrote:
>> Then what's the point of having 4 and 8 cores per CPU for parallel
>>  computations? I mean, I think I've done all I can to make my code
>>  as efficient as possible.
> 
> I really advise reading the paper. It explicitly treats the case of
> blocking and uses a simple model to demonstrate all the points I made.
> 
> With a single, scalar sparse matrix, there is definitely no point at all
> in having multiple cores. However, multiple cores will speed up things
> like finite element integration. So, for instance, making that integration
> dominate your cost (as spectral element codes do) will show nice speedup.
> Ulrich Ruede has a great talk about this on his website.
> 
>   Matt
> 
>>  I'm not quite sure I understand your comment about using blocks
>>  or unassembled structures.
>>
>>
>>  Randy
>>
>>  Matthew Knepley wrote:
>>
>>> On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie <rlmackie862 at gmail.com>
>>> wrote:
>>>> I'm running my PETSc code on a cluster of quad-core Xeons connected
>>>>  by InfiniBand. I hadn't worried much about performance, because
>>>>  everything seemed to be working quite well, but today I actually
>>>>  compared performance (wall clock time) for the same problem on
>>>>  different combinations of CPUs.
>>>>
>>>>  I find that my PETSc code is quite scalable until I start to use
>>>>  multiple cores per CPU.
>>>>
>>>>  For example, the run time doesn't improve in going from 1 core per CPU
>>>>  to 4 cores per CPU, which I find very strange, especially since,
>>>>  looking at top or Ganglia, all 4 cores on each node are running at
>>>>  100% almost all of the time. I would have thought that if the cores
>>>>  were going all out, I would be getting much more scalable results.
>>>>
>>> Those are really coarse measures. There is absolutely no way that all
>>> cores are going at 100%. It's easy to show by hand. Take the peak flop
>>> rate; this gives you the bandwidth needed to sustain that computation (if
>>> everything is perfect, like axpy). You will find that the chip bandwidth
>>> is far below this. A nice analysis is in
>>>
>>>  http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf
>>>
>>>
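(Working through that argument with made-up but plausible numbers for one of
these quad-core Xeon nodes, just to make sure I follow: peak is roughly
4 cores x 2.5 GHz x 4 flops/cycle = 40 Gflop/s; a perfect streaming kernel
like daxpy, y[i] = y[i] + a*x[i], does 2 flops while moving 3 doubles = 24
bytes, i.e. 12 bytes per flop; so running at peak would need about
40 Gflop/s x 12 bytes/flop = 480 GB/s of memory bandwidth, while the
front-side bus delivers something on the order of 10 GB/s per socket. So a
bandwidth-bound kernel like a sparse matrix-vector product can only sustain
a few percent of peak, no matter how "busy" the cores look.)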
>>>>  We are using mvapich-0.9.9 with InfiniBand. So I don't know whether
>>>>  this is a cluster/Xeon issue or something else.
>>>>
>>> This is actually mathematics! How satisfying. The only way to improve
>>> this is to change the data structure (e.g. use blocks) or change the
>>> algorithm (e.g. use spectral elements and unassembled structures).
>>>
>>>  Matt
>>>
>>>
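By "unassembled structures", is this the kind of thing you mean, i.e. a shell
matrix that applies the operator directly instead of storing assembled
entries? Again just a rough sketch on my part: ApplyOperator and
CreateShellOperator are placeholder names, and the apply routine below only
copies x to y where the real finite-difference stencil would go.

#include "petscmat.h"

/* Placeholder apply routine: a real version would apply the 3D
   finite-difference stencil to x and write the result into y,
   without ever storing assembled matrix entries.               */
PetscErrorCode ApplyOperator(Mat A, Vec x, Vec y)
{
  PetscErrorCode ierr;
  ierr = VecCopy(x, y); CHKERRQ(ierr); /* identity stands in for the stencil */
  return 0;
}

/* Wrap the apply routine in a shell Mat (mloc/M are the usual
   local/global sizes); the result can be handed to a KSP as the
   operator, though the preconditioner would still need either an
   assembled matrix or a PCSHELL of its own.                      */
PetscErrorCode CreateShellOperator(MPI_Comm comm, PetscInt mloc, PetscInt M, Mat *A)
{
  PetscErrorCode ierr;
  ierr = MatCreateShell(comm, mloc, mloc, M, M, PETSC_NULL, A); CHKERRQ(ierr);
  ierr = MatShellSetOperation(*A, MATOP_MULT, (void(*)(void))ApplyOperator); CHKERRQ(ierr);
  return 0;
}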
>>>>  Anybody with experience on this?
>>>>
>>>>  Thanks, Randy M.



