general question on speed using quad core Xeons

Matthew Knepley knepley at gmail.com
Tue Apr 15 19:46:17 CDT 2008


On Tue, Apr 15, 2008 at 7:41 PM, Randall Mackie <rlmackie862 at gmail.com> wrote:
> Then what's the point of having 4 and 8 cores per CPU for parallel
>  computations? I mean, I think I've done all I can to make
>  my code as efficient as possible.

I really advise reading the paper. It explicitly treats the case of
blocking, and uses a simple model to demonstrate all the points I made.
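
To make the model concrete, here is a back-of-the-envelope version of it
for daxpy (y = y + a*x), which moves 24 bytes for every 2 flops. The chip
numbers below are hypothetical, era-typical quad-core Xeon figures, not
measurements; plug in your own:

    #include <stdio.h>

    int main(void)
    {
      /* Hypothetical quad-core Xeon figures; substitute your chip's specs. */
      double peak_gflops    = 4 * 2.33 * 4;  /* 4 cores x 2.33 GHz x 4 flops/cycle (SSE) */
      double bytes_per_flop = 12.0;          /* daxpy: 24 bytes moved per 2 flops */
      double needed_gbs     = peak_gflops * bytes_per_flop; /* GB/s to run at peak */
      double actual_gbs     = 10.6;          /* e.g. one 1333 MT/s FSB x 8 bytes */

      printf("bandwidth needed for peak: %7.1f GB/s\n", needed_gbs);
      printf("bandwidth available:       %7.1f GB/s\n", actual_gbs);
      printf("sustainable fraction:      %7.1f %%\n",
             100.0 * actual_gbs / needed_gbs);
      return 0;
    }

On numbers like these, the sustainable fraction of peak comes out to a few
percent. That is why all four cores can read 100% busy in top while the
wall clock barely moves: they are stalled waiting on memory, not computing.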

With a single, scalar sparse matrix, there is definitely no point at all
in having multiple cores: the sparse matrix-vector product is limited by
memory bandwidth, not by flops. However, multiple cores will speed up
compute-bound work like finite element integration. So, for instance,
making that integration dominate your cost (as spectral element codes do)
will show nice speedup. Ulrich Ruede has a great talk about this on his
website.
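
As a concrete instance of the "use blocks" suggestion in the quoted text
below: if your problem has several unknowns per grid point, PETSc's BAIJ
format stores one column index per bs x bs block instead of one per scalar
entry, which cuts the index traffic in the matrix-vector product. A minimal
sketch, assuming a current PETSc (older releases spell some of these calls
differently) and a hypothetical block size of 3:

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat      A;
      PetscInt bs = 3;    /* hypothetical: 3 unknowns per grid point */
      PetscInt n  = 300;  /* local rows; must be a multiple of bs */

      PetscInitialize(&argc, &argv, NULL, NULL);
      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetSizes(A, n, n, PETSC_DETERMINE, PETSC_DETERMINE);
      MatSetType(A, MATBAIJ);  /* blocked storage: one index per bs x bs block */
      /* rough preallocation: ~7 block nonzeros per block row (a guess) */
      MatSeqBAIJSetPreallocation(A, bs, 7, NULL);
      MatMPIBAIJSetPreallocation(A, bs, 7, NULL, 2, NULL);
      /* assemble with MatSetValuesBlocked(), then MatAssemblyBegin/End */
      MatDestroy(&A);
      PetscFinalize();
      return 0;
    }

The nonzero values are unchanged; the savings come entirely from moving
fewer indices through memory.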

  Matt

>  I'm not quite sure I understand your comment about using blocks
>  or unassembled structures.
>
>
>  Randy
>
>
>
>
>  Matthew Knepley wrote:
>
> > On Tue, Apr 15, 2008 at 7:19 PM, Randall Mackie <rlmackie862 at gmail.com> wrote:
> >
> > > I'm running my PETSc code on a cluster of quad-core Xeons connected
> > >  by InfiniBand. I hadn't much worried about the performance, because
> > >  everything seemed to be working quite well, but today I was actually
> > >  comparing performance (wall clock time) for the same problem on
> > >  different combinations of CPUs.
> > >
> > >  I find that my PETSc code is quite scalable until I start to use
> > >  multiple cores/cpu.
> > >
> > >  For example, the run time doesn't improve by going from 1 core/cpu
> > >  to 4 cores/cpu, and I find this very strange, especially since,
> > >  looking at top or Ganglia, all 4 CPUs on each node are running at
> > >  100% almost all of the time. I would have thought that if the CPUs
> > >  were going all out, I would still be getting much more scalable
> > >  results.
> > >
> >
> > Those are really coarse measures. There is absolutely no way that all
> > cores are doing useful arithmetic 100% of the time; it's easy to show
> > by hand. Take the peak flop rate: for a simple kernel like axpy, it
> > implies the memory bandwidth needed to sustain that computation, even
> > if everything else is perfect. You will find that the chip bandwidth
> > is far below this. A nice analysis is in
> >
> >  http://www.mcs.anl.gov/~kaushik/Papers/pcfd99_gkks.pdf
> >
> >
> > >  We are using mvapich-0.9.9 with InfiniBand, so I don't know if
> > >  this is a cluster/Xeon issue or something else.
> > >
> >
> > This is actually mathematics! How satisfying. The only way to improve
> > this is to change the data structure (e.g. use blocks) or change the
> > algorithm (e.g. use spectral elements and unassembled structures).
> >
> >  Matt
> >
> >
> > >  Anybody with experience on this?
> > >
> > >  Thanks, Randy M.
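
P.S. For the "unassembled structures" option above: with PETSc's MATSHELL
you supply your own matrix-vector product, so the Krylov solver never
streams an assembled sparse matrix through memory at all. A bare-bones
sketch; the mat-vec body here is a placeholder, not a real
element-by-element apply:

    #include <petscmat.h>

    /* Placeholder mat-vec: a real code would apply element matrices to x. */
    static PetscErrorCode MyMatMult(Mat A, Vec x, Vec y)
    {
      VecCopy(x, y);  /* stands in for y = A*x in this sketch */
      return 0;
    }

    int main(int argc, char **argv)
    {
      Mat      A;
      PetscInt n = 1000;  /* local size */

      PetscInitialize(&argc, &argv, NULL, NULL);
      MatCreateShell(PETSC_COMM_WORLD, n, n, PETSC_DETERMINE, PETSC_DETERMINE,
                     NULL, &A);
      MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyMatMult);
      /* pass A to KSPSetOperators(); Krylov methods only ever call MatMult */
      MatDestroy(&A);
      PetscFinalize();
      return 0;
    }

Spectral element codes take exactly this shape: the MatMult applies small
dense element matrices, which is flop-rich and much friendlier to multiple
cores than streaming a scalar sparse matrix.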



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener



