Can PETSc detect the number of CPUs on each computer node?

Alex Peyser peyser.alex at gmail.com
Tue Jun 16 14:23:27 CDT 2009


On Tuesday 16 June 2009 02:29:14 pm Matthew Knepley wrote:
> On Tue, Jun 16, 2009 at 1:13 PM, Alex Peyser <peyser.alex at gmail.com> wrote:
> > On Tuesday 16 June 2009 01:53:35 pm Matthew Knepley wrote:
> > > On Tue, Jun 16, 2009 at 12:38 PM, xiaoyin ji
> > > <sapphire.jxy at gmail.com> wrote:
> > >
> > > Hi there,
> > >
> > > I'm using PETSc's MATMPIAIJ matrices and the KSP solvers. It seems that
> > > PETSc runs noticeably faster if I set the number of CPUs close to the
> > > number of compute nodes in the job file. By default an MPIAIJ matrix is
> > > distributed across the processes and the KSP solver communicates at
> > > every step; however, since several CPUs on each node share the same
> > > memory while KSP may still communicate through the network card, this
> > > can hurt performance. Is there any way to detect which CPUs are sharing
> > > the same memory? Thanks a lot.
> > >
> > > Best,
> > > Xiaoyin Ji
> > >
> > > The interface for this is mpirun or the job submission mechanism.
> > >
> > >    Matt
> > >
> > > --
> > > What most experimenters take for granted before they begin their
> > > experiments is infinitely more interesting than any results to which
> > > their experiments lead. -- Norbert Wiener
> >
> > I had a question about the best approach for this. Most of the time
> > is spent inside the BLAS, correct? So wouldn't you maximize your
> > operations by running one MPI/PETSc job per board (per shared memory)
> > and using a multi-threaded BLAS that matches your board? You should cut
> > down communication by a factor proportional to the number of threads
> > per board, and the BLAS itself should better optimize most of your
> > operations across the board, rather than relying on higher-order
> > parallelism.
>
> This is a common misconception. In fact, most of the time is spent in
> MatVec or BLAS1 operations, neither of which benefits from a
> multithreaded BLAS.
>
>   Matt
>
> > Regards,
> > Alex Peyser

Interesting. At least my misconception is common.
That makes things tricky with ATLAS, since its number of threads is a
compile-time constant. I can't imagine it would be a good idea to have an
8-thread BLAS running in 8 MPI processes simultaneously -- unless the MPI
jobs were all unsynchronized. It may be only 10-20% of the time, but that's
still a large overlap of conflicting threads degrading performance.

I'll have to do some benchmarks. Is the 10-20% number still true for fairly 
dense matrices?
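
A quick way to check is PETSc's built-in logging: run a solve with
-log_summary (called -log_view in newer releases) and compare the time in
MatMult against the BLAS1 vector kernels (VecAXPY, VecDot, VecNorm). A
minimal sketch, assuming a recent PETSc -- the 2009-era API differs
slightly (KSPSetOperators took an extra MatStructure argument, and
MatCreateVecs was MatGetVecs) -- with error checking omitted for brevity:

#include <petscksp.h>

/* Assemble a 1-D Laplacian (MATMPIAIJ by default when run on several
   processes) and solve with KSP. Run with -log_summary / -log_view and
   read off the fraction of time spent in MatMult vs. VecAXPY/VecDot to
   see how much a threaded BLAS could actually help. */
int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 1000000, Istart, Iend;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}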

Ah, another layer of administration code may now be required to allocate
jobs properly.
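
As a first cut at that layer, which ranks share a node can be detected
portably with nothing beyond MPI-1 calls: gather every rank's processor
name and count the matches, since ranks reporting the same name are on the
same node. A minimal sketch (the idea of using the count to throttle BLAS
threads is my assumption, not anything PETSc provides):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count how many MPI ranks report the same processor name as this one.
   Ranks with the same name sit on the same node and hence share memory;
   the count tells you how many BLAS threads per rank would oversubscribe
   the board. */
int main(int argc, char **argv)
{
  int  rank, size, len, i, onnode = 0;
  char name[MPI_MAX_PROCESSOR_NAME];
  char *all;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  memset(name, 0, sizeof(name));        /* zero-pad so strcmp is safe */
  MPI_Get_processor_name(name, &len);

  all = (char *) malloc((size_t) size * MPI_MAX_PROCESSOR_NAME);
  MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

  for (i = 0; i < size; i++)
    if (!strcmp(name, all + (size_t) i * MPI_MAX_PROCESSOR_NAME)) onnode++;

  printf("rank %d of %d on %s: %d rank(s) on this node\n",
         rank, size, name, onnode);

  free(all);
  MPI_Finalize();
  return 0;
}

With MPI_Comm_split keyed on a hash of the name, one could then build one
communicator per node and pick a single rank per node to own the threaded
BLAS.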

Alex