[petsc-dev] MPIX_Iallreduce()
Satish Balay
balay at mcs.anl.gov
Tue Mar 20 08:55:49 CDT 2012
On Tue, 20 Mar 2012, Jed Brown wrote:
> On Tue, Mar 20, 2012 at 00:06, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
> > On Sun, Mar 18, 2012 at 11:55, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >> At a glance, adding all these new complications to PETSc to chase an
> >> impossible overlap of communication and computation sounds fine :-)
> >
> >
> > /Q has a dedicated thread to drive asynchronous comm. I've added this; the
> > call to PetscCommSplitReductionBegin() is entirely optional (it does not alter
> > program semantics), but it will allow asynchronous progress to be made.
> >
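For context, a minimal sketch of how this might look from user code, assuming the
split VecDotBegin()/VecDotEnd() interface; the routine name and the local work are
placeholders, not part of the changeset:

    /* Sketch only: overlap a dot-product reduction with unrelated local work. */
    #include <petscvec.h>

    PetscErrorCode OverlappedDot(Vec x,Vec y,PetscScalar *dot)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = VecDotBegin(x,y,dot);CHKERRQ(ierr);
      /* Optional: start the reduction now so the MPI layer can make
         asynchronous progress while we do other work. */
      ierr = PetscCommSplitReductionBegin(((PetscObject)x)->comm);CHKERRQ(ierr);

      /* ... unrelated local computation / other communication here ... */

      ierr = VecDotEnd(x,y,dot);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
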
> > On conventional systems, there are two choices for driving asynchronous
> > progress:
> >
> > 1. Set the environment variable MPICH_ASYNC_PROGRESS=1. "Setting that
> > environment variable will cause a cheesy form of background progress
> > wherein the library will spawn an additional background thread per MPI
> > process. You'll have to play around with things, but I'd recommend cutting
> > your number of processes per node in half to avoid nemesis oversubscription
> > badness." -- Dave Goodell
> >
> > 2. Make any nontrivial calls into the MPI stack. This could specifically
> > mean polling a request (see the sketch below), but it could also just be
> > your usual communication. I suspect that a standard MatMult() and PCApply()
> > will be enough to drive a significant amount of progress on the reduction.
> >
> > http://petsc.cs.iit.edu/petsc/petsc-dev/rev/d2d98894cb5c
> >
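To illustrate option 2, here is a rough sketch (not taken from the changeset above)
of driving progress on an outstanding nonblocking reduction by polling its request.
MPIX_Iallreduce() is the pre-MPI-3 MPICH spelling of what became MPI_Iallreduce();
the version guard and the work loop are only for illustration:

    /* Sketch: start a nonblocking allreduce, then let periodic MPI_Test calls
       drive progress while local work proceeds.  With MPICH_ASYNC_PROGRESS=1
       the polling is only needed for completion, not for progress. */
    #include <mpi.h>

    static void overlapped_sum(double *local,double *global,int n,MPI_Comm comm)
    {
      MPI_Request req;
      int         done = 0;

    #if MPI_VERSION >= 3
      MPI_Iallreduce(local,global,n,MPI_DOUBLE,MPI_SUM,comm,&req);
    #else
      MPIX_Iallreduce(local,global,n,MPI_DOUBLE,MPI_SUM,comm,&req);
    #endif

      while (!done) {
        /* ... a slice of local computation or other communication ... */
        MPI_Test(&req,&done,MPI_STATUS_IGNORE); /* pokes the MPI progress engine */
      }
    }
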
>
> And here's some preliminary test output on 32 processes of cg.mcs. I ran a
> handful of times with src/vec/vec/examples/tests/ex42.c and these results
> seem reproducible.
>
> VecScatterBegin      300 1.0 8.5473e-04 1.5 0.00e+00 0.0 9.6e+03 4.0e+01 0.0e+00  2  0 98 99  0   2  0 98 99  0     0
> VecScatterEnd        300 1.0 1.1136e-02 7.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 18  0  0  0  0  18  0  0  0  0     0
> VecReduceArith       100 1.0 1.4210e-04 1.5 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0100  0  0  0   0100  0  0  0   203
> VecReduceComm        100 1.0 2.1405e-02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+02 60  0  0  0 89  60  0  0  0 90     0
>
> VecScatterBegin      300 1.0 5.9962e-04 1.9 0.00e+00 0.0 9.6e+03 4.0e+01 0.0e+00  5  0 98 99  0   5  0 98 99  0     0
> VecScatterEnd        300 1.0 1.6668e-03 4.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> VecReduceArith       100 1.0 1.1611e-04 2.5 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  1100  0  0  0   1100  0  0  0   248
> VecReduceBegin       100 1.0 5.3430e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecReduceEnd         100 1.0 2.4858e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 21  0  0  0  0  21  0  0  0  0     0
Jed,
are you pinning the MPI processes to specific cores for these tests? Does it
make a difference?

I'm curious since this machine has asymmetric cores with respect to L2/FPU.
Presumably using p0, p2, p4, etc. would spread out the load, but I don't know
whether the kernel is doing this automatically.
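(In case explicit pinning is needed, a rough Linux-only sketch using
sched_setaffinity(); the even-core mapping and the cores-per-node count are
placeholders, and mpiexec's own binding options are probably the simpler route.)

    /* Sketch: pin each rank to an even-numbered core (0,2,4,...) so that core
       pairs sharing an L2/FPU are not both loaded.  Assumes ranks are placed
       consecutively on the node. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <mpi.h>

    static void pin_to_even_core(MPI_Comm comm)
    {
      int       rank;
      cpu_set_t mask;

      MPI_Comm_rank(comm,&rank);
      CPU_ZERO(&mask);
      CPU_SET(2*(rank%16),&mask);               /* 16 = placeholder for cores-per-node/2 */
      sched_setaffinity(0,sizeof(mask),&mask);  /* 0 = calling process */
    }
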
Satish