[petsc-dev] MPIX_Iallreduce()
Satish Balay
balay at mcs.anl.gov
Tue Mar 20 08:55:49 CDT 2012
On Tue, 20 Mar 2012, Jed Brown wrote:
> On Tue, Mar 20, 2012 at 00:06, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
> > On Sun, Mar 18, 2012 at 11:55, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >> At a glance, adding all these new complications to PETSc to chase an
> >> impossible overlap of communication and computation sounds fine :-)
> >
> >
> > /Q has a dedicated thread to drive asynchronous comm. I've added this; the
> > call to PetscCommSplitReductionBegin() is entirely optional (it does not alter
> > program semantics), but it will allow asynchronous progress to be made.
> >
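For context, a minimal sketch of how this might look from user code, assuming the
split VecDotBegin()/VecDotEnd() interface; the routine name and the local work are
placeholders, not part of the changeset:

    /* Sketch only: overlap a dot-product reduction with unrelated local work. */
    #include <petscvec.h>

    PetscErrorCode OverlappedDot(Vec x,Vec y,PetscScalar *dot)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = VecDotBegin(x,y,dot);CHKERRQ(ierr);
      /* Optional: start the reduction now so the MPI layer can make
         asynchronous progress while we do other work. */
      ierr = PetscCommSplitReductionBegin(((PetscObject)x)->comm);CHKERRQ(ierr);

      /* ... unrelated local computation / other communication here ... */

      ierr = VecDotEnd(x,y,dot);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
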
> > On conventional systems, there are two choices for driving asynchronous
> > progress:
> >
> > 1. Set the environment variable MPICH_ASYNC_PROGRESS=1. "Setting that
> > environment variable will cause a cheesy form of background progress
> > wherein the library will spawn an additional background thread per MPI
> > process. You'll have to play around with things, but I'd recommend cutting
> > your number of processes per node in half to avoid nemesis oversubscription
> > badness." -- Dave Goodell
> >
> > 2. Make any nontrivial calls into the MPI stack. This could specifically
> > mean polling a request (see the sketch below), but it could also just be
> > your usual communication. I suspect that a standard MatMult() and PCApply()
> > will be enough to drive a significant amount of progress on the reduction.
> >
> > http://petsc.cs.iit.edu/petsc/petsc-dev/rev/d2d98894cb5c
> >
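To illustrate option 2, here is a rough sketch (not taken from the changeset above)
of driving progress on an outstanding nonblocking reduction by polling its request.
MPIX_Iallreduce() is the pre-MPI-3 MPICH spelling of what became MPI_Iallreduce();
the version guard and the work loop are only for illustration:

    /* Sketch: start a nonblocking allreduce, then let periodic MPI_Test calls
       drive progress while local work proceeds.  With MPICH_ASYNC_PROGRESS=1
       the polling is only needed for completion, not for progress. */
    #include <mpi.h>

    static void overlapped_sum(double *local,double *global,int n,MPI_Comm comm)
    {
      MPI_Request req;
      int         done = 0;

    #if MPI_VERSION >= 3
      MPI_Iallreduce(local,global,n,MPI_DOUBLE,MPI_SUM,comm,&req);
    #else
      MPIX_Iallreduce(local,global,n,MPI_DOUBLE,MPI_SUM,comm,&req);
    #endif

      while (!done) {
        /* ... a slice of local computation or other communication ... */
        MPI_Test(&req,&done,MPI_STATUS_IGNORE); /* pokes the MPI progress engine */
      }
    }
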
>
> And here's some preliminary test output on 32 processes of cg.mcs. I ran a
> handful of times with src/vec/vec/examples/tests/ex42.c and these results
> seem reproducible.
>
> VecScatterBegin      300 1.0 8.5473e-04 1.5 0.00e+00 0.0 9.6e+03 4.0e+01 0.0e+00  2  0 98 99  0   2  0 98 99  0     0
> VecScatterEnd        300 1.0 1.1136e-02 7.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 18  0  0  0  0  18  0  0  0  0     0
> VecReduceArith       100 1.0 1.4210e-04 1.5 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0100  0  0  0   0100  0  0  0   203
> VecReduceComm        100 1.0 2.1405e-02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+02 60  0  0  0 89  60  0  0  0 90     0
>
> VecScatterBegin      300 1.0 5.9962e-04 1.9 0.00e+00 0.0 9.6e+03 4.0e+01 0.0e+00  5  0 98 99  0   5  0 98 99  0     0
> VecScatterEnd        300 1.0 1.6668e-03 4.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> VecReduceArith       100 1.0 1.1611e-04 2.5 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  1100  0  0  0   1100  0  0  0   248
> VecReduceBegin       100 1.0 5.3430e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> VecReduceEnd         100 1.0 2.4858e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 21  0  0  0  0  21  0  0  0  0     0
Jed,
are you pinning the MPI processes to specific cores for these tests? Does it
make a difference?

I'm curious since this machine has asymmetric cores with respect to L2/FPU.
Presumably using p0, p2, p4, etc. would spread out the load, but I don't know
whether the kernel is doing this automatically.
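(In case explicit pinning is needed, a rough Linux-only sketch using
sched_setaffinity(); the even-core mapping and the cores-per-node count are
placeholders, and mpiexec's own binding options are probably the simpler route.)

    /* Sketch: pin each rank to an even-numbered core (0,2,4,...) so that core
       pairs sharing an L2/FPU are not both loaded.  Assumes ranks are placed
       consecutively on the node. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <mpi.h>

    static void pin_to_even_core(MPI_Comm comm)
    {
      int       rank;
      cpu_set_t mask;

      MPI_Comm_rank(comm,&rank);
      CPU_ZERO(&mask);
      CPU_SET(2*(rank%16),&mask);               /* 16 = placeholder for cores-per-node/2 */
      sched_setaffinity(0,sizeof(mask),&mask);  /* 0 = calling process */
    }
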
Satish