Load Balancing and KSPSolve

Satish Balay balay at mcs.anl.gov
Wed Nov 21 10:16:25 CST 2007


If you are just setting local values, then it's best to avoid the calls to
VecAssemblyBegin()/VecAssemblyEnd(). These contain calls to
MPI_Allreduce() - even though there might not be any communication.

[So with an MPI_Barrier time of 0.00820498 sec, 4704 calls to
MPI_Allreduce() - each costing roughly as much as a barrier - would add
up to many seconds. In this case it could account for most of the 12 sec
spent in VecAssemblyBegin().]
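
In round numbers (treating each Allreduce as roughly one barrier), that is

    4704 calls x 0.00820498 sec/call ~= 38 sec of worst-case synchronization

so even if the actual reductions are several times cheaper than a full
barrier, they can easily account for the 12 sec observed in
VecAssemblyBegin().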

Normally local assembly of a Vec is done by accessing the local vector
data directly and modifying the values:

VecGetArray(vec,&ptr);
ptr[i] = val;              /* for each local index 0 <= i < local size */
VecRestoreArray(vec,&ptr);

With Fortran 77, since pointer usage is not possible, there is a
workaround [check vec/vec/examples/tutorials/ex4f.F for
VecGetArray() usage from F77]. But with F90 you can use
VecGetArrayF90()/VecRestoreArrayF90() [as in ex4f90.F].
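
For reference, a minimal F90 fragment of that pattern (this is just a
sketch, not the actual ex4f90.F source; it assumes the usual PETSc Fortran
include files, and 'nlocal' comes from VecGetLocalSize()):

      PetscScalar, pointer :: xx(:)
      PetscErrorCode       :: ierr
      PetscInt             :: nlocal

      call VecGetLocalSize(vec,nlocal,ierr)
      call VecGetArrayF90(vec,xx,ierr)
      xx(1:nlocal) = 1.0d0            ! purely local writes - no communication
      call VecRestoreArrayF90(vec,xx,ierr)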

However in your case you might be able to continue using
VecSetValue(), by just commenting out the calls to
VecAssemblyBegin()/End(). [You might first want to run with -info, to
make sure there is no communication in VecAssembly.]
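
For instance, a sketch of the loop from your mail with the assembly calls
simply dropped (untested - it only works because each process inserts into
entries it owns):

   call VecGetOwnershipRange(B,firstElement,lastElement,error)

   do column=0,rhs-1   ! loop over the first 'rhs' columns of the identity

      ! set the locally-owned entry to 1
      if ((column.ge.firstElement).and.(column.lt.lastElement)) then
         call VecSetValue(B,column,one,INSERT_VALUES,error)
      end if

      ! no VecAssemblyBegin()/VecAssemblyEnd() here

      ! Solve Ax=b
      call KSPSolve(ksp,B,x,error)

      ! reset the entry to 0 for the next column
      if ((column.ge.firstElement).and.(column.lt.lastElement)) then
         call VecSetValue(B,column,zero,INSERT_VALUES,error)
      end if

   end do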

Satish

On Wed, 21 Nov 2007, Tim Stitt wrote:

> Satish,
> 
> Thanks for your helpful comments. I am unsure why the VecAssemblyBegin()
> routine is taking a high percentage of the wall-clock when modifications to
> the parallel vector should be local (all I am doing is working out which
> element in the RHS b vector should be 1 and setting it).
> 
> Here is my loop for iterating through the RHS Identity matrix and setting the
> relevant element to 1...prior to the call to KSPSolve. I then reset that value
> to 0 after the Solve in preparation for the next iteration.
> 
> ! Get vector index range per process
> call VecGetOwnershipRange(B,firstElement,lastElement,error);
> 
> do column=0,rhs-1   ! Loop over RHS columns in Identity Matrix
> 
>     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
>        call VecSetValue(B,column,one,INSERT_VALUES,error)
>     end if
> 
>     call VecAssemblyBegin(B,error)
>     call VecAssemblyEnd(B,error)
> 
>     ! Solve Ax=b
>     call KSPSolve(ksp,b,x,error);!CHKERRQ(error)
> 
>     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
>        call VecSetValue(B,column,zero,INSERT_VALUES,error)
>     end if
> 
>  end do
> 
> Can you identify if I am doing something stupid which could be compromising
> the efficiency of the Assembly routine?
> 
> Thanks again,
> 
> Tim.
> 
> Satish Balay wrote:
> > a couple of comments:
> > 
> > Looks like most of the time is spent in MatSolve(). [90% for np=1]
> > 
> > However on the np=8 run, you have MatSolve() taking 42% of the time,
> > whereas VecAssemblyBegin() takes 32%. Depending on what's being done
> > with VecSetValues()/VecAssembly() - you might be able to reduce this
> > time considerably. [ If you can generate values locally - then no
> > communication is required. If you need to communicate values - then
> > you can explore VecScatters() for more efficient communication]
> > 
> > Wrt MatSolve() on 8 procs, the max/min time ratio between any 2 procs is
> > 2.6 [i.e., the slowest proc is taking 16 sec, so the fastest proc is
> > probably taking about 6 sec]. The max/min ratio of flops across procs is
> > 1.8. So there is indeed a load-balance issue that is contributing to
> > different times on different processors [I guess the slowest proc is
> > doing almost twice the amount of work as the fastest proc].
> > 
> > Satish
> > 
> > On Tue, 20 Nov 2007, Tim Stitt wrote:
> > 
> >   
> > > Satish,
> > > 
> > > Logs attached...hope they help.
> > > 
> > > Thanks,
> > > 
> > > Tim.
> > > 
> > > Satish Balay wrote:
> > >     
> > > > Can you send the -log_summary for your runs [say p=1, p=8]
> > > > 
> > > > Satish
> > > > 
> > > > On Tue, 20 Nov 2007, Tim Stitt wrote:
> > > > 
> > > >         
> > > > > Hi all (again),
> > > > > 
> > > > > I finally got some data back from the KSP PETSc code that I put
> > > > > together to solve this sparse inverse matrix problem I was looking
> > > > > into. Ideally I am aiming for an O(N) (time complexity) approach to
> > > > > getting the first 'k' columns of the inverse of a sparse matrix.
> > > > > 
> > > > > To recap the method: I have my solver which uses KSPSolve in a
> > > > > loop that iterates over the first k columns of an identity matrix B
> > > > > and computes the corresponding x vector.
> > > > > 
> > > > > I am just a bit curious about some of the timings I am
> > > > > obtaining... which I hope someone can explain. Here are the timings
> > > > > I obtained for a global sparse matrix (4704 x 4704), solving for the
> > > > > first 1176 columns in the identity using P processes (processors) on
> > > > > our cluster.
> > > > > 
> > > > > (Timings are given in seconds for each process performing work in
> > > > > the loop and were obtained by encapsulating the loop with the
> > > > > cpu_time() Fortran intrinsic. The MUMPS package was requested for
> > > > > factorisation/solving, although similar timings were obtained for
> > > > > both the native solver and SUPERLU.)
> > > > > 
> > > > > P=1  [30.92]
> > > > > P=2  [15.47, 15.54]
> > > > > P=4  [4.68, 5.49, 4.67, 5.07]
> > > > > P=8  [2.36, 4.23, 2.81, 2.54, 3.42, 2.22, 1.41, 3.15]
> > > > > P=16 [1.04, 0.45, 1.08, 0.27, 0.87, 0.93, 1.1, 1.06, 0.29, 0.34,
> > > > >       0.73, 0.25, 0.43, 1.09, 1.08, 1.1]
> > > > > 
> > > > > Firstly, I notice very good scalability up to 16 processes... is
> > > > > this expected (by those people who use these solvers regularly)?
> > > > > 
> > > > > Also I notice that the timings per process vary as we scale up. Is
> > > > > this a load-balancing problem related to more non-zero values being
> > > > > on a given processor than others? Once again, is this expected?
> > > > > 
> > > > > Please excuse my ignorance of matters relating to these solvers
> > > > > and their operation... as it really isn't my field of expertise.
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Tim.
> > > > > 
> > > > > 