Load Balancing and KSPSolve

Tue Nov 20 20:17:27 CST 2007

A couple of comments:

Looks like most of the time is spent in MatSolve(). [90% for np=1]

However, on the np=8 run, MatSolve() takes 42% of the time, whereas
VecAssemblyBegin() takes 32%. Depending upon what's being done
with VecSetValues()/VecAssemblyBegin()/VecAssemblyEnd(), you might be
able to reduce this time considerably. [If you can generate values
locally, then no communication is required. If you need to communicate
values, then you can explore VecScatter for more efficient communication.]
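
[To illustrate what "generate values locally" means: the sketch below is plain Python, not PETSc code. It assumes the vector uses a contiguous block layout in which any leftover entries go to the lowest ranks; the helper names ownership_range and split_local_remote are made up for this example. In an actual PETSc code the owned range would come from VecGetOwnershipRange(). Entries set with VecSetValues() for indices outside the owned range are the ones that have to be communicated during assembly.]

```python
def ownership_range(n_global, n_procs, rank):
    """Return the half-open range [start, end) of global indices owned by rank.

    Assumption: contiguous block layout where the first (n_global % n_procs)
    ranks each hold one extra entry. Real code should query the library.
    """
    base, extra = divmod(n_global, n_procs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def split_local_remote(indices, n_global, n_procs, rank):
    """Split global indices into those owned by rank and those owned elsewhere."""
    start, end = ownership_range(n_global, n_procs, rank)
    local = [i for i in indices if start <= i < end]
    remote = [i for i in indices if not (start <= i < end)]
    return local, remote


if __name__ == "__main__":
    # Sizes taken from this thread: a 4704-entry vector on 8 processes.
    n_global, n_procs = 4704, 8
    for rank in range(n_procs):
        print(rank, ownership_range(n_global, n_procs, rank))
```

[If every process only sets entries that fall in its own range, the "remote" list is empty and assembly has nothing to send.]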

Wrt MatSolve() on 8 procs, the max/min ratio of time across procs is
2.6. [i.e., the slowest proc is taking 16 sec, so the fastest proc would
probably be taking about 6 sec.] The max/min ratio of flops across procs
is 1.8. So there is indeed a load-balance issue that is contributing to
different times on different processors. [I guess the slowest proc is
doing almost twice the amount of work as the fastest proc.]
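
[The ratio arithmetic above, as a small Python sketch. The function names are made up for this example. The second part applies the same ratio to the per-process P=8 loop timings quoted further down in this thread; those were measured with cpu_time() around the whole loop, so they are a different measurement from the MatSolve() numbers in the log and the two ratios need not agree.]

```python
def imbalance_ratio(values):
    """Max/min ratio across processes; 1.0 means perfectly balanced."""
    return max(values) / min(values)


def fastest_from_ratio(slowest, ratio):
    """Estimate the fastest process time from the slowest time and the max/min ratio."""
    return slowest / ratio


if __name__ == "__main__":
    # MatSolve() on 8 procs: slowest proc about 16 sec, max/min ratio 2.6.
    print(round(fastest_from_ratio(16.0, 2.6), 2))

    # Per-process loop timings for P=8 as posted in this thread (seconds).
    p8 = [2.36, 4.23, 2.81, 2.54, 3.42, 2.22, 1.41, 3.15]
    print(round(imbalance_ratio(p8), 2))
```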

Satish

On Tue, 20 Nov 2007, Tim Stitt wrote:

> Satish,
> 
> Logs attached...hope they help.
> 
> Thanks,
> 
> Tim.
> 
> Satish Balay wrote:
> > Can you send the -log_summary for your runs [say p=1, p=8]
> > 
> > Satish
> > 
> > On Tue, 20 Nov 2007, Tim Stitt wrote:
> > 
> > > Hi all (again),
> > > 
> > > I finally got some data back from the KSP PETSc code that I put
> > > together to solve this sparse inverse matrix problem I was looking
> > > into. Ideally I am aiming for an O(N) (time complexity) approach to
> > > getting the first 'k' columns of the inverse of a sparse matrix.
> > > 
> > > To recap the method: I have my solver which uses KSPSolve in a loop that
> > > iterates over the first k columns of an identity matrix B and computes the
> > > corresponding x vector.
> > > 
> > > I am just a bit curious about some of the timings I am obtaining...which
> > > I hope someone can explain. Here are the timings I obtained for a global
> > > sparse matrix (4704 x 4704) and solving for the first 1176 columns in
> > > the identity using P processes (processors) on our cluster.
> > > 
> > > (Timings are given in seconds for each process performing work in the
> > > loop and were obtained by encapsulating the loop with the cpu_time()
> > > Fortran intrinsic. The MUMPS package was requested for
> > > factorisation/solving, although similar timings were obtained for both
> > > the native solver and SuperLU.)
> > > 
> > > P=1  [30.92]
> > > P=2  [15.47, 15.54]
> > > P=4  [4.68, 5.49, 4.67, 5.07]
> > > P=8  [2.36, 4.23, 2.81, 2.54, 3.42, 2.22, 1.41, 3.15]
> > > P=16 [1.04, 0.45, 1.08, 0.27, 0.87, 0.93, 1.1, 1.06, 0.29, 0.34, 0.73,
> > >       0.25, 0.43, 1.09, 1.08, 1.1]
> > > 
> > > Firstly, I notice very good scalability up to 16 processes...is this
> > > expected (by those people who use these solvers regularly)?
> > > 
> > > Also I notice that the timings per process vary as we scale up. Is this a
> > > load-balancing problem related to more non-zero values being on a given
> > > processor than others? Once again is this expected?
> > > 
> > > Please excuse my ignorance of matters relating to these solvers and their
> > > operation...as it really isn't my field of expertise.
> > > 
> > > Regards,
> > > 
> > > Tim.