Load Balancing and KSPSolve

Tim Stitt timothy.stitt at ichec.ie
Wed Nov 21 04:52:05 CST 2007


Satish,

Thanks for your helpful comments. I am unsure why the VecAssemblyBegin() 
routine is taking such a high percentage of the wall-clock time when the 
modifications to the parallel vector should be purely local (all I am doing 
is working out which element of the RHS vector b should be 1 and setting it).

Here is my loop for iterating over the columns of the RHS identity matrix 
and setting the relevant element to 1, prior to the call to KSPSolve. I 
then reset that value to 0 after the solve, in preparation for the next 
iteration.

! Get vector index range per process
call VecGetOwnershipRange(b,firstElement,lastElement,error)

do column=0,rhs-1   ! Loop over the columns of the RHS identity matrix

     ! Set the locally owned entry of this column to 1
     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
        call VecSetValue(b,column,one,INSERT_VALUES,error)
     end if

     call VecAssemblyBegin(b,error)
     call VecAssemblyEnd(b,error)

     ! Solve Ax=b
     call KSPSolve(ksp,b,x,error) !CHKERRQ(error)

     ! Reset the entry to 0 ready for the next column
     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
        call VecSetValue(b,column,zero,INSERT_VALUES,error)
     end if

  end do
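
For reference, a variant that would avoid the assembly calls inside the loop 
entirely is to write the entry straight into the locally owned part of b via 
VecGetArrayF90()/VecRestoreArrayF90(). This is only an untested sketch, 
assuming b is a regular parallel (MPI) vector already zeroed before the loop; 
barray is a hypothetical local pointer, declared as 
PetscScalar, pointer :: barray(:).

call VecSet(b,zero,error)   ! assumption: start from an all-zero RHS
call VecGetOwnershipRange(b,firstElement,lastElement,error)

do column=0,rhs-1

     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
        ! Direct access to the local array; no assembly needed afterwards
        call VecGetArrayF90(b,barray,error)
        barray(column-firstElement+1) = one   ! local index is 1-based
        call VecRestoreArrayF90(b,barray,error)
     end if

     call KSPSolve(ksp,b,x,error)

     if ((column.ge.firstElement).and.(column.lt.lastElement)) then
        call VecGetArrayF90(b,barray,error)
        barray(column-firstElement+1) = zero  ! reset for the next column
        call VecRestoreArrayF90(b,barray,error)
     end if

  end do

Since each process would only ever touch its own portion of b, there should 
be no communication at all in that part of the loop.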

Can you identify if I am doing something stupid which could be 
compromising the efficiency of the Assembly routine?

Thanks again,

Tim.

Satish Balay wrote:
> a couple of comments:
>
> Looks like most of the time is spent in MatSolve(). [90% for np=1]
>
> However on the np=8 run, you have MatSolve() taking 42% of the time, whereas
> VecAssemblyBegin() takes 32%. Depending upon what's being done
> with VecSetValues()/VecAssembly() - you might be able to reduce this
> time considerably. [ If you can generate values locally - then no
> communication is required. If you need to communicate values - then
> you can explore VecScatter() for more efficient communication]
>
> Wrt MatSolve() on 8 procs, the max/min ratio of time between any 2 procs is
> 2.6 [i.e. the slowest proc is taking 16 sec, so the fastest proc is
> probably taking about 6 sec]. The max/min ratio of flops across procs is
> 1.8. So there is indeed a load-balance issue that is contributing to
> the different times on different processors [I guess the slowest proc is
> doing almost twice the amount of work as the fastest proc].
>
> Satish
>
> On Tue, 20 Nov 2007, Tim Stitt wrote:
>
>   
>> Satish,
>>
>> Logs attached...hope they help.
>>
>> Thanks,
>>
>> Tim.
>>
>> Satish Balay wrote:
>>     
>>> Can you send the -log_summary for your runs [say p=1, p=8]
>>>
>>> Satish
>>>
>>> On Tue, 20 Nov 2007, Tim Stitt wrote:
>>>
>>>   
>>>       
>>>> Hi all (again),
>>>>
>>>> I finally got some data back from the KSP PETSc code that I put together
>>>> to solve this sparse inverse matrix problem I was looking into. Ideally I
>>>> am aiming for an O(N) (time complexity) approach to getting the first 'k'
>>>> columns of the inverse of a sparse matrix.
>>>>
>>>> To recap the method: I have my solver which uses KSPSolve in a loop that
>>>> iterates over the first k columns of an identity matrix B and computes the
>>>> corresponding x vector.
>>>>
>>>> I am just a bit curious about some of the timings I am obtaining...which
>>>> I hope someone can explain. Here are the timings I obtained for a global
>>>> sparse matrix (4704 x 4704), solving for the first 1176 columns of the
>>>> identity using P processes (processors) on our cluster.
>>>>
>>>> (Timings are given in seconds for each process performing work in the
>>>> loop and were obtained by encapsulating the loop with the cpu_time()
>>>> Fortran intrinsic. The MUMPS package was requested for
>>>> factorisation/solving, although similar timings were obtained for both
>>>> the native solver and SUPERLU.)
>>>>
>>>> P=1  [30.92]
>>>> P=2  [15.47, 15.54]
>>>> P=4  [4.68, 5.49, 4.67, 5.07]
>>>> P=8  [2.36, 4.23, 2.81, 2.54, 3.42, 2.22, 1.41, 3.15]
>>>> P=16 [1.04, 0.45, 1.08, 0.27, 0.87, 0.93, 1.1, 1.06, 0.29, 0.34, 0.73,
>>>>       0.25, 0.43, 1.09, 1.08, 1.1]
>>>>
>>>> Firstly, I notice very good scalability up to 16 processes...is this
>>>> expected (by those people who use these solvers regularly)?
>>>>
>>>> Also I notice that the timings per process vary as we scale up. Is this a
>>>> load-balancing problem, related to some processes holding more non-zero
>>>> values than others? Once again, is this expected?
>>>>
>>>> Please excuse my ignorance of matters relating to these solvers and their
>>>> operation...as it really isn't my field of expertise.
>>>>
>>>> Regards,
>>>>
>>>> Tim.
>>>>
>>>>
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>   


-- 
Dr. Timothy Stitt <timothy_dot_stitt_at_ichec.ie>
HPC Application Consultant - ICHEC (www.ichec.ie)

Dublin Institute for Advanced Studies
5 Merrion Square - Dublin 2 - Ireland

+353-1-6621333 (tel) / +353-1-6621477 (fax)



