Performance of MatMatSolve
Hong Zhang
hzhang at mcs.anl.gov
Sat Mar 14 17:00:18 CDT 2009
David,
Yes, MatMatSolve dominates. Can you also send us the output of
'-log_summary' from superlu_dist?
MUMPS only supports a centralized rhs vector b.
Thus, in the petsc interface we must scatter the petsc distributed b into a
sequential rhs vector (stored on the root proc), which
explains why the root proc takes more time.
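Roughly, the gather the interface has to perform looks like the sketch
below (simplified, not the actual mumps interface source; the helper
name is made up):

  #include "petscvec.h"

  /* Gather a distributed rhs onto rank 0, roughly what the
     petsc-mumps interface must do before calling mumps.      */
  PetscErrorCode GatherRhsToRoot(Vec b,Vec *b_seq)
  {
    VecScatter     scat;
    PetscErrorCode ierr;

    ierr = VecScatterCreateToZero(b,&scat,b_seq);CHKERRQ(ierr);
    ierr = VecScatterBegin(scat,b,*b_seq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = VecScatterEnd(scat,b,*b_seq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = VecScatterDestroy(scat);CHKERRQ(ierr); /* takes &scat in newer petsc */
    return 0;
  }

Rank 0 then holds the entire rhs and hands it to mumps, which is why
the root process looks busier in 'top'.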
I see that the numerical factorization and MatMatSolve are called
30 times.
Do you iterate with a sequence similar to

  for i = 0, 1, ...
    B_i = X_{i-1}
    solve A_i * X_i = B_i

i.e., is the rhs B based on the previously computed X?
If this is the case, we should take the sequential output X
(mumps has this option) and feed it into the next iteration
without mpi scattering.
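Roughly, I mean something like this (an illustrative sketch only; the
variable names are made up, and the setup of A, B, X and of the factor F
via MatGetFactor()/MatLUFactorSymbolic() is assumed to be done already):

  Mat            A,B,X,F;
  MatFactorInfo  info;
  PetscInt       i,nsteps = 30;
  PetscErrorCode ierr;

  for (i=0; i<nsteps; i++) {
    /* update the entries of A for step i, then refactor */
    ierr = MatLUFactorNumeric(F,A,&info);CHKERRQ(ierr);
    /* X_i = inv(A_i) * B_i */
    ierr = MatMatSolve(F,B,X);CHKERRQ(ierr);
    /* reuse the solution as the next rhs: B_{i+1} = X_i */
    ierr = MatCopy(X,B,SAME_NONZERO_PATTERN);CHKERRQ(ierr);
  }

With a sequential X kept on the root process, the copy above would not
need any mpi scattering between iterations.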
Hong
On Sat, 14 Mar 2009, David Fuentes wrote:
> Thanks a lot Hong,
>
> The switch definitely seemed to balance the load during the SuperLU
> MatMatSolve, although I'm not completely sure what I'm seeing.
> Changing the #dof also seemed to affect the load balance of the MUMPS
> MatMatSolve. I need to investigate a bit more.
>
> Looking at the profile, the majority of the time is spent in the
> MatSolve called by MatMatSolve.
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                            --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> VecCopy           135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecWAXPY              30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   840
> VecScatterBegin       30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0     0
> VecScatterEnd         30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMultAdd            30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0  3679
> MatSolve          135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  81  0  0  0  0     0
> MatLUFactorSym        30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum        30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
> MatConvert           150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  0  0  0  0  4   0  0  0  0 30     0
> MatScale              60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2210
> MatAssemblyBegin     180 1.0 1.1748e+02 236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  2  0  0  0  5   2  0  0  0 40     0
> MatAssemblyEnd       180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  0  0  0  0  5   0  0  0  0 40     0
> MatGetRow           4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMult            30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97  5  11 100 50 100 40 12841
> MatMatSolve           30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77  0  0  0  1  81  0  0  0 10     0
>
>
>
> df
>
>
>
> On Fri, 13 Mar 2009, Hong Zhang wrote:
>
>> David,
>>
>> You may run with option '-log_summary <log_file>' and
>> check which function dominates the time.
>> I suspect the symbolic factorization, because it is
>> implemented sequentially in mumps.
>>
>> If this is the case, you may switch to superlu_dist,
>> which supports parallel symbolic factorization
>> in the latest release.
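>>
>> For example, something along these lines (the option and routine names
>> below are from memory and may differ between petsc versions):
>>
>>   /* in the code: request a superlu_dist factorization instead of mumps */
>>   ierr = MatGetFactor(A,"superlu_dist",MAT_FACTOR_LU,&F);CHKERRQ(ierr);
>>
>> and on the command line, e.g.
>>
>>   mpiexec -n 8 ./myapp -mat_superlu_dist_parsymbfact -log_summary log.txt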
>>
>> Let us know what you get,
>>
>> Hong
>>
>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>
>>>
>>> The majority of time in my code is spent in the MatMatSolve. I'm running
>>> MatMatSolve in parallel using Mumps as the factored matrix.
>>> Using top, I've noticed that during the MatMatSolve
>>> the majority of the load seems to be on the root process.
>>> Is this expected? Or do I most likely have a problem with the matrices
>>> that I'm passing in?
>>>
>>>
>>>
>>> thank you,
>>> David Fuentes
>>>
>>>
>>
>