Performance of MatMatSolve
David Fuentes
fuentesdt at gmail.com
Sun Mar 15 13:27:27 CDT 2009
On Sat, 14 Mar 2009, Hong Zhang wrote:
>
> David,
>
> Yes, MatMatSolve dominates. Can you also send us the output of
> '-log_summary' from superlu_dist?
>
> MUMPS only supports a centralized rhs vector b.
> Thus, in the petsc interface we must scatter the petsc distributed b into a
> sequential rhs vector (stored on the root proc), which explains why the root
> proc takes longer.
> I see that the numerical factorization and MatMatSolve are called
> 30 times.
> Do you iterate with a sequence similar to
>   for i = 0, 1, ...
>     B_i = X_(i-1)
>     Solve A_i * X_i = B_i
>
> i.e., the rhs B is based on the previously computed X?
Hong,
Yes, my sequence is similar to the algorithm above.
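Roughly, in code, the loop looks like the sketch below. This is not my actual
driver: the names (UpdateLoop, F, B, X, nsteps) are placeholders, the calls are
written with current petsc signatures, and I'm assuming the usual
MatGetFactor()/MatLUFactorSymbolic() setup was done once up front, with B and X
being parallel dense (MATDENSE) matrices of matching layout.

    #include <petscmat.h>

    /* Sketch of the update loop: each step reuses the previous solution as the rhs. */
    /* Assumes F was obtained once via MatGetFactor() + MatLUFactorSymbolic().       */
    PetscErrorCode UpdateLoop(Mat A, Mat F, Mat B, Mat X, PetscInt nsteps)
    {
      PetscErrorCode ierr;
      MatFactorInfo  info;
      PetscInt       i;

      PetscFunctionBegin;
      ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
      for (i = 0; i < nsteps; i++) {
        if (i > 0) {
          ierr = MatCopy(X, B, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* B_i = X_(i-1)    */
        }
        ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);       /* refactor A_i     */
        ierr = MatMatSolve(F, B, X);CHKERRQ(ierr);                  /* A_i * X_i = B_i  */
      }
      PetscFunctionReturn(0);
    }
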
The numbers I sent were from superlu. I'm seeing pretty similar
performance profiles between the two. Sorry, I tried to get a good
apples-to-apples comparison, but I'm getting seg faults as I increase
the # of processors w/ mumps, which is why mumps was run w/ only 24 procs and
superlu_dist w/ 40 procs.
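Also, re the centralized rhs: that would be consistent with the root-process
load I saw in top. If I understand right, the interface has to do something
along the lines of the scatter-to-root sketch below. This is illustration only,
not the actual petsc/MUMPS interface code; GatherRhsToRoot is a made-up helper
name and the calls use current petsc signatures.

    #include <petscvec.h>

    /* Sketch: gather a distributed rhs onto the root process, which is roughly what */
    /* the interface must do because MUMPS wants a centralized b.                    */
    PetscErrorCode GatherRhsToRoot(Vec b_dist, Vec *b_seq)
    {
      PetscErrorCode ierr;
      VecScatter     ctx;

      PetscFunctionBegin;
      /* b_seq gets the full length on rank 0 and length zero everywhere else */
      ierr = VecScatterCreateToZero(b_dist, &ctx, b_seq);CHKERRQ(ierr);
      ierr = VecScatterBegin(ctx, b_dist, *b_seq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, b_dist, *b_seq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
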
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 3: State Update (superlu 40 processors)
VecCopy 135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecWAXPY 30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
VecScatterBegin 30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
VecScatterEnd 30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMultAdd 30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
MatSolve 135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
MatLUFactorSym 30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
MatConvert 150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 2 0 0 0 0 30 0
MatScale 60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
MatAssemblyBegin 180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 2 2 0 0 0 40 0
MatAssemblyEnd 180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 2 0 0 0 0 40 0
MatGetRow 4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMult 30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11100 15 97 2 11100 50100 40 12841
MatMatSolve 30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
--- Event Stage 3: State Update (mumps 24 processors)
VecWAXPY 30 1.0 3.5802e-04 2.0 6.00e+03 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 377
VecScatterBegin 270090 1.0 2.6040e+0121.1 0.00e+00 0.0 3.1e+06 2.3e+03 0.0e+00 0 0 97 6 0 0 0 99 6 0 0
VecScatterEnd 135060 1.0 3.7928e+0164.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMultAdd 30 1.0 4.5802e-01 2.3 5.40e+07 1.1 1.7e+04 1.5e+03 0.0e+00 0 0 1 0 0 0 0 1 0 0 2653
MatSolve 135030 1.0 6.4960e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.5e+02 81 0 96 6 0 86 0 99 6 7 0
MatLUFactorSym 30 1.0 1.0538e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 30 1.0 4.4708e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 6 0 0 0 0 6 0 0 0 9 0
MatConvert 150 1.0 4.7433e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 6.3e+02 0 0 0 0 0 0 0 0 0 30 0
MatScale 60 1.0 4.3342e-01 6.7 2.70e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1402
MatAssemblyBegin 180 1.0 8.4294e+01 5.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 1 0 0 0 0 1 0 0 0 12 0
MatAssemblyEnd 180 1.0 1.3100e-01 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 4.2e+02 0 0 0 0 0 0 0 0 0 20 0
MatGetRow 6000 1.1 3.6813e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMult 30 1.0 6.1625e+02 1.0 2.43e+11 1.1 1.7e+04 6.8e+06 5.1e+02 8100 1 91 0 8100 1 94 25 8872
MatMatSolve 30 1.0 6.4946e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.2e+02 81 0 96 6 0 86 0 99 6 6 0
------------------------------------------------------------------------------------------------------------------------
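For reference, switching between the two packages amounts to changing the
solver-package string handed to MatGetFactor; a minimal sketch of that is below
(FactorWithPackage is a made-up helper name, and the calls are written with
current petsc signatures, so the details may differ slightly for older versions).

    #include <petscmat.h>

    /* Sketch: factor A with either "mumps" or "superlu_dist"; everything downstream */
    /* (MatMatSolve etc.) is identical.                                               */
    PetscErrorCode FactorWithPackage(Mat A, const char *package, Mat *F)
    {
      PetscErrorCode ierr;
      MatFactorInfo  info;
      IS             rowperm, colperm;

      PetscFunctionBegin;
      ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
      /* external packages compute their own orderings; these ISes just satisfy the API */
      ierr = MatGetOrdering(A, "natural", &rowperm, &colperm);CHKERRQ(ierr);
      ierr = MatGetFactor(A, package, MAT_FACTOR_LU, F);CHKERRQ(ierr);
      ierr = MatLUFactorSymbolic(*F, A, rowperm, colperm, &info);CHKERRQ(ierr);
      ierr = MatLUFactorNumeric(*F, A, &info);CHKERRQ(ierr);
      ierr = ISDestroy(&rowperm);CHKERRQ(ierr);
      ierr = ISDestroy(&colperm);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

It would be called as FactorWithPackage(A, "mumps", &F) or
FactorWithPackage(A, "superlu_dist", &F).
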
> On Sat, 14 Mar 2009, David Fuentes wrote:
>
>> Thanks a lot Hong,
>>
>> The switch definitely seemed to balance the load during the SuperLU
>> MatMatSolve, although I'm not completely sure what I'm seeing. Changing
>> the #dof also seemed to affect the load balance of the MUMPS MatMatSolve.
>> I need to investigate a bit more.
>>
>> Looking at the profile, the majority of the time is spent in the MatSolve
>> calls made by MatMatSolve.
>>
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
>> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> VecCopy 135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecWAXPY 30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
>> VecScatterBegin 30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
>> VecScatterEnd 30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMultAdd 30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
>> MatSolve 135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
>> MatLUFactorSym 30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatLUFactorNum 30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
>> MatConvert 150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 4 0 0 0 0 30 0
>> MatScale 60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
>> MatAssemblyBegin 180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 5 2 0 0 0 40 0
>> MatAssemblyEnd 180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 5 0 0 0 0 40 0
>> MatGetRow 4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMatMult 30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97 5 11100 50100 40 12841
>> MatMatSolve 30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
>>
>>
>>
>> df
>>
>>
>>
>> On Fri, 13 Mar 2009, Hong Zhang wrote:
>>
>>> David,
>>>
>>> You may run with option '-log_summary <log_file>' and
>>> check which function dominates the time.
>>> I suspect the symbolic factorization, because it is
>>> implemented sequentially in mumps.
>>>
>>> If this is the case, you may switch to superlu_dist
>>> which supports parallel symbolic factorization
>>> in the latest release.
>>>
>>> Let us know what you get,
>>>
>>> Hong
>>>
>>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>>
>>>>
>>>> The majority of time in my code is spent in MatMatSolve. I'm running
>>>> MatMatSolve in parallel using a MUMPS factored matrix.
>>>> Using top, I've noticed that during the MatMatSolve
>>>> the majority of the load seems to be on the root process.
>>>> Is this expected? Or do I most likely have a problem with the matrices
>>>> that I'm passing in?
>>>>
>>>>
>>>>
>>>> thank you,
>>>> David Fuentes
>>>>
>>>>
>>>
>>
>