Performance of MatMatSolve

Hong Zhang hzhang at mcs.anl.gov
Mon Mar 16 10:34:17 CDT 2009


David,

Superlu_dist seems slightly better.
Does MUMPS crash during numeric factorization due to a memory limitation?
You may try the option
'-mat_mumps_icntl_14 <num>' with num>20
(ICNTL(14): percentage of estimated workspace 
increase, default=20).
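For example, something along the lines of
  mpiexec -n 24 ./your_app -mat_mumps_icntl_14 40 -log_summary
(the launcher, the executable name and the value 40 are just placeholders)
would ask MUMPS for a 40% workspace increase instead of the default 20%.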
Run your code with '-help' to see all available
options.

From your output
> MatSolve          135030 1.0 3.0340e+03

i.e., you called MatMatSolve() 30 times
with a total of 135030 right-hand sides
(matrix B has 135030/30 = 4501 columns).
Although superlu_dist and mumps support
solves with multiple right-hand sides, the petsc interface
currently calls
MatSolve() in a loop over the columns of B, which could be accelerated
if petsc called superlu_dist's and mumps's multiple-rhs solves
directly.
I'll try to add this to the interface and let you know
when I'm done (it might take a while
because I'm tied up with other projects).
May I have your calling sequence of using MatMatSolve()?
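
For reference, the kind of factored-matrix calling sequence I have in
mind looks roughly like the sketch below (A is your parallel sparse
matrix, B and X are dense matrices with the same row layout as A; the
variable names are placeholders and setup/cleanup is omitted):

  #include "petscmat.h"

  Mat            F;             /* factored matrix from the external package */
  IS             rperm, cperm;  /* row/column orderings */
  MatFactorInfo  info;
  PetscErrorCode ierr;

  /* A, B, X are assumed to be created and assembled already */
  ierr = MatGetFactor(A, "mumps", MAT_FACTOR_LU, &F);CHKERRQ(ierr);  /* or "superlu_dist" */
  ierr = MatGetOrdering(A, "nd", &rperm, &cperm);CHKERRQ(ierr);
  ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
  ierr = MatLUFactorSymbolic(F, A, rperm, cperm, &info);CHKERRQ(ierr);
  ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);
  ierr = MatMatSolve(F, B, X);CHKERRQ(ierr);  /* currently solves column by column via MatSolve() */

Since your A_i changes at every outer iteration, the symbolic and
numeric factorizations have to be repeated each time, which matches the
30 calls to MatLUFactorSym and MatLUFactorNum in your log.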

To me,
the performance of superlu_dist and mumps is reasonable
under the current version of the petsc library.

Thanks for providing us the data,

Hong


On Sun, 15 Mar 2009, David Fuentes wrote:

> On Sat, 14 Mar 2009, Hong Zhang wrote:
>
>> 
>> David,
>> 
>> Yes, MatMatSolve dominates. Can you also send us the output of
>> '-log_summary' from superlu_dist?
>> 
>> MUMPS only supports a centralized rhs vector b.
>> Thus, in the petsc interface we must scatter the distributed petsc vector b
>> into a sequential rhs vector stored on the root process, which explains why
>> the root process takes a longer time.
>> I see that the numerical factorization and MatMatSolve are called
>> 30 times.
>> Do you iterate with the sequence similar to
>> for i=0,1, ...
>>   B_i = X_(i-1)
>>   Solve A_i * X_i = B_i
>> 
>> i.e., the rhs B is based on previously computed X?
>
> Hong,
>
> Yes, my sequence is similar to the algorithm above.
>
>
> The numbers I sent were from superlu. I'm seeing pretty similar
> performance profiles between the two. Sorry, I tried to get a good
> apples-to-apples comparison, but I'm getting seg faults as I increase
> the # of processors w/ mumps, which is why it was run w/ only 24 procs and
> superlu w/ 40 procs.
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops --- Global ---  --- Stage ---   Total
>                  Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 3: State Update (superlu 40 processors)
>
> VecCopy           135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecWAXPY              30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   840
> VecScatterBegin       30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0     0
> VecScatterEnd         30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMultAdd            30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00  0  0 15  0  0   0  0 50  0  0  3679
> MatSolve          135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  81  0  0  0  0     0
> MatLUFactorSym        30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum        30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
> MatConvert           150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  0  0  0  0  2   0  0  0  0 30     0
> MatScale              60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2210
> MatAssemblyBegin     180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  2  0  0  0  2   2  0  0  0 40     0
> MatAssemblyEnd       180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  0  0  0  0  2   0  0  0  0 40     0
> MatGetRow           4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMult            30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11100 15 97  2  11100 50100 40 12841
> MatMatSolve           30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77  0  0  0  1  81  0  0  0 10     0
>
> --- Event Stage 3: State Update (mumps 24 processors)
>
> VecWAXPY              30 1.0 3.5802e-04 2.0 6.00e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   377
> VecScatterBegin   270090 1.0 2.6040e+0121.1 0.00e+00 0.0 3.1e+06 2.3e+03 0.0e+00  0  0 97  6  0   0  0 99  6  0     0
> VecScatterEnd     135060 1.0 3.7928e+0164.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMultAdd            30 1.0 4.5802e-01 2.3 5.40e+07 1.1 1.7e+04 1.5e+03 0.0e+00  0  0  1  0  0   0  0  1  0  0  2653
> MatSolve          135030 1.0 6.4960e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.5e+02 81  0 96  6  0  86  0 99  6  7     0
> MatLUFactorSym        30 1.0 1.0538e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum        30 1.0 4.4708e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  6  0  0  0  0   6  0  0  0  9     0
> MatConvert           150 1.0 4.7433e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 6.3e+02  0  0  0  0  0   0  0  0  0 30     0
> MatScale              60 1.0 4.3342e-01 6.7 2.70e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1402
> MatAssemblyBegin     180 1.0 8.4294e+01 5.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  1  0  0  0  0   1  0  0  0 12     0
> MatAssemblyEnd       180 1.0 1.3100e-01 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 4.2e+02  0  0  0  0  0   0  0  0  0 20     0
> MatGetRow           6000 1.1 3.6813e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMult            30 1.0 6.1625e+02 1.0 2.43e+11 1.1 1.7e+04 6.8e+06 5.1e+02  8100  1 91  0   8100  1 94 25  8872
> MatMatSolve           30 1.0 6.4946e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.2e+02 81  0 96  6  0  86  0 99  6  6     0
> ------------------------------------------------------------------------------------------------------------------------
>
>
>
>
>
>
>> On Sat, 14 Mar 2009, David Fuentes wrote:
>> 
>>> Thanks a lot Hong,
>>> 
>>> The switch definitely seemed to balance the load during the SuperLU 
>>> matmatsolve.
>>> Although I'm not completely sure what I'm seeing. Changing the #dof
>>> also seemed to affect the load balance of the Mumps MatMatSolve.
>>> I need to investigate a bit more.
>>> 
>>> Looking at the profile, the majority of the time is spent in the
>>> MatSolve called by MatMatSolve.
>>> 
>>> 
>>>
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flops --- Global ---  --- Stage ---   Total
>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> VecCopy           135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0     0
>>> VecWAXPY              30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0   840
>>> VecScatterBegin       30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00  0   0 15  0  0   0  0 50  0  0     0
>>> VecScatterEnd         30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0     0
>>> MatMultAdd            30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00  0   0 15  0  0   0  0 50  0  0  3679
>>> MatSolve          135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78   0  0  0  0  81  0  0  0  0     0
>>> MatLUFactorSym        30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0     0
>>> MatLUFactorNum        30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7   0  0  0  0   7  0  0  0  0     0
>>> MatConvert           150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02  0   0  0  0  4   0  0  0  0 30     0
>>> MatScale              60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0  2210
>>> MatAssemblyBegin     180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2   0  0  0  5   2  0  0  0 40     0
>>> MatAssemblyEnd       180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02  0   0  0  0  5   0  0  0  0 40     0
>>> MatGetRow           4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0   0  0  0  0   0  0  0  0  0     0
>>> MatMatMult            30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97  5  11100 50100 40 12841
>>> MatMatSolve           30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77   0  0  0  1  81  0  0  0 10     0
>>> 
>>> 
>>> 
>>> df
>>> 
>>> 
>>> 
>>> On Fri, 13 Mar 2009, Hong Zhang wrote:
>>> 
>>>> David,
>>>> 
>>>> You may run with option '-log_summary <log_file>' and
>>>> check which function dominates the time.
>>>> I suspect the symbolic factorization, because it is
>>>> implemented sequentially in mumps.
>>>> 
>>>> If this is the case, you may switch to superlu_dist,
>>>> which supports parallel symbolic factorization
>>>> in the latest release.
>>>> 
>>>> Let us know what you get,
>>>> 
>>>> Hong
>>>> 
>>>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>>> 
>>>>> 
>>>>> The majority of time in my code is spent in the MatMatSolve. I'm running 
>>>>> MatMatSolve in parallel using Mumps as the factored matrix.
>>>>> Using top, I've noticed that during the MatMatSolve
>>>>> the majority of the load seems to be on the root process.
>>>>> Is this expected? Or do I most likely have a problem with the matrices 
>>>>> that I'm passing in?
>>>>> 
>>>>> 
>>>>> 
>>>>> thank you,
>>>>> David Fuentes
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>

