Performance of MatMatSolve
Hong Zhang
hzhang at mcs.anl.gov
Mon Mar 16 10:34:17 CDT 2009
David,
Superlu_dist seems slightly better.
Does mumps crash during numeric factorization due to a memory limitation?
You may try the option
'-mat_mumps_icntl_14 <num>' with num > 20
(ICNTL(14) is the percentage increase in the estimated workspace;
the default is 20).
Run your code with '-help' to see all available
options.
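For example, with a hypothetical executable name (keeping -log_summary
so the profiles can be compared):

  mpiexec -n 24 ./your_app -mat_mumps_icntl_14 50 -log_summary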
From your output
> MatSolve 135030 1.0 3.0340e+03
i.e., you called MatMatSolve() 30 times with a total of 135030 rhs
columns (each matrix B has 135030/30 = 4501 columns).
Although superlu_dist and mumps support solves with multiple rhs,
the petsc interface currently calls MatSolve() in a loop
over the columns of B. This could be accelerated
if petsc called superlu_dist/mumps's multiple-rhs solve
directly.
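Roughly speaking, the generic petsc path is equivalent to the following
sketch (conceptual only, not the actual petsc source; the helper name is
made up, current petsc function names are used, and sequential dense B
and X with leading dimension equal to the number of rows are assumed):

  #include <petscmat.h>

  /* Solve A*X = B one column at a time with the factored matrix F;
     this is why MatSolve shows up 135030 times in your log. */
  PetscErrorCode MatMatSolve_ColumnLoop(Mat F,Mat B,Mat X)
  {
    PetscErrorCode ierr;
    PetscInt       m,n,j;
    PetscScalar    *barray,*xarray;
    Vec            b,x;

    ierr = MatGetSize(B,&m,&n);CHKERRQ(ierr);
    ierr = MatDenseGetArray(B,&barray);CHKERRQ(ierr);
    ierr = MatDenseGetArray(X,&xarray);CHKERRQ(ierr);
    ierr = VecCreateSeqWithArray(PETSC_COMM_SELF,1,m,NULL,&b);CHKERRQ(ierr);
    ierr = VecCreateSeqWithArray(PETSC_COMM_SELF,1,m,NULL,&x);CHKERRQ(ierr);
    for (j=0; j<n; j++) {                     /* one triangular solve per rhs column */
      ierr = VecPlaceArray(b,barray+j*m);CHKERRQ(ierr);
      ierr = VecPlaceArray(x,xarray+j*m);CHKERRQ(ierr);
      ierr = MatSolve(F,b,x);CHKERRQ(ierr);
      ierr = VecResetArray(b);CHKERRQ(ierr);
      ierr = VecResetArray(x);CHKERRQ(ierr);
    }
    ierr = VecDestroy(&b);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = MatDenseRestoreArray(B,&barray);CHKERRQ(ierr);
    ierr = MatDenseRestoreArray(X,&xarray);CHKERRQ(ierr);
    return 0;
  }

Forwarding to the packages' own multiple-rhs solves would replace the n
separate forward/back substitutions (and, for mumps, the n separate
scatters of the rhs to the root process) with a single call.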
I'll try to add it to the interface and let you know
after I'm done (it might take a while
because I'm tied up with other projects).
May I have your calling sequence of using MatMatSolve()?
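In case it is useful for comparison, a typical sequence might look
roughly like the sketch below (a hypothetical helper, not your code;
identifiers follow the current petsc API, so some are spelled
differently in older releases; A is assumed AIJ, B and X dense):

  #include <petscmat.h>

  /* Refactor A and solve for all rhs columns at every step. */
  PetscErrorCode SolveSequence(Mat A,Mat B,Mat X,PetscInt nsteps)
  {
    PetscErrorCode ierr;
    Mat            F;
    MatFactorInfo  info;
    IS             rowperm,colperm;
    PetscInt       i;

    ierr = MatGetFactor(A,MATSOLVERMUMPS,MAT_FACTOR_LU,&F);CHKERRQ(ierr);
    ierr = MatGetOrdering(A,MATORDERINGNATURAL,&rowperm,&colperm);CHKERRQ(ierr);
    ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
    ierr = MatLUFactorSymbolic(F,A,rowperm,colperm,&info);CHKERRQ(ierr);
    for (i=0; i<nsteps; i++) {
      /* ... update the entries of A for step i here ... */
      ierr = MatLUFactorNumeric(F,A,&info);CHKERRQ(ierr);       /* MatLUFactorNum in the log */
      ierr = MatMatSolve(F,B,X);CHKERRQ(ierr);                  /* MatMatSolve in the log    */
      ierr = MatCopy(X,B,SAME_NONZERO_PATTERN);CHKERRQ(ierr);   /* B_{i+1} = X_i */
    }
    ierr = ISDestroy(&rowperm);CHKERRQ(ierr);
    ierr = ISDestroy(&colperm);CHKERRQ(ierr);
    ierr = MatDestroy(&F);CHKERRQ(ierr);
    return 0;
  }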
To me,
the performance of superlu_dist and mumps is reasonable
under the current version of the petsc library.
Thanks for providing us the data,
Hong
On Sun, 15 Mar 2009, David Fuentes wrote:
> On Sat, 14 Mar 2009, Hong Zhang wrote:
>
>>
>> David,
>>
>> Yes, MatMatSolve dominates. Can you also send us the output of
>> '-log_summary' from superlu_dist?
>>
>> MUMPS only supports a centralized rhs vector b.
>> Thus, in the petsc interface we must scatter the petsc distributed b into
>> a sequential rhs vector (stored on the root proc), which explains why the
>> root proc takes more time.
>> I see that the numerical factorization and MatMatSolve are called
>> 30 times.
>> Do you iterate with a sequence similar to
>>   for i = 0, 1, ...
>>     B_i = X_(i-1)
>>     solve A_i * X_i = B_i
>>
>> i.e., the rhs B is based on the previously computed X?
>
> Hong,
>
> Yes, my sequence is similar to the algorithm above.
>
>
> The numbers I sent were from superlu. I'm seeing pretty similar
> performance profiles between the two. Sorry, I tried to get a good
> apples-to-apples comparison, but I'm getting seg faults as I increase
> the # of processors w/ mumps, which is why it was run w/ only 24 procs
> and superlu w/ 40 procs.
>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 3: State Update (superlu 40 processors)
>
> VecCopy           135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecWAXPY              30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
> VecScatterBegin       30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
> VecScatterEnd         30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMultAdd            30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
> MatSolve          135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
> MatLUFactorSym        30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum        30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
> MatConvert           150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 2 0 0 0 0 30 0
> MatScale              60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
> MatAssemblyBegin     180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 2 2 0 0 0 40 0
> MatAssemblyEnd       180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 2 0 0 0 0 40 0
> MatGetRow           4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMatMult            30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11100 15 97 2 11100 50100 40 12841
> MatMatSolve           30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
>
> --- Event Stage 3: State Update (mumps 24 processors)
>
> VecWAXPY              30 1.0 3.5802e-04 2.0 6.00e+03 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 377
> VecScatterBegin   270090 1.0 2.6040e+0121.1 0.00e+00 0.0 3.1e+06 2.3e+03 0.0e+00 0 0 97 6 0 0 0 99 6 0 0
> VecScatterEnd     135060 1.0 3.7928e+0164.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMultAdd            30 1.0 4.5802e-01 2.3 5.40e+07 1.1 1.7e+04 1.5e+03 0.0e+00 0 0 1 0 0 0 0 1 0 0 2653
> MatSolve          135030 1.0 6.4960e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.5e+02 81 0 96 6 0 86 0 99 6 7 0
> MatLUFactorSym        30 1.0 1.0538e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLUFactorNum        30 1.0 4.4708e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 6 0 0 0 0 6 0 0 0 9 0
> MatConvert           150 1.0 4.7433e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 6.3e+02 0 0 0 0 0 0 0 0 0 30 0
> MatScale              60 1.0 4.3342e-01 6.7 2.70e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1402
> MatAssemblyBegin     180 1.0 8.4294e+01 5.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 1 0 0 0 0 1 0 0 0 12 0
> MatAssemblyEnd       180 1.0 1.3100e-01 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 4.2e+02 0 0 0 0 0 0 0 0 0 20 0
> MatGetRow           6000 1.1 3.6813e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMatMult            30 1.0 6.1625e+02 1.0 2.43e+11 1.1 1.7e+04 6.8e+06 5.1e+02 8100 1 91 0 8100 1 94 25 8872
> MatMatSolve           30 1.0 6.4946e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.2e+02 81 0 96 6 0 86 0 99 6 6 0
> ------------------------------------------------------------------------------------------------------------------------
>
>
>
>
>
>
>> On Sat, 14 Mar 2009, David Fuentes wrote:
>>
>>> Thanks a lot Hong,
>>>
>>> The switch definitely seemed to balance the load during the SuperLU
>>> MatMatSolve, although I'm not completely sure what I'm seeing. Changing
>>> the #dof also seemed to affect the load balance of the Mumps MatMatSolve.
>>> I need to investigate a bit more.
>>>
>>> Looking at the profile, the majority of the time is spent in the
>>> MatSolve called by the MatMatSolve.
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> VecCopy           135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecWAXPY              30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
>>> VecScatterBegin       30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
>>> VecScatterEnd         30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatMultAdd            30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
>>> MatSolve          135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
>>> MatLUFactorSym        30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatLUFactorNum        30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
>>> MatConvert           150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 4 0 0 0 0 30 0
>>> MatScale              60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
>>> MatAssemblyBegin     180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 5 2 0 0 0 40 0
>>> MatAssemblyEnd       180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 5 0 0 0 0 40 0
>>> MatGetRow           4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatMatMult            30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97 5 11100 50100 40 12841
>>> MatMatSolve           30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
>>>
>>>
>>>
>>> df
>>>
>>>
>>>
>>> On Fri, 13 Mar 2009, Hong Zhang wrote:
>>>
>>>> David,
>>>>
>>>> You may run with option '-log_summary <log_file>' and
>>>> check which function dominates the time.
>>>> I suspect the symbolic factorization, because it is
>>>> implemented sequentially in mumps.
>>>>
>>>> If this is the case, you may switch to superlu_dist,
>>>> which supports parallel symbolic factorization
>>>> in the latest release.
>>>>
>>>> Let us know what you get,
>>>>
>>>> Hong
>>>>
>>>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>>>
>>>>>
>>>>> The majority of the time in my code is spent in MatMatSolve. I'm running
>>>>> MatMatSolve in parallel using Mumps as the factored matrix.
>>>>> Using top, I've noticed that during the MatMatSolve
>>>>> the majority of the load seems to be on the root process.
>>>>> Is this expected? Or do I most likely have a problem with the matrices
>>>>> that I'm passing in?
>>>>>
>>>>>
>>>>>
>>>>> thank you,
>>>>> David Fuentes
>>>>>
>>>>>
>>>>
>>>
>>
>