Performance of MatMatSolve
David Fuentes
fuentesdt at gmail.com
Sun Mar 15 13:27:27 CDT 2009
On Sat, 14 Mar 2009, Hong Zhang wrote:
>
> David,
>
> Yes, MatMatSolve dominates. Can you also send us the output of
> '-log_summary' from superlu_dist?
>
> MUMPS only supports a centralized rhs vector b.
> Thus, in the petsc interface we must scatter the petsc distributed b into a
> sequential rhs vector (stored on the root proc), which explains why the root
> proc takes longer.
> I see that the numerical factorization and MatMatSolve are called
> 30 times.
> Do you iterate with a sequence similar to
>   for i = 0, 1, ...
>     B_i = X_(i-1)
>     Solve A_i * X_i = B_i
>
> i.e., the rhs B is based on the previously computed X?
Hong,
Yes, my sequence is similar to the algorithm above.
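Roughly, in code, the loop looks like the sketch below. This is not my actual
driver: the names (UpdateLoop, F, B, X, nsteps) are placeholders, the calls are
written with current petsc signatures, and I'm assuming the usual
MatGetFactor()/MatLUFactorSymbolic() setup was done once up front, with B and X
being parallel dense (MATDENSE) matrices of matching layout.

    #include <petscmat.h>

    /* Sketch of the update loop: each step reuses the previous solution as the rhs. */
    /* Assumes F was obtained once via MatGetFactor() + MatLUFactorSymbolic().       */
    PetscErrorCode UpdateLoop(Mat A, Mat F, Mat B, Mat X, PetscInt nsteps)
    {
      PetscErrorCode ierr;
      MatFactorInfo  info;
      PetscInt       i;

      PetscFunctionBegin;
      ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
      for (i = 0; i < nsteps; i++) {
        if (i > 0) {
          ierr = MatCopy(X, B, SAME_NONZERO_PATTERN);CHKERRQ(ierr); /* B_i = X_(i-1)    */
        }
        ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);       /* refactor A_i     */
        ierr = MatMatSolve(F, B, X);CHKERRQ(ierr);                  /* A_i * X_i = B_i  */
      }
      PetscFunctionReturn(0);
    }
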
The numbers I sent were from superlu. I'm seeing pretty similar
performance profiles between the two. Sorry, I tried to get a good
apples-to-apples comparison, but I'm getting seg faults as I increase
the # of processors w/ mumps, which is why mumps was run w/ only 24 procs and
superlu_dist w/ 40 procs.
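Also, re the centralized rhs: that would be consistent with the root-process
load I saw in top. If I understand right, the interface has to do something
along the lines of the scatter-to-root sketch below. This is illustration only,
not the actual petsc/MUMPS interface code; GatherRhsToRoot is a made-up helper
name and the calls use current petsc signatures.

    #include <petscvec.h>

    /* Sketch: gather a distributed rhs onto the root process, which is roughly what */
    /* the interface must do because MUMPS wants a centralized b.                    */
    PetscErrorCode GatherRhsToRoot(Vec b_dist, Vec *b_seq)
    {
      PetscErrorCode ierr;
      VecScatter     ctx;

      PetscFunctionBegin;
      /* b_seq gets the full length on rank 0 and length zero everywhere else */
      ierr = VecScatterCreateToZero(b_dist, &ctx, b_seq);CHKERRQ(ierr);
      ierr = VecScatterBegin(ctx, b_dist, *b_seq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterEnd(ctx, b_dist, *b_seq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
      ierr = VecScatterDestroy(&ctx);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }
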
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 3: State Update (superlu 40 processors)
VecCopy 135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecWAXPY 30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
VecScatterBegin 30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
VecScatterEnd 30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMultAdd 30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
MatSolve 135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
MatLUFactorSym 30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
MatConvert 150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 2 0 0 0 0 30 0
MatScale 60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
MatAssemblyBegin 180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 2 2 0 0 0 40 0
MatAssemblyEnd 180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 2 0 0 0 0 40 0
MatGetRow 4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMult 30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11100 15 97 2 11100 50100 40 12841
MatMatSolve 30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
--- Event Stage 3: State Update (mumps 24 processors)
VecWAXPY 30 1.0 3.5802e-04 2.0 6.00e+03 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 377
VecScatterBegin 270090 1.0 2.6040e+0121.1 0.00e+00 0.0 3.1e+06 2.3e+03 0.0e+00 0 0 97 6 0 0 0 99 6 0 0
VecScatterEnd 135060 1.0 3.7928e+0164.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMultAdd 30 1.0 4.5802e-01 2.3 5.40e+07 1.1 1.7e+04 1.5e+03 0.0e+00 0 0 1 0 0 0 0 1 0 0 2653
MatSolve 135030 1.0 6.4960e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.5e+02 81 0 96 6 0 86 0 99 6 7 0
MatLUFactorSym 30 1.0 1.0538e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 30 1.0 4.4708e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 6 0 0 0 0 6 0 0 0 9 0
MatConvert 150 1.0 4.7433e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 6.3e+02 0 0 0 0 0 0 0 0 0 30 0
MatScale 60 1.0 4.3342e-01 6.7 2.70e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1402
MatAssemblyBegin 180 1.0 8.4294e+01 5.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 1 0 0 0 0 1 0 0 0 12 0
MatAssemblyEnd 180 1.0 1.3100e-01 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 4.2e+02 0 0 0 0 0 0 0 0 0 20 0
MatGetRow 6000 1.1 3.6813e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMatMult 30 1.0 6.1625e+02 1.0 2.43e+11 1.1 1.7e+04 6.8e+06 5.1e+02 8100 1 91 0 8100 1 94 25 8872
MatMatSolve 30 1.0 6.4946e+03 1.0 0.00e+00 0.0 3.1e+06 2.3e+03 1.2e+02 81 0 96 6 0 86 0 99 6 6 0
------------------------------------------------------------------------------------------------------------------------
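For reference, switching between the two packages amounts to changing the
solver-package string handed to MatGetFactor; a minimal sketch of that is below
(FactorWithPackage is a made-up helper name, and the calls are written with
current petsc signatures, so the details may differ slightly for older versions).

    #include <petscmat.h>

    /* Sketch: factor A with either "mumps" or "superlu_dist"; everything downstream */
    /* (MatMatSolve etc.) is identical.                                               */
    PetscErrorCode FactorWithPackage(Mat A, const char *package, Mat *F)
    {
      PetscErrorCode ierr;
      MatFactorInfo  info;
      IS             rowperm, colperm;

      PetscFunctionBegin;
      ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
      /* external packages compute their own orderings; these ISes just satisfy the API */
      ierr = MatGetOrdering(A, "natural", &rowperm, &colperm);CHKERRQ(ierr);
      ierr = MatGetFactor(A, package, MAT_FACTOR_LU, F);CHKERRQ(ierr);
      ierr = MatLUFactorSymbolic(*F, A, rowperm, colperm, &info);CHKERRQ(ierr);
      ierr = MatLUFactorNumeric(*F, A, &info);CHKERRQ(ierr);
      ierr = ISDestroy(&rowperm);CHKERRQ(ierr);
      ierr = ISDestroy(&colperm);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

It would be called as FactorWithPackage(A, "mumps", &F) or
FactorWithPackage(A, "superlu_dist", &F).
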
> On Sat, 14 Mar 2009, David Fuentes wrote:
>
>> Thanks a lot Hong,
>>
>> The switch definitely seemed to balance the load during the SuperLU
>> MatMatSolve, although I'm not completely sure what I'm seeing. Changing
>> the #dof also seemed to affect the load balance of the MUMPS MatMatSolve.
>> I need to investigate a bit more.
>>
>> Looking at the profile, the majority of the time is spent in the MatSolve
>> calls made by MatMatSolve.
>>
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
>> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> VecCopy 135030 1.0 6.3319e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecWAXPY 30 1.0 1.6069e-04 1.9 4.32e+03 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 840
>> VecScatterBegin 30 1.0 7.6072e-03 1.5 0.00e+00 0.0 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 0
>> VecScatterEnd 30 1.0 9.1272e-02 6.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMultAdd 30 1.0 3.3028e-01 1.4 3.89e+07 1.7 4.7e+04 9.0e+02 0.0e+00 0 0 15 0 0 0 0 50 0 0 3679
>> MatSolve 135030 1.0 3.0340e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 81 0 0 0 0 0
>> MatLUFactorSym 30 1.0 2.2563e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatLUFactorNum 30 1.0 2.7990e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 7 0 0 0 0 7 0 0 0 0 0
>> MatConvert 150 1.0 2.9276e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+02 0 0 0 0 4 0 0 0 0 30 0
>> MatScale 60 1.0 2.7492e-01 1.9 1.94e+07 1.7 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2210
>> MatAssemblyBegin 180 1.0 1.1748e+02236.9 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 2 0 0 0 5 2 0 0 0 40 0
>> MatAssemblyEnd 180 1.0 1.9992e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.4e+02 0 0 0 0 5 0 0 0 0 40 0
>> MatGetRow 4320 1.7 2.2634e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatMatMult 30 1.0 4.2578e+02 1.0 1.75e+11 1.7 4.7e+04 4.0e+06 2.4e+02 11 100 15 97 5 11100 50100 40 12841
>> MatMatSolve 30 1.0 3.0256e+03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+01 77 0 0 0 1 81 0 0 0 10 0
>>
>>
>>
>> df
>>
>>
>>
>> On Fri, 13 Mar 2009, Hong Zhang wrote:
>>
>>> David,
>>>
>>> You may run with option '-log_summary <log_file>' and
>>> check which function dominates the time.
>>> I suspect the symbolic factorization, because it is
>>> implemented sequentially in mumps.
>>>
>>> If this is the case, you may switch to superlu_dist
>>> which supports parallel symbolic factorization
>>> in the latest release.
>>>
>>> Let us know what you get,
>>>
>>> Hong
>>>
>>> On Fri, 13 Mar 2009, David Fuentes wrote:
>>>
>>>>
>>>> The majority of time in my code is spent in MatMatSolve. I'm running
>>>> MatMatSolve in parallel using a MUMPS factored matrix.
>>>> Using top, I've noticed that during the MatMatSolve
>>>> the majority of the load seems to be on the root process.
>>>> Is this expected? Or do I most likely have a problem with the matrices
>>>> that I'm passing in?
>>>>
>>>>
>>>>
>>>> thank you,
>>>> David Fuentes
>>>>
>>>>
>>>
>>
>