profiling PETSc code

Barry Smith bsmith at mcs.anl.gov
Tue Aug 15 12:52:02 CDT 2006



> MatSolve           16000 1.0 1.1285e+02 1.4 1.50e+08 1.4 0.0e+00 0.0e+00
                                          ^^^^^
                                        balance

   Hmmm, I would guess that the matrix entries are not so well balanced?
One process takes 1.4 times as long for the triangular solves as the other,
so either one matrix has many more entries or one processor is slower than
the other.
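
   One quick way to test that guess is to compare the per-process entry
counts with MatGetInfo(). A minimal sketch, assuming the assembled parallel
matrix is called A (the variable name is hypothetical, and some of these
signatures have changed in later PETSc releases):

     MatInfo        info;
     PetscMPIInt    rank;
     PetscErrorCode ierr;

     ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
     /* MAT_LOCAL reports only the entries stored on this process */
     ierr = MatGetInfo(A,MAT_LOCAL,&info);CHKERRQ(ierr);
     ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
              "[%d] nonzeros used %g, mallocs during assembly %g\n",
              rank,info.nz_used,info.mallocs);CHKERRQ(ierr);
     ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD);CHKERRQ(ierr);

   If one process reports noticeably more nonzeros than the other, that by
itself would explain the 1.4 ratio in the MatSolve row.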

    Barry

On Tue, 15 Aug 2006, Matt Funk wrote:

> Hi Matt,
>
> Sorry for the delay since the last email, but there were some other things I
> needed to do.
>
> Anyway, I hope that maybe I can get some more help from you guys with respect
> to the load imbalance problem I have. Here is the situation:
> I run my code on 2 procs. I profile my KSPSolve call and here is what I get:
>
> ...
>
> --- Event Stage 2: Stage 2 of ChomboPetscInterface
>
> VecDot             20000 1.0 1.9575e+01 2.0 2.39e+08 2.0 0.0e+00 0.0e+00
> 2.0e+04  2  8  0  0 56   7  8  0  0 56   245
> VecNorm            16000 1.0 4.0559e+01 3.2 1.53e+08 3.2 0.0e+00 0.0e+00
> 1.6e+04  3  7  0  0 44  13  7  0  0 44    95
> VecCopy             4000 1.0 1.5148e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> VecSet             16000 1.0 3.1937e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> VecAXPY            16000 1.0 8.2395e+00 1.4 3.22e+08 1.4 0.0e+00 0.0e+00
> 0.0e+00  1  7  0  0  0   3  7  0  0  0   465
> VecAYPX             8000 1.0 3.6370e+00 1.3 3.46e+08 1.3 0.0e+00 0.0e+00
> 0.0e+00  0  3  0  0  0   2  3  0  0  0   527
> VecScatterBegin    12000 1.0 5.7721e-01 1.1 0.00e+00 0.0 2.4e+04 2.1e+04
> 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd      12000 1.0 1.4059e+01 9.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   4  0  0  0  0     0
> MatMult            12000 1.0 5.0567e+01 1.0 1.82e+08 1.0 2.4e+04 2.1e+04
> 0.0e+00  5 32100100  0  25 32100100  0   361
> MatSolve           16000 1.0 1.1285e+02 1.4 1.50e+08 1.4 0.0e+00 0.0e+00
> 0.0e+00 10 43  0  0  0  47 43  0  0  0   214
> MatLUFactorNum         1 1.0 1.9693e-02 1.0 5.79e+07 1.1 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   110
> MatILUFactorSym        1 1.0 7.8840e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.2250e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetup               1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve            4000 1.0 2.0419e+02 1.0 1.39e+08 1.0 2.4e+04 2.1e+04
> 3.6e+04 21100100100100 100100100100100   278
> PCSetUp                1 1.0 2.8828e-02 1.1 3.99e+07 1.1 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0    75
> PCSetUpOnBlocks     4000 1.0 3.5605e-02 1.1 3.35e+07 1.1 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0    61
> PCApply            16000 1.0 1.1661e+02 1.4 1.46e+08 1.4 0.0e+00 0.0e+00
> 0.0e+00 10 43  0  0  0  49 43  0  0  0   207
> ------------------------------------------------------------------------------------------------------------------------
>
> ...
>
>
> Some things to note are the following:
> I allocate my vector as:
> VecCreateMPI(PETSC_COMM_WORLD, //communicator
> 	       a_totallocal_numPoints[a_thisproc], //local points on this proc
> 	       a_totalglobal_numPoints, //total number of global points
> 	       &m_globalRHSVector); //the vector to be created
>
> where the vector a_totallocal_numPoints is:
> a_totallocal_numPoints: 59904 59904
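
   As a cross-check, the layout PETSc actually chose for that vector can be
printed from each process. A small sketch, reusing the (hypothetical) variable
names from the snippet above and omitting error checking:

     PetscInt    nlocal, rstart, rend;
     PetscMPIInt rank;

     MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
     VecGetLocalSize(m_globalRHSVector,&nlocal);
     VecGetOwnershipRange(m_globalRHSVector,&rstart,&rend);
     PetscSynchronizedPrintf(PETSC_COMM_WORLD,
              "[%d] local size %d, owns rows %d to %d\n",
              rank,(int)nlocal,(int)rstart,(int)rend);
     PetscSynchronizedFlush(PETSC_COMM_WORLD);

   With the 59904/59904 split above, both processes should report the same
local size.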
>
> The matrix is allocated as:
>  m_ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD, //communicator
> 			   a_totallocal_numPoints[a_thisproc], //total number of local rows (i.e., the rows residing on this proc)
> 			   a_totallocal_numPoints[a_thisproc], //total number of local columns (corresponding to the local part of the parallel vector)
> 			   a_totalglobal_numPoints, //number of global rows
> 			   a_totalglobal_numPoints, //number of global columns
> 			   PETSC_NULL,
> 			   a_NumberOfNZPointsInDiagonalMatrix,
> 			   PETSC_NULL,
> 			   a_NumberOfNZPointsInOffDiagonalMatrix,
> 			   &m_globalMatrix);
>
> With the -info option I checked, and there are no extra mallocs at all.
> My problem setup is symmetric, so it seems that everything is set up so that
> it should be essentially perfectly balanced. However, the numbers given above
> certainly do not reflect that.
>
> In all other parts of my code (except the PETSc calls), I get the
> expected, almost perfect load balance.
>
> Is there anything that I am overlooking? Any help is greatly appreciated.
>
> thanks
> mat
>
>
>
> On Wednesday 02 August 2006 16:21, Matt Funk wrote:
>> Hi Matt,
>>
>> It could be a bad load imbalance because I don't let PETSc decide. I need
>> to fix that anyway, so I think I'll try that first and then let you know.
>> Thanks, though, for the quick response and for helping me to interpret those
>> numbers ...
>>
>>
>> mat
>>
>> On Wednesday 02 August 2006 15:50, Matthew Knepley wrote:
>>> On 8/2/06, Matt Funk <mafunk at nmsu.edu> wrote:
>>>> Hi Matt,
>>>>
>>>> Thanks for all the help so far. The -info option is really very
>>>> helpful. So I think I straightened the actual errors out. However, now
>>>> I am back to the original question I had, that is, why it takes so much
>>>> longer on 4 procs than on 1 proc.
>>>
>>> So you have a 1.5 load imbalance for MatMult(), which probably cascades
>>> to give the 133! load imbalance for VecDot(). You probably have either:
>>>
>>>   1) VERY bad load imbalance
>>>
>>>   2) a screwed up network
>>>
>>>   3) bad contention on the network (loaded cluster)
>>>
>>> Can you help us narrow this down?
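
   One crude way to separate possibility 1) from 2) and 3): time how long each
process waits at a barrier placed just before the communication-heavy call.
This is only a sketch, not part of the original exchange (PetscGetTime() is
the timer in the PETSc of this era; newer releases call it PetscTime()):

     PetscLogDouble t0,t1;

     PetscGetTime(&t0);
     MPI_Barrier(PETSC_COMM_WORLD);
     PetscGetTime(&t1);
     PetscPrintf(PETSC_COMM_SELF,"barrier wait: %g s\n",t1-t0);

   A long wait on one process points at load imbalance in the work before the
barrier; long but roughly equal waits point at the network or contention.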
>>>
>>>
>>>    Matt
>>>
>>>> I profiled the KSPSolve(...) as stage 2:
>>>>
>>>> For 1 proc i have:
>>>> --- Event Stage 2: Stage 2 of ChomboPetscInterface
>>>>
>>>> VecDot              4000 1.0 4.9158e-02 1.0 4.74e+08 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  0 18  0  0  0   2 18  0  0  0   474
>>>> VecNorm             8000 1.0 2.1798e-01 1.0 2.14e+08 1.0 0.0e+00
>>>> 0.0e+00 4.0e+03  1 36  0  0 28   7 36  0  0 33   214
>>>> VecAYPX             4000 1.0 1.3449e-01 1.0 1.73e+08 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  0 18  0  0  0   5 18  0  0  0   173
>>>> MatMult             4000 1.0 3.6004e-01 1.0 3.24e+07 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  1  9  0  0  0  12  9  0  0  0    32
>>>> MatSolve            8000 1.0 1.0620e+00 1.0 2.19e+07 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  3 18  0  0  0  36 18  0  0  0    22
>>>> KSPSolve            4000 1.0 2.8338e+00 1.0 4.52e+07 1.0 0.0e+00
>>>> 0.0e+00 1.2e+04  7100  0  0 84  97100  0  0100    45
>>>> PCApply             8000 1.0 1.1133e+00 1.0 2.09e+07 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  3 18  0  0  0  38 18  0  0  0    21
>>>>
>>>>
>>>> for 4 procs i have :
>>>> --- Event Stage 2: Stage 2 of ChomboPetscInterface
>>>>
>>>> VecDot              4000 1.0 3.5884e+01133.7 2.17e+07133.7 0.0e+00
>>>> 0.0e+00 4.0e+03  8 18  0  0  5   9 18  0  0 14     1
>>>> VecNorm             8000 1.0 3.4986e-01 1.3 4.43e+07 1.3 0.0e+00
>>>> 0.0e+00 8.0e+03  0 36  0  0 10   0 36  0  0 29   133
>>>> VecSet              8000 1.0 3.5024e-02 1.4 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecAYPX             4000 1.0 5.6790e-02 1.3 1.28e+08 1.3 0.0e+00
>>>> 0.0e+00 0.0e+00  0 18  0  0  0   0 18  0  0  0   410
>>>> VecScatterBegin     4000 1.0 6.0042e+01 1.4 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 0.0e+00 38  0  0  0  0  45  0  0  0  0     0
>>>> VecScatterEnd       4000 1.0 5.9364e+01 1.4 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 0.0e+00 37  0  0  0  0  44  0  0  0  0     0
>>>> MatMult             4000 1.0 1.1959e+02 1.4 3.46e+04 1.4 0.0e+00
>>>> 0.0e+00 0.0e+00 75  9  0  0  0  89  9  0  0  0     0
>>>> MatSolve            8000 1.0 2.8150e-01 1.0 2.16e+07 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00  0 18  0  0  0   0 18  0  0  0    83
>>>> MatLUFactorNum         1 1.0 1.3685e-04 1.1 5.64e+06 1.1 0.0e+00
>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    21
>>>> MatILUFactorSym        1 1.0 2.3389e-04 1.2 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatGetOrdering         1 1.0 9.6083e-05 1.2 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSetup               1 1.0 2.1458e-06 2.2 0.00e+00 0.0 0.0e+00
>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSolve            4000 1.0 1.2200e+02 1.0 2.63e+05 1.0 0.0e+00
>>>> 0.0e+00 2.8e+04 84100  0  0 34 100100  0  0100     1
>>>> PCSetUp                1 1.0 5.0187e-04 1.2 1.68e+06 1.2 0.0e+00
>>>> 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     6
>>>> PCSetUpOnBlocks     4000 1.0 1.2104e-02 2.2 1.34e+05 2.2 0.0e+00
>>>> 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> PCApply             8000 1.0 8.4254e-01 1.2 8.27e+06 1.2 0.0e+00
>>>> 0.0e+00 8.0e+03  1 18  0  0 10   1 18  0  0 29    28
>>>> -----------------------------------------------------------------------
>>>> -- -----------------------------------------------
>>>>
>>>> Now if I understand it right, all these events summarize all calls made
>>>> between the push and pop commands. That would mean that the majority of
>>>> the time is spent in MatMult, and within that in the VecScatterBegin
>>>> and VecScatterEnd commands (if I understand it right).
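
   For reference, the stage-based profiling being described looks roughly like
the sketch below. The stage and solver variable names are hypothetical, and the
argument order of PetscLogStageRegister() differs between PETSc releases:

     PetscLogStage stage;

     PetscLogStageRegister("Stage 2 of ChomboPetscInterface",&stage);
     PetscLogStagePush(stage);
     KSPSolve(ksp,b,x);   /* everything called in here is summed into the stage */
     PetscLogStagePop();
     /* run with -log_summary to print the per-stage event table shown above */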
>>>>
>>>> My problem size is really small, so I was wondering whether the problem
>>>> lies in that (namely, that the majority of the time is simply spent
>>>> communicating between processors), or whether there is still something
>>>> wrong with how I wrote the code.
>>>>
>>>>
>>>> thanks
>>>> mat
>>>>
>>>> On Tuesday 01 August 2006 18:28, Matthew Knepley wrote:
>>>>> On 8/1/06, Matt Funk <mafunk at nmsu.edu> wrote:
>>>>>> Actually the errors occur on my calls to PETSc functions after
>>>>>> calling PetscInitialize.
>>>>>
>>>>> Yes, it is the error I pointed out in the last message.
>>>>>
>>>>>    Matt
>>>>>
>>>>>> mat
>
>



