profiling PETSc code

Matt Funk mafunk at nmsu.edu
Tue Aug 15 16:51:52 CDT 2006


Is there a limit to how big an attachment can be?
The file is 1.3 MB. I tried to send it twice and neither email went through.
I also sent it directly to Barry's and Matthew's addresses; I hope it got
through.

mat


On Tuesday 15 August 2006 14:44, Matthew Knepley wrote:
> I don't think it matters initially since the problem is BIG imbalances.
>
>   Matt
>
> On 8/15/06, Matt Funk <mafunk at nmsu.edu> wrote:
> > Do you want me to use the debug version or the optimized version of
> > PETSc?
> >
> > mat
> >
> > On Tuesday 15 August 2006 13:56, Barry Smith wrote:
> > >    Please send the entire -info output as an attachment to me. (Not
> > > in the email) I'll study it in more detail.
> > >
> > >     Barry
> > >
> > > On Tue, 15 Aug 2006, Matt Funk wrote:
> > > > On Tuesday 15 August 2006 11:52, Barry Smith wrote:
> > > >>> MatSolve           16000 1.0 1.1285e+02 1.4 1.50e+08 1.4 0.0e+00
> > > >>> 0.0e+00
> > > >>
> > > >>                                           ^^^^^
> > > >>                                         balance
> > > >>
> > > >>    Hmmm, I would guess that the matrix entries are not so well
> > > >> balanced? One process takes 1.4 times as long for the triangular
> > > >> solves as the other, so either one matrix has many more entries or
> > > >> one processor is slower than the other.
> > > >>
> > > >>     Barry
> > > >
> > > > Well, it would seem that way at first, but I don't know how that could
> > > > be, since I allocate exactly the same number of points on both
> > > > processors (see previous email).
> > > > Further, I used the -mat_view_info option. Here is what it gives me:
> > > >
> > > > ...
> > > > Matrix Object:
> > > >  type=mpiaij, rows=119808, cols=119808
> > > > [1] PetscCommDuplicateUsing internal PETSc communicator 91 168
> > > > [1] PetscCommDuplicateUsing internal PETSc communicator 91 168
> > > > ...
> > > >
> > > > ...
> > > > Matrix Object:
> > > >  type=seqaij, rows=59904, cols=59904
> > > >  total: nonzeros=407400, allocated nonzeros=407400
> > > >    not using I-node routines
> > > > Matrix Object:
> > > >  type=seqaij, rows=59904, cols=59904
> > > >  total: nonzeros=407400, allocated nonzeros=407400
> > > >    not using I-node routines
> > > > ...
> > > >
> > > > So to me it looks well split up. Is there anything else that
> > > > somebody can think of? The machine I am running on has identical
> > > > processors.
> > > >
> > > > By the way, I am not sure whether the '[1] PetscCommDuplicateUsing
> > > > internal PETSc communicator 91 168' is something I need to worry
> > > > about?
> > > >
> > > > mat
> > > >
> > > >> On Tue, 15 Aug 2006, Matt Funk wrote:
> > > >>> Hi Matt,
> > > >>>
> > > >>> sorry for the delay since the last email, but there were some other
> > > >>> things I needed to do.
> > > >>>
> > > >>> Anyway, I hope that I can get some more help from you guys
> > > >>> with respect to the load-imbalance problem I have. Here is the
> > > >>> situation: I run my code on 2 procs, I profile my KSPSolve call,
> > > >>> and here is what I get:
> > > >>>
> > > >>> ...
> > > >>>
> > > >>> --- Event Stage 2: Stage 2 of ChomboPetscInterface
> > > >>>
> > > >>> VecDot             20000 1.0 1.9575e+01 2.0 2.39e+08 2.0 0.0e+00
> > > >>> 0.0e+00 2.0e+04  2  8  0  0 56   7  8  0  0 56   245
> > > >>> VecNorm            16000 1.0 4.0559e+01 3.2 1.53e+08 3.2 0.0e+00
> > > >>> 0.0e+00 1.6e+04  3  7  0  0 44  13  7  0  0 44    95
> > > >>> VecCopy             4000 1.0 1.5148e+00 1.4 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> > > >>> VecSet             16000 1.0 3.1937e+00 1.8 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> > > >>> VecAXPY            16000 1.0 8.2395e+00 1.4 3.22e+08 1.4 0.0e+00
> > > >>> 0.0e+00 0.0e+00  1  7  0  0  0   3  7  0  0  0   465
> > > >>> VecAYPX             8000 1.0 3.6370e+00 1.3 3.46e+08 1.3 0.0e+00
> > > >>> 0.0e+00 0.0e+00  0  3  0  0  0   2  3  0  0  0   527
> > > >>> VecScatterBegin    12000 1.0 5.7721e-01 1.1 0.00e+00 0.0 2.4e+04
> > > >>> 2.1e+04 0.0e+00  0  0100100  0   0  0100100  0     0
> > > >>> VecScatterEnd      12000 1.0 1.4059e+01 9.2 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 0.0e+00  1  0  0  0  0   4  0  0  0  0     0
> > > >>> MatMult            12000 1.0 5.0567e+01 1.0 1.82e+08 1.0 2.4e+04
> > > >>> 2.1e+04 0.0e+00  5 32100100  0  25 32100100  0   361
> > > >>> MatSolve           16000 1.0 1.1285e+02 1.4 1.50e+08 1.4 0.0e+00
> > > >>> 0.0e+00 0.0e+00 10 43  0  0  0  47 43  0  0  0   214
> > > >>> MatLUFactorNum         1 1.0 1.9693e-02 1.0 5.79e+07 1.1 0.0e+00
> > > >>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   110
> > > >>> MatILUFactorSym        1 1.0 7.8840e-03 1.1 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>> MatGetOrdering         1 1.0 1.2250e-03 1.1 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>> KSPSetup               1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00
> > > >>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>> KSPSolve            4000 1.0 2.0419e+02 1.0 1.39e+08 1.0 2.4e+04
> > > >>> 2.1e+04 3.6e+04 21100100100100 100100100100100   278
> > > >>> PCSetUp                1 1.0 2.8828e-02 1.1 3.99e+07 1.1 0.0e+00
> > > >>> 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    75
> > > >>> PCSetUpOnBlocks     4000 1.0 3.5605e-02 1.1 3.35e+07 1.1 0.0e+00
> > > >>> 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    61
> > > >>> PCApply            16000 1.0 1.1661e+02 1.4 1.46e+08 1.4 0.0e+00
> > > >>> 0.0e+00 0.0e+00 10 43  0  0  0  49 43  0  0  0   207
> > > >>> ------------------------------------------------------------------------------------------------------------------------
> > > >>>
> > > >>> ...
> > > >>>
> > > >>>
> > > >>> Some things to note are the following:
> > > >>> I allocate my vector as:
> > > >>> VecCreateMPI(PETSC_COMM_WORLD,                     // communicator
> > > >>>              a_totallocal_numPoints[a_thisproc],   // local points on this proc
> > > >>>              a_totalglobal_numPoints,              // total number of global points
> > > >>>              &m_globalRHSVector);                  // the vector to be created
> > > >>>
> > > >>> where the array a_totallocal_numPoints is:
> > > >>> a_totallocal_numPoints: 59904 59904
> > > >>>
> > > >>> The matrix is allocated as:
> > > >>>  m_ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD,        // communicator
> > > >>>               a_totallocal_numPoints[a_thisproc],   // number of local rows (rows residing on this proc)
> > > >>>               a_totallocal_numPoints[a_thisproc],   // number of columns corresponding to the local part of the parallel vector
> > > >>>               a_totalglobal_numPoints,              // number of global rows
> > > >>>               a_totalglobal_numPoints,              // number of global columns
> > > >>>               PETSC_NULL,                           // d_nz (unused, d_nnz array given instead)
> > > >>>               a_NumberOfNZPointsInDiagonalMatrix,   // d_nnz: nonzeros per row in the diagonal block
> > > >>>               PETSC_NULL,                           // o_nz (unused, o_nnz array given instead)
> > > >>>               a_NumberOfNZPointsInOffDiagonalMatrix,// o_nnz: nonzeros per row in the off-diagonal block
> > > >>>               &m_globalMatrix);                     // the matrix to be created
> > > >>>
> > > >>> With the -info option I checked that there are no extra mallocs at
> > > >>> all. My problem setup is symmetric, so it seems that everything is
> > > >>> set up to be essentially perfectly balanced.
> > > >>> However, the numbers given above certainly do not reflect that.
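> > > >>>
> > > >>> (For what it's worth, here is a minimal sketch, not from my actual code and
> > > >>> with a made-up helper name, of how one could print each rank's row range and
> > > >>> nonzero counts to double-check the balance: MatGetOwnershipRange() and
> > > >>> MatGetInfo() with MAT_LOCAL report what each process actually owns. The
> > > >>> one-argument PetscSynchronizedFlush() matches PETSc releases of this
> > > >>> vintage; newer releases take an additional FILE* argument.)
> > > >>>
> > > >>> #include "petscmat.h"
> > > >>>
> > > >>> /* Sketch only: report each rank's row range and local nonzero counts
> > > >>>    for an assembled parallel matrix such as m_globalMatrix above. */
> > > >>> PetscErrorCode CheckMatrixBalance(Mat A)
> > > >>> {
> > > >>>   PetscInt rstart, rend;
> > > >>>   MatInfo  info;
> > > >>>   int      rank;
> > > >>>
> > > >>>   MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
> > > >>>   MatGetOwnershipRange(A, &rstart, &rend);  /* rows owned by this rank */
> > > >>>   MatGetInfo(A, MAT_LOCAL, &info);          /* local nonzero statistics */
> > > >>>   PetscSynchronizedPrintf(PETSC_COMM_WORLD,
> > > >>>       "[%d] rows %d..%d  nz used %g  allocated %g  mallocs %g\n",
> > > >>>       rank, (int)rstart, (int)rend,
> > > >>>       info.nz_used, info.nz_allocated, info.mallocs);
> > > >>>   PetscSynchronizedFlush(PETSC_COMM_WORLD); /* one-argument form (older PETSc) */
> > > >>>   return 0;
> > > >>> }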
> > > >>>
> > > >>> However, in all other parts of my code (everything except the PETSc
> > > >>> calls), I get the expected, almost perfect load balance.
> > > >>>
> > > >>> Is there anything that I am overlooking? Any help is greatly
> > > >>> appreciated.
> > > >>>
> > > >>> thanks
> > > >>> mat
> > > >>>
> > > >>> On Wednesday 02 August 2006 16:21, Matt Funk wrote:
> > > >>>> Hi Matt,
> > > >>>>
> > > >>>> It could be a bad load imbalance because I don't let PETSc decide.
> > > >>>> I need to fix that anyway, so I think I'll try that first and then
> > > >>>> let you know. Thanks for the quick response and for helping me
> > > >>>> interpret those numbers ...
> > > >>>>
> > > >>>>
> > > >>>> mat
> > > >>>>
> > > >>>> On Wednesday 02 August 2006 15:50, Matthew Knepley wrote:
> > > >>>>> On 8/2/06, Matt Funk <mafunk at nmsu.edu> wrote:
> > > >>>>>> Hi Matt,
> > > >>>>>>
> > > >>>>>> thanks for all the help so far. The -info option is really very
> > > >>>>>> helpful. So I think I straightened the actual errors out.
> > > >>>>>> However, now I am back to the original question I had, that is,
> > > >>>>>> why it takes so much longer on 4 procs than on 1 proc.
> > > >>>>>
> > > >>>>> So you have a 1.5 load imbalance for MatMult(), which probably
> > > >>>>> cascades to give the 133! load imbalance for VecDot(). You
> > > >>>>> probably have either:
> > > >>>>>
> > > >>>>>   1) VERY bad load imbalance
> > > >>>>>
> > > >>>>>   2) a screwed up network
> > > >>>>>
> > > >>>>>   3) bad contention on the network (loaded cluster)
> > > >>>>>
> > > >>>>> Can you help us narrow this down?
> > > >>>>>
> > > >>>>>
> > > >>>>>    Matt
> > > >>>>>
> > > >>>>>> I profiled the KSPSolve(...) as stage 2:
> > > >>>>>>
> > > >>>>>> For 1 proc i have:
> > > >>>>>> --- Event Stage 2: Stage 2 of ChomboPetscInterface
> > > >>>>>>
> > > >>>>>> VecDot              4000 1.0 4.9158e-02 1.0 4.74e+08 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0 18  0  0  0   2 18  0  0  0   474
> > > >>>>>> VecNorm             8000 1.0 2.1798e-01 1.0 2.14e+08 1.0 0.0e+00
> > > >>>>>> 0.0e+00 4.0e+03  1 36  0  0 28   7 36  0  0 33   214
> > > >>>>>> VecAYPX             4000 1.0 1.3449e-01 1.0 1.73e+08 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0 18  0  0  0   5 18  0  0  0   173
> > > >>>>>> MatMult             4000 1.0 3.6004e-01 1.0 3.24e+07 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  1  9  0  0  0  12  9  0  0  0    32
> > > >>>>>> MatSolve            8000 1.0 1.0620e+00 1.0 2.19e+07 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  3 18  0  0  0  36 18  0  0  0    22
> > > >>>>>> KSPSolve            4000 1.0 2.8338e+00 1.0 4.52e+07 1.0 0.0e+00
> > > >>>>>> 0.0e+00 1.2e+04  7100  0  0 84  97100  0  0100    45
> > > >>>>>> PCApply             8000 1.0 1.1133e+00 1.0 2.09e+07 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  3 18  0  0  0  38 18  0  0  0    21
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> for 4 procs i have :
> > > >>>>>> --- Event Stage 2: Stage 2 of ChomboPetscInterface
> > > >>>>>>
> > > >>>>>> VecDot              4000 1.0 3.5884e+01133.7 2.17e+07133.7
> > > >>>>>> 0.0e+00 0.0e+00 4.0e+03  8 18  0  0  5   9 18  0  0 14     1
> > > >>>>>> VecNorm             8000 1.0 3.4986e-01 1.3 4.43e+07 1.3 0.0e+00
> > > >>>>>> 0.0e+00 8.0e+03  0 36  0  0 10   0 36  0  0 29   133
> > > >>>>>> VecSet              8000 1.0 3.5024e-02 1.4 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>>>>> VecAYPX             4000 1.0 5.6790e-02 1.3 1.28e+08 1.3 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0 18  0  0  0   0 18  0  0  0   410
> > > >>>>>> VecScatterBegin     4000 1.0 6.0042e+01 1.4 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00 38  0  0  0  0  45  0  0  0  0     0
> > > >>>>>> VecScatterEnd       4000 1.0 5.9364e+01 1.4 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00 37  0  0  0  0  44  0  0  0  0     0
> > > >>>>>> MatMult             4000 1.0 1.1959e+02 1.4 3.46e+04 1.4 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00 75  9  0  0  0  89  9  0  0  0     0
> > > >>>>>> MatSolve            8000 1.0 2.8150e-01 1.0 2.16e+07 1.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0 18  0  0  0   0 18  0  0  0    83
> > > >>>>>> MatLUFactorNum         1 1.0 1.3685e-04 1.1 5.64e+06 1.1 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    21
> > > >>>>>> MatILUFactorSym        1 1.0 2.3389e-04 1.2 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>>>>> MatGetOrdering         1 1.0 9.6083e-05 1.2 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>>>>> KSPSetup               1 1.0 2.1458e-06 2.2 0.00e+00 0.0 0.0e+00
> > > >>>>>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>>>>> KSPSolve            4000 1.0 1.2200e+02 1.0 2.63e+05 1.0 0.0e+00
> > > >>>>>> 0.0e+00 2.8e+04 84100  0  0 34 100100  0  0100     1
> > > >>>>>> PCSetUp                1 1.0 5.0187e-04 1.2 1.68e+06 1.2 0.0e+00
> > > >>>>>> 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     6
> > > >>>>>> PCSetUpOnBlocks     4000 1.0 1.2104e-02 2.2 1.34e+05 2.2 0.0e+00
> > > >>>>>> 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > >>>>>> PCApply             8000 1.0 8.4254e-01 1.2 8.27e+06 1.2 0.0e+00
> > > >>>>>> 0.0e+00 8.0e+03  1 18  0  0 10   1 18  0  0 29    28
> > > >>>>>> ------------------------------------------------------------------------------------------------------------------------
> > > >>>>>>
> > > >>>>>> Now, if I understand it right, these lines summarize all
> > > >>>>>> calls made between the push and pop commands. That would mean that
> > > >>>>>> the majority of the time is spent in MatMult, and within
> > > >>>>>> that in the VecScatterBegin and VecScatterEnd calls (if I
> > > >>>>>> understand it right).
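> > > >>>>>>
> > > >>>>>> (As an aside, here is a minimal sketch, with a made-up function name, of
> > > >>>>>> the stage mechanism I mean: everything executed between the push and the
> > > >>>>>> pop is attributed to that stage in the -log_summary output. Note that the
> > > >>>>>> argument order of PetscLogStageRegister() has differed between PETSc
> > > >>>>>> versions; newer releases take the stage name first, as shown here.)
> > > >>>>>>
> > > >>>>>> #include "petscksp.h"
> > > >>>>>>
> > > >>>>>> /* Sketch only: register a log stage and attribute one KSPSolve to it. */
> > > >>>>>> PetscErrorCode SolveWithStage(KSP ksp, Vec b, Vec x)
> > > >>>>>> {
> > > >>>>>>   PetscLogStage stage;
> > > >>>>>>
> > > >>>>>>   PetscLogStageRegister("Stage 2 of ChomboPetscInterface", &stage);
> > > >>>>>>   PetscLogStagePush(stage);   /* events below are logged under this stage */
> > > >>>>>>   KSPSolve(ksp, b, x);
> > > >>>>>>   PetscLogStagePop();         /* back to the previous stage */
> > > >>>>>>   return 0;
> > > >>>>>> }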
> > > >>>>>>
> > > >>>>>> My problem size is really small, so I was wondering whether the
> > > >>>>>> problem lies in that (namely, that most of the time is simply spent
> > > >>>>>> communicating between processors), or whether there is still
> > > >>>>>> something wrong with how I wrote the code?
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> thanks
> > > >>>>>> mat
> > > >>>>>>
> > > >>>>>> On Tuesday 01 August 2006 18:28, Matthew Knepley wrote:
> > > >>>>>>> On 8/1/06, Matt Funk <mafunk at nmsu.edu> wrote:
> > > >>>>>>>> Actually, the errors occur on my calls to PETSc functions
> > > >>>>>>>> after calling PetscInitialize.
> > > >>>>>>>
> > > >>>>>>> Yes, it is the error I pointed out in the last message.
> > > >>>>>>>
> > > >>>>>>>    Matt
> > > >>>>>>>
> > > >>>>>>>> mat



