[petsc-users] Function evaluation slowness ?

Tue Aug 25 21:19:56 CDT 2015

Hi,

Problem solved !

Several points to be noticed :

1. First the discrepancy between creations and destructions showed I had
forgotten some VecDestroy, and also a VecRestoreArrayReadF90. Repairing
this seemed to increase a bit the speed but did not solve the actual more
serious problem seen in the log_summary.

2. The actual problem was a very stupid one on my side. At some point I
print small diagnostics at every time step to a text file with standard
Fortran write statement rather than a viewer to a binary file. I had simply
forgotten to put the statement between an if statement on the rank

if (rank.eq.0) then

    write(50) ....

end if

So all the processors were trying to write together to the file, which, I
suppose, somehow caused all the Scatters. After adding the if statement, I
recover a fast speed (about 10 ms per function evaluation). Thank you so
much for your help, I would never have made it so far without it !!!

3. Last minor point, a discrepancy of 1 remains between creations and
destructions (see below extract of log_summary) for the category "viewer".
I have checked that it is also the case for the examples ex5f90.F and ex5.c
on which my code is based. I can't track it down, but it's probably a minor
point anyway.

--- Event Stage 0: Main Stage

                SNES     1              1         1332     0
      SNESLineSearch     1              1          864     0
              DMSNES     2              2         1328     0
              Vector    20             20     34973120     0
      Vector Scatter     3              3       503488     0
             MatMFFD     1              1          768     0
              Matrix     1              1         2304     0
    Distributed Mesh     3              3        14416     0
Star Forest Bipartite Graph     6              6         5024     0
     Discrete System     3              3         2544     0
           Index Set     6              6       187248     0
   IS L to G Mapping     2              2       184524     0
       Krylov Solver     1              1         1304     0
     DMKSP interface     1              1          648     0
      Preconditioner     1              1          880     0
              Viewer     3              2         1536     0
         PetscRandom     1              1          624     0
========================================================================================================================

Best

Timothee

2015-08-26 2:39 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:

>
> > On Aug 25, 2015, at 2:06 AM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> >
> > OK, I see,
> >
> > Might it be that I do something a bit funky to obtain a good guess for
> solve ? I had he following idea, which I used with great success on a very
> different problem (much simpler, maybe that's why it worked) : obtain the
> initial guess as a cubic extrapolation of the preceding solutions. The idea
> is that I expect my solution to be reasonably smooth over time, so
> considering this, the increment of the fields should also be continuous (I
> solve for the increments, not the fields themselves). Therefore, I store in
> my user context the current vector Xk as well as the last three solutions
> Xkm1 and Xkm2.
> >
> > I define
> >
> > dxm2 = Xkm1 - Xkm2
> > dxm1 = Xk - Xkm1
> >
> > And I use the result of the last SNESSolve as
> >
> > dx = Xkp1 - Xk
> >
> > Then I set the new dx initial guess as the pointwise cubic extrapolation
> of (dxm2,dxm1,dx)
> >
> > However it seems pretty local and I don't see why scatters would be
> required for this.
>
>    Yes, no scatters here.
>
> >
> > I printed the routine I use to do this below. In any case I will clean
> up a bit, remove the extra stuff (not much there however). If it is not
> sufficient, I will transform my form function in a dummy which does not
> require computations and see what happens.
> >
> > Timothee
> >
> >   PetscErrorCode :: ierr
> >
> >   PetscScalar :: M(3,3)
> >   Vec         :: xkm2,xkm1
> >   Vec         :: coef1,coef2,coef3
> >   PetscScalar :: a,b,c,t,det
> >
> >   a = user%tkm1
> >   b = user%tk
> >   c = user%t
> >   t = user%t+user%dt
> >
> >   det = b*a**2 + c*b**2 + a*c**2 - (c*a**2 + a*b**2 + b*c**2)
> >
> >   M(1,1) = (b-c)/det
> >   M(2,1) = (c**2-b**2)/det
> >   M(3,1) = (c*b**2-b*c**2)/det
> >
> >   M(1,2) = (c-a)/det
> >   M(2,2) = (a**2-c**2)/det
> >   M(3,2) = (a*c**2-c*a**2)/det
> >
> >   M(1,3) = (a-b)/det
> >   M(2,3) = (b**2-a**2)/det
> >   M(3,3) = (b*a**2-a*b**2)/det
> >
> >   call VecDuplicate(x,xkm1,ierr)
> >   call VecDuplicate(x,xkm2,ierr)
> >
> >   call VecDuplicate(x,coef1,ierr)
> >   call VecDuplicate(x,coef2,ierr)
> >   call VecDuplicate(x,coef3,ierr)
> >
> >   call VecWAXPY(xkm2,-one,user%Xkm2,user%Xkm1,ierr)
> >   call VecWAXPY(xkm1,-one,user%Xkm1,user%Xk,ierr)
> >
> >   ! The following lines correspond to the following simple operation
> >   ! coef1 = M(1,1)*alpha + M(1,2)*beta + M(1,3)*gamma
> >   ! coef2 = M(2,1)*alpha + M(2,2)*beta + M(2,3)*gamma
> >   ! coef3 = M(3,1)*alpha + M(3,2)*beta + M(3,3)*gamma
> >   call VecCopy(xkm2,coef1,ierr)
> >   call VecScale(coef1,M(1,1),ierr)
> >   call VecAXPY(coef1,M(1,2),xkm1,ierr)
> >   call VecAXPY(coef1,M(1,3),x,ierr)
> >
> >   call VecCopy(xkm2,coef2,ierr)
> >   call VecScale(coef2,M(2,1),ierr)
> >   call VecAXPY(coef2,M(2,2),xkm1,ierr)
> >   call VecAXPY(coef2,M(2,3),x,ierr)
> >
> >   call VecCopy(xkm2,coef3,ierr)
> >   call VecScale(coef3,M(3,1),ierr)
> >   call VecAXPY(coef3,M(3,2),xkm1,ierr)
> >   call VecAXPY(coef3,M(3,3),x,ierr)
> >
> >   call VecCopy(coef3,x,ierr)
> >   call VecAXPY(x,t,coef2,ierr)
> >   call VecAXPY(x,t**2,coef1,ierr)
> >
> >   call VecDestroy(xkm2,ierr)
> >   call VecDestroy(xkm1,ierr)
> >
> >   call VecDestroy(coef1,ierr)
> >   call VecDestroy(coef2,ierr)
> >   call VecDestroy(coef3,ierr)
> >
> >
> >
> > 2015-08-25 15:47 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:
> >
> >   The results are kind of funky,
> >
> >
> ------------------------------------------------------------------------------------------------------------------------
> > Event                Count      Time (sec)     Flops
>          --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> >
> ------------------------------------------------------------------------------------------------------------------------
> > SNESSolve             40 1.0 4.9745e+02 3.3 4.25e+09 1.0 1.7e+06 3.8e+04
> 2.7e+03 46 93 99 95 80  46 93 99 95 80  2187
> > SNESFunctionEval     666 1.0 4.8990e+02 3.4 5.73e+08 1.0 1.7e+06 3.8e+04
> 1.3e+03 45 13 99 95 40  45 13 99 95 40   299
> > SNESLineSearch        79 1.0 3.8578e+00 1.0 4.98e+08 1.0 4.0e+05 3.8e+04
> 6.3e+02  1 11 23 23 19   1 11 23 23 19 33068
> > VecScatterEnd       1335 1.0 3.4761e+02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 31  0  0  0  0  31  0  0  0  0     0
> > MatMult MF           547 1.0 1.2570e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04
> 1.1e+03  2 28 81 78 34   2 28 81 78 34 25962
> > MatMult              547 1.0 1.2571e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04
> 1.1e+03  2 28 81 78 34   2 28 81 78 34 25960
> >
> > look at the %T time for global SNES solve is 46 % of the total time,
> function evaluations are 45% but MatMult are only 2% (and yet matmult
> should contain most of the function evaluations). I cannot explain this.
> Also the VecScatterEnd is HUGE and has a bad load balance of 5.8  Why are
> there so many more scatters than function evaluations? What other
> operations are you doing that require scatters?
> >
> > It's almost like you have some mysterious "extra" function calls outside
> of the SNESSolve that are killing the performance? It might help to
> understand the performance to strip out all extraneous computations not
> needed (like in custom monitors etc).
> >
> >  Barry
> >
> >
> >
> >
> >
> >
> > > On Aug 25, 2015, at 1:21 AM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > >
> > > Here is the log summary (attached). At the beginning are personal
> prints, you can skip. I seem to have a memory crash in the present state
> after typically 45 iterations (that's why I used 40 here), the log summary
> indicates some creations without destruction of Petsc objects (I will fix
> this immediately), that may cause the memory crash, but I don't think it's
> the cause of the slow function evaluations.
> > >
> > > The log_summary is consistent with 0.7s per function evaluation
> (4.8990e+02/666 = 0.736). In addition, SNESSolve itself takes approximately
> the same amount of time (is it normal ?). And the other long operation is
> VecScatterEnd. I assume it is the time used in process communications ? In
> which case I suppose it is normal that it takes a significant amount of
> time.
> > >
> > > So this ~10 times increase does not look normal right ?
> > >
> > > Best
> > >
> > > Timothee NICOLAS
> > >
> > >
> > > 2015-08-25 14:56 GMT+09:00 Barry Smith <bsmith at mcs.anl.gov>:
> > >
> > > > On Aug 25, 2015, at 12:45 AM, Timothée Nicolas <
> timothee.nicolas at gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I am testing PETSc on the supercomputer where I used to run my
> explicit MHD code. For my tests I use 256 processes on a problem of size
> 128*128*640 = 10485760, that is, 40960 grid points per process, and 8
> degrees of freedom (or physical fields). The explicit code was using
> Runge-Kutta 4 for the time scheme, which means 4 function evaluation per
> time step (plus one operation to put everything together, but let's forget
> this one).
> > > >
> > > > I could thus easily determine that the typical time required for a
> function evaluation was of the order of 50 ms.
> > > >
> > > > Now with the implicit Newton-Krylov solver written in PETSc, in the
> present state where for now I have not implemented any Jacobian or
> preconditioner whatsoever (so I run with -snes_mf), I measure a typical
> time between two time steps of between 5 and 20 seconds, and the number of
> function evaluations for each time step obtained with
> SNESGetNumberFunctionEvals is 17 (I am speaking of a particular case of
> course)
> > > >
> > > > This means a time per function evaluation of about 0.5 to 1 second,
> that is, 10 to 20 times slower.
> > > >
> > > > So I have some questions about this.
> > > >
> > > > 1. First does SNESGetNumberFunctionEvals take into account the
> function evaluations required to evaluate the Jacobian when -snes_mf is
> used, as well as the operations required by the GMRES (Krylov) method ? If
> it were the case, I would somehow intuitively expect a number larger than
> 17, which could explain the increase in time.
> > >
> > > PetscErrorCode  SNESGetNumberFunctionEvals(SNES snes, PetscInt *nfuncs)
> > > {
> > >   *nfuncs = snes->nfuncs;
> > > }
> > >
> > > PetscErrorCode  SNESComputeFunction(SNES snes,Vec x,Vec y)
> > > {
> > > ...
> > >   snes->nfuncs++;
> > > }
> > >
> > > PetscErrorCode  MatCreateSNESMF(SNES snes,Mat *J)
> > > {
> > > .....
> > >   if (snes->pc && snes->pcside == PC_LEFT) {
> > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode
> (*)(void*,Vec,Vec))SNESComputeFunctionDefaultNPC,snes);CHKERRQ(ierr);
> > >   } else {
> > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode
> (*)(void*,Vec,Vec))SNESComputeFunction,snes);CHKERRQ(ierr);
> > >   }
> > > }
> > >
> > >   So, yes I would expect all the function evaluations needed for the
> matrix-free Jacobian matrix vector product to be counted. You can also look
> at the number of GMRES Krylov iterations it took (which should have one
> multiply per iteration) to double check that the numbers make sense.
> > >
> > >   What does your -log_summary output look like? One thing that GMRES
> does is it introduces a global reduction with each multiple (hence a
> barrier across all your processes) on some systems this can be deadly.
> > >
> > >   Barry
> > >
> > >
> > > >
> > > > 2. In any case, I thought that all things considered, the function
> evaluation would be the most time consuming part of a Newton-Krylov solver,
> am I completely wrong about that ? Is the 10-20 factor legit ?
> > > >
> > > > I realize of course that preconditioning should make all this
> smoother, in particular allowing larger time steps, but here I am just
> concerned about the sheer Function evaluation time.
> > > >
> > > > Best regards
> > > >
> > > > Timothee NICOLAS
> > >
> > >
> > > <log_summary.txt>
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150826/bdf05401/attachment.html>