[petsc-users] Block preconditioning for 3d problem

Thu Oct 24 10:52:04 CDT 2019

  If you are "throwing things" away in computing the Jacobian then any expectation that Newton will converge is also thrown out the window. 

  If you used the PETSc SNES nonlinear solver you would have much more space to experiment in. For example you could use -snes_mf_operator to apply the "true" matrix-vector product (computed with differencing) while building the preconditioner by "throwing things" away and can still expect convergence of the nonlinear solver.  It also has tools like -snes_jacobian_test that compare your Jacobian to one computed by PETSc to see if there are bugs in your Jacobian code and can compute the full Jacobian with differencing and coloring pretty efficiently.  Also a large set of line search algorithms.

  Newton is not simple (though it looks it) and you should bring power Newton tools to the game, SNES, not home-brew Newton solvers, especially when tackling problems with potentially interesting nonlinearities.

 Barry

> On Oct 14, 2019, at 8:18 PM, Dave Lee <davelee2804 at gmail.com> wrote:
> 
> Hi Barry,
> 
> I've replied inline:
> 
> On Mon, Oct 14, 2019 at 4:07 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> 
>   Thanks. It seems that the iterative method is solved the entire linear system very accurately. For the last linear solve the initial true residual norm is around 10^6 (i.e. the norm of b) and the final one around 10^-10
> 
>   The initial  true residual norm on each block (part of a slab on one process) is around 10^-1 so I don't see a huge difference in scaling between the blocks. The final true residual for blocks is around 10^-17 so each block is well solved.
> 
> Thank you for taking the time to look at this Barry. Glad to know there's nothing too crazy going on here.
>  
> 
>    What do you mean by "quasi-Newton"? Are you doing any kind of line search or just using the full step from the linear solve? 
> 
> I'm using a full step from the linear solve. By "quasi-Newton" I just mean that the Jacobian is only approximate. I've left out and simplified some terms in order to make the Schur complement more tractable
>  
> 
>    When you solve with the single KSP over all slabs the convergence is determined by ||r_all||/||b_all|| < tol while slab by slab it is ||r_slab||/||b_slab|| < tol for each slab. If the slabs had hugely different scalings this could result in different "answers" but since the slabs seem to have very similar scalings the "answers" should be very similar. 
> 
>    I would do the following. Solve the first linear system both ways and then compute the difference in the solution to the linear systems computed both ways. Relative to ||b|| the norm of the difference of the two linear systems solutions should be 10^-15 indicating both approaches produced almost identical answers. If this is the case then the problem is elsewhere, not in the solution of the linear system. On the other hand if the difference is large then likely the problem is in the formulation of the linear system, perhaps the combined linear system is not actually the individual linear systems.
> 
> Thanks for the tip Barry. I tried this, and the normalised difference between the full solve and the slab-wise solve is approximately 1.0e-4 to 1.0e-3 (depending on the slab), so I guess the problem is in how I'm assembling the linear system for the full solve.
> 
> I've had a look into this and haven't noticed anything obvious (but I will keep digging). One difference between the full and slab-wise solves is that for the full solve each processor has num_slabs * num_slab_local_dofs DOFSs, whereas for the slab-wise solves each processor has just num_slab_local_dofs, so I'm wondering if the striding of the slab_local_dofs is causing issues for the full solve. I'm going to try replacing the PCBJACOBI preconditioner with a PCFIELDSPLIT preconditioner, and use PCFieldSplitSetIS() to specify dofs within a given slab for each field, and see if that helps.
>  
> 
>   Barry
> 
> By the way, using a block Jacobi with Jacobi as the preconditioner on each block  and a really tight tolerance for each block is not a practical method for larger problems, though I think it is ok for this test.
> 
> Thanks Barry. I just wanted to do something simple for this test, but I take your point. 
> 
> Cheers, Dave.
> 
> 
> 
> 
> 
> > On Oct 13, 2019, at 6:20 AM, Dave Lee <davelee2804 at gmail.com> wrote:
> > 
> > Hi Barry,
> > 
> > I've answered your questions as follows:
> > 
> >  So each slab lives on all the processes?
> > 
> > Yes, that's correct.
> > 
> > How many processes and how many slabs are you running with?
> > 
> > For my test problem there are 6 processors and 8 slabs
> > 
> > There will be num_slabs blocks per processor times the number of processors. Depending on the layout of the unknowns each block will consist of a entire local slap or each block will consist of pieces from each slap.
> > 
> > The internal slab DOF is the minor index. Each processor is responsible for 144 DOFs per slab, so each processor is responsible for 8x144 DOFs. When I run with -ksp_view there are 8 blocks on each processor (48 total), each with 144 DOF's, so I'm assuming everything is ok there.
> > 
> >  Please run the linear solver with
> > 
> >     -ksp_monitor_true_residual -sub_ksp_monitor_true_residual -ksp_converged_reason -sub_ksp_converged_reason
> > 
> > 
> >   Do you get huge iteration counts somewhere?
> > 
> > Not that I can see, file is attached (log.gmres)
> > 
> > Next run the same thing but add -ksp_type fgmres does that change anything?
> > 
> > I don't think so. The convergence is still the same, file is attached (log.fgmres).
> > 
> > One thing I've noticed - I've been dumping the index of the slab with the maximum normalised error: norm_L2(delta_x) / norm_L2(x). If I solve for each slab separately, then the maximum normalised errors are typically in slab 3 (of 0-7). This makes sense as this is probably the most dynamically active layer of the atmosphere. However if I solve for all the slabs in a single solve (as is the cases for these tests), then the greatest error is always in slab 7, the top of the atmosphere. Obviously the pressure is lowest in this level, so I'm wondering if the GMRES iterations are just weighting for the lower layers of the atmosphere too much.
> > 
> > Cheers, Dave.
> > 
> > On Sun, Oct 13, 2019 at 3:42 AM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
> > 
> > 
> > > On Oct 10, 2019, at 2:58 AM, Dave Lee via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > > 
> > > Hi PETSc,
> > > 
> > > I have a nonlinear 3D problem for a set of uncoupled 2D slabs. (Which I ultimately want to couple once this problem is solved).
> > > 
> > > When I solve the inner linear problem for each of these 2D slabs individually (using KSPGMRES) the convergence of the outer nonlinear problem is good. However when I solve the inner linear problem as a single 3D problem (with no coupling between the 2D slabs, so the matrix is effectively a set of uncoupled blocks, one for each 2D slab) the outer nonlinear convergence degrades dramatically.
> > > 
> > > Note that I am not using SNES, just my own quasi-Newton approach for the outer nonlinear problem.
> > > 
> > > I suspect that the way to recover the convergence for the 3D coupled problem is to use some sort of PCBJACOBI or PCFIELDSPLIT preconditioner, but I'm not entirely sure. I've tried following this example:
> > > https://www.mcs.anl.gov/petsc/petsc-current/src/ksp/ksp/examples/tutorials/ex7.c.html
> > > but without any improvement in the convergence. 
> > > 
> > > On each processor the slab index is the major index and the internal slab DOF index is the minor index. The parallel decomposition is within the 2D slab dimensions only, not between slabs.
> > 
> >   So each slab lives on all the processes?
> > 
> > > 
> > > I'm configuring the 3D inner linear solver as
> > 
> >   How many processes and how many slabs are you running with? 
> > 
> > > 
> > > KSPGetPC(ksp, &pc);
> > > PCSetType(pc, PCBJACOBI);
> > > PCBJacobiSetLocalBlocks(pc, num_slabs, NULL);
> > 
> >     There will be num_slabs blocks per processor times the number of processors. Depending on the layout of the unknowns each block will consist of a entire local slap or each block will consist of pieces from each slap. 
> > 
> > > KSPSetUp(ksp);
> > > PCBJacobiGetSubKSP(pc, &nlocal, &first_local, &subksp);
> > > for(int ii = 0; ii < nlocal; ii++) {
> > >     KSPGetPC(subksp[ii], &subpc);
> > >     PCSetType(subpc, PCJACOBI);
> > >     KSPSetType(subksp[ii], KSPGMRES);
> > >     KSPSetTolerances(subksp[ii], 1.e-12, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
> > > }
> > 
> >   Please run the linear solver with 
> > 
> >     -ksp_monitor_true_residual -sub_ksp_monitor_true_residual -ksp_converged_reason -sub_ksp_converged_reason 
> > 
> > 
> >   Do you get huge iteration counts somewhere? 
> > 
> >   Next run the same thing but add -ksp_type fgmres does that change anything?
> > 
> >   Send all the results in an attached file.
> > 
> > Barry
> > 
> > 
> > 
> > > 
> > > However this does not seem to change the outer nonlinear convergence. When I run with ksp_view the local block sizes look correct - number of local 2D slab DOFs for each block on each processor.
> > > 
> > > Any ideas on what sort of KSP/PC configuration I should be using to recover the convergence of the uncoupled 2D slab linear solve for this 3D linear solve? Should I be rethinking my node indexing to help with this?
> > > 
> > > Cheers, Dave.
> > 
> > <log.fgmres><log.gmres>
>