[petsc-users] KSP changes for successive solver

Dave May dave.mayhem23 at gmail.com
Sun Jul 26 12:11:09 CDT 2015


Hi Barry,


On 26 July 2015 at 05:18, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   This dmdarepart business, which I am guessing is running PCMG on smaller
> sets of processes with a DMDA on that smaller set of processes


Yes. The implementation queries the PC for a DMDA, creates a new DMDA on a
communicator of smaller size and attaches it to a KSP which lives on the
same reduced communicator. Matrices and vectors are permuted into the new
ordering associated with the reduced DMDA. By attaching the reduced DMDA to
the nested KSP, Galerkin MG can be invoked from the command line.

One notable difference from the subcomm creation routine used by
PCREDUNDANT is that this implementation introduces idle cores.
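To make the idea concrete, here is a rough, untested sketch of the
mechanism (error checking omitted). The names nred, M, N, P, dof and
stencil_width are placeholders, and the real implementation additionally
sets up the permutation/scatter of the operator and vectors onto the
reduced layout:

  PetscMPIInt rank;
  MPI_Comm    comm = PetscObjectComm((PetscObject)pc), subcomm;
  DM          da_fine, da_reduced;
  KSP         ksp_reduced;

  /* DMDA attached to the outer PC */
  PCGetDM(pc, &da_fine);

  /* keep the first nred ranks; the remaining ranks become idle */
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_split(comm, rank < nred ? 0 : MPI_UNDEFINED, rank, &subcomm);

  if (subcomm != MPI_COMM_NULL) {
    /* same global grid, repartitioned over the smaller communicator */
    DMDACreate3d(subcomm, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                 DMDA_STENCIL_BOX, M, N, P,
                 PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                 dof, stencil_width, NULL, NULL, NULL, &da_reduced);
    KSPCreate(subcomm, &ksp_reduced);
    /* attaching the reduced DMDA lets -xxx_pc_type mg -xxx_pc_mg_galerkin
       build the Galerkin hierarchy entirely on the reduced communicator */
    KSPSetDM(ksp_reduced, da_reduced);
    /* operators are provided explicitly (permuted), not computed from the DM */
    KSPSetDMActive(ksp_reduced, PETSC_FALSE);
  }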

I specifically wanted to use a smaller communicator, rather than have many
cores with zero local length, because if I want to use a Krylov method
requiring global reductions, I want those reductions performed over fewer
cores. On the network of the Cray XE6 I was using when this was being
developed, the reduction times were absolutely killing my performance at
around 8k-16k cores.


> for a coarse problem is a fine idea but you should keep in mind the rule
> of thumb that parallel iterative (and even more so direct) solvers don't
> do well when there are roughly 10,000 or fewer degrees of freedom per
> processor.


The original motivation for this repartitioning implementation was
precisely to deal with the issue you raise above. In my application, my
subdomains are on the order of 24^3 Q2 elements, and with the current DMDA
implementation this means I can typically use 4 levels of multi-grid. For
the large jobs (> 4k cores), this results in large coarse grid problems.
The coarse grids would generally consist of a single Q2 element per
subdomain, with at most 375 non-zeros per row (3D, velocity vector).
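
(Roughly speaking: a 24^3-element Q2 subdomain owns about 48 velocity nodes
in each direction, and halving that at every level quickly leaves each core
with only a handful of nodes, at which point the DMDA code can no longer
construct the next interpolation.)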

In my experience, I found the setup cost associated with PCREDUNDANT +
sparse direct to be too high for it to be a practical option. In addition,
the elliptic operators from my application possessed large coefficient
variations, making convergence of any standard iterative solver
(GMRES/{ASM,BJACOBI}) difficult. Even with PCREDUNDANT on reduced comm
sizes in combination with SUPERLU_DIST or MUMPS, I failed to find a
strategy which was both reliable and sufficiently fast.

I should note that I did have a lot of success using GAMG as my coarse
level solver at the scale I wanted to run at (at the time this was <16k
cores). At 200k cores, however, this might not be the case due to AMG setup
times.
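
For reference, for a code that calls KSPSetDM() with a DMDA, this kind of
hybrid setup can be requested entirely from the command line, along the
lines of what Barry suggests further down in this thread:

  -pc_type mg -pc_mg_galerkin -mg_coarse_pc_type gamg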

However, since GMG is always faster when it is applicable, I really wanted
to be able to continue coarsening until the point where the Galerkin coarse
operators no longer capture my coefficient variations. For my problem, this
occurs at approximately 6-7 levels of coarsening, but the constraints
imposed by the DMDA partitioning implementation prevented such a hierarchy
from being defined.

I've raised the issue before and the consensus seemed to be that faking a
reduced comm size by having zero-length local rows in a Mat/Vec is just
fine. That might be the case, but this philosophy cannot be used in
conjunction with the current DMDA infrastructure to perform a "full
machine" run (for strong or weak scaling studies) using a geometric
multi-grid implementation whilst simultaneously retaining the fastest, most
scalable solution strategy.

In the end that is what I want to do. :D

I would be happy to contribute a similar repartitioning preconditioner to
PETSc.


Cheers,
  Dave

> So you should definitely not be using SuperLU_DIST in parallel to solve a
> problem with 1048 degrees of freedom on 128 processes, just use PCREDUNDANT
> and its default (sequential) LU. That should be faster.
>
>   Barry
>
> > On Jul 25, 2015, at 10:09 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >
> >  Don't use
> >
> > -mg_coarse_pc_factor_mat_solver_package superlu_dist
> > -mg_coarse_pc_type lu
> >
> >  with 8000+ processes and 1 degree of freedom per process SuperLU_DIST
> will be terrible. Just leave the defaults for this and send the -log_summary
> >
> >  Barry
> >
> >> On Jul 24, 2015, at 2:44 PM, Michele Rosso <mrosso at uci.edu> wrote:
> >>
> >> Barry,
> >>
> >> I attached ksp_view and log_summary for two different setups:
> >>
> >> 1) Plain MG on 5 levels + LU at the coarse level (files ending in mg5)
> >> 2) Plain MG on 5 levels + custom PC + LU at the coarse level (files
> ending in mg7)
> >>
> >> The custom PC works on a subset of processes, thus allowing me to use two
> more levels of MG, for a total of 7.
> >> Case 1) is extremely slow ( ~ 20 sec per solve ) and converges in 21
> iterations.
> >> Case 2) is way faster ( ~ 0.25 sec per solve ) and converges in 29
> iterations.
> >>
> >> Thanks for your help!
> >>
> >> Michele
> >>
> >>
> >> On Fri, 2015-07-24 at 13:56 -0500, Barry Smith wrote:
> >>>  The coarse problem for the PCMG (geometric multigrid) is
> >>>
> >>> Mat Object:       8192 MPI processes
> >>>        type: mpiaij
> >>>        rows=8192, cols=8192
> >>>
> >>> then it tries to solve it with algebraic multigrid on 8192 processes
> (which is completely insane). A lot of the time is spent in setting up the
> algebraic multigrid (not surprisingly).
> >>>
> >>> 8192 is kind of small to parallelize.  Please run the same code but
> with the default coarse grid problem instead of PCGAMG and send us the
> -log_summary again
> >>>
> >>>  Barry
> >>>
> >>>
> >>>> On Jul 24, 2015, at 1:35 PM, Michele Rosso <mrosso at uci.edu> wrote:
> >>>>
> >>>> Hi Mark and Barry,
> >>>>
> >>>> I am sorry for my late reply: it was a busy week!
> >>>> I ran a test case for a larger problem with as many levels of MG
> (i.e. 5) as I could and GAMG as the PC at the coarse level. I attached the
> output of -info (after grepping for "gamg"), ksp_view and log_summary.
> >>>> The solve takes about 2 seconds on 8192 cores, which is way too much.
> The number of iterations to convergence is 24.
> >>>> I hope there is a way to speed it up.
> >>>>
> >>>> Thanks,
> >>>> Michele
> >>>>
> >>>>
> >>>> On Fri, 2015-07-17 at 09:38 -0400, Mark Adams wrote:
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 16, 2015 at 8:18 PM, Michele Rosso <mrosso at uci.edu>
> wrote:
> >>>>> Barry,
> >>>>>
> >>>>> thank you very much for the detailed answer.  I tried what you
> suggested and it works.
> >>>>> So far I tried on a small system but the final goal is to use it for
> very large runs.  How does PCGAMG compare to PCMG as far as performance
> and scalability are concerned?
> >>>>> Also, could you help me to tune the GAMG part ( my current setup is
> in the attached ksp_view.txt file )?
> >>>>>
> >>>>>
> >>>>>
> >>>>> I am going to add this to the document today but you can run with
> -info.  This is very noisy so you might want to do the next step at run
> time.  Then grep on GAMG.  This will be about 20 lines.  Send that to us
> and we can go from there.
> >>>>>
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> I also tried to use superlu_dist for the LU decomposition on
> mg_coarse_mg_sub_
> >>>>> -mg_coarse_mg_coarse_sub_pc_type lu
> >>>>> -mg_coarse_mg_coarse_sub_pc_factor_mat_solver_package superlu_dist
> >>>>>
> >>>>> but I got an error:
> >>>>>
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> ****** Error in MC64A/AD. INFO(1) = -2
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>> symbfact() error returns 0
> >>>>>
> >>>>>
> >>>>> Thank you,
> >>>>> Michele
> >>>>>
> >>>>>
> >>>>> On Thu, 2015-07-16 at 18:07 -0500, Barry Smith wrote:
> >>>>>>
> >>>>>>> On Jul 16, 2015, at 5:42 PM, Michele Rosso <mrosso at uci.edu> wrote:
> >>>>>>>
> >>>>>>> Barry,
> >>>>>>>
> >>>>>>> thanks for your reply. So if I want it fixed, I will have to use
> the master branch, correct?
> >>>>>>
> >>>>>>
> >>>>>>  Yes, or edit mg.c and remove the offending lines of code (easy
> enough).
> >>>>>>
> >>>>>>>
> >>>>>>> On a side note, what I am trying to achieve is to be able to use
> as many levels of MG as I want, despite the limitation imposed by the local
> number of grid nodes.
> >>>>>>
> >>>>>>
> >>>>>>   I assume you are talking about DMDA? There is no generic
> limitation for PETSc's multigrid; it is only the way the DMDA code
> figures out the interpolation that causes a restriction.
> >>>>>>
> >>>>>>
> >>>>>>> So far I am using a borrowed code that implements a PC that
> creates a sub communicator and performs MG on it.
> >>>>>>> While reading the documentation I found out that PCMGSetLevels
> takes in an optional array of communicators. How does this work?
> >>>>>>
> >>>>>>
> >>>>>>   It doesn't work. It was an idea that never got pursued.
> >>>>>>
> >>>>>>
> >>>>>>> Can I simply define my matrix and rhs on the fine grid as I
> would do normally ( I do not use kspsetoperators and kspsetrhs ) and KSP
> would take care of it by using the correct communicator for each level?
> >>>>>>
> >>>>>>
> >>>>>>   No.
> >>>>>>
> >>>>>>   You can use the PCMG geometric multigrid with DMDA for as many
> levels as it works and then use PCGAMG as the coarse grid solver. PCGAMG
> automatically uses fewer processes for the coarse level matrices and
> vectors. You could do this all from the command line without writing code.
> >>>>>>
> >>>>>>   For example if your code uses a DMDA and calls KSPSetDM() use for
> example -da_refine 3 -pc_type mg -pc_mg_galerkin -mg_coarse_pc_type gamg
> -ksp_view
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>  Barry
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Michele
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, 2015-07-16 at 17:30 -0500, Barry Smith wrote:
> >>>>>>>>   Michel,
> >>>>>>>>
> >>>>>>>>    This is a very annoying feature that has been fixed in master
> >>>>>>>> http://www.mcs.anl.gov/petsc/developers/index.html
> >>>>>>>>  I would like to have changed it in maint but Jed would have a
> shit-fit :-) since it changes behavior.
> >>>>>>>>
> >>>>>>>>  Barry
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Jul 16, 2015, at 4:53 PM, Michele Rosso <mrosso at uci.edu>
> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I am performing a series of solves inside a loop. The matrix for
> each solve changes but not enough to justify a rebuild of the PC at each
> solve.
> >>>>>>>>> Therefore I am using  KSPSetReusePreconditioner to avoid
> rebuilding unless necessary. The solver is CG + MG with a custom  PC at the
> coarse level.
> >>>>>>>>> If KSP is not updated each time, everything works as it is
> supposed to.
> >>>>>>>>> When instead I allow the default PETSc behavior, i.e. updating
> the PC every time the matrix changes, the coarse level KSP, initially set to
> PREONLY, is changed into GMRES
> >>>>>>>>> after the first solve. I am not sure where the problem lies (my
> PC or PETSc), so I would like to have your opinion on this.
> >>>>>>>>> I attached the ksp_view for the 2 successive solve and the
> options stack.
> >>>>>>>>>
> >>>>>>>>> Thanks for your help,
> >>>>>>>>> Michel
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> <ksp_view.txt><petsc_options.txt>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> <info.txt><ksp_view.txt><log_gamg.txt>
> >>>
> >>>
> >>>
> >>
> >> <ksp_view_mg5.txt><ksp_view_mg7.txt><log_mg5.txt><log_mg7.txt>
> >
>
>