[petsc-users] Receiving DIVERGED_PCSETUP_FAILED

Hong hzhang at mcs.anl.gov
Wed Jun 22 09:53:46 CDT 2016


Faraz:

> Just an update: I got this to work by rebuilding petsc with parmetis,
> metis and ptscotch. Then I used these settings for MUMPS:
>
>     icntl = 28; ival = 2;
>     ierr = MatMumpsSetIcntl(F,icntl,ival);CHKERRQ(ierr);
>
>     icntl = 29; ival = 1;
>     ierr = MatMumpsSetIcntl(F,icntl,ival);CHKERRQ(ierr);
>

These options select MUMPS's parallel symbolic factorization (ICNTL(28)=2)
with the PT-SCOTCH matrix ordering (ICNTL(29)=1).
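
The same settings can also be passed through the options database at run
time, without code changes (a sketch; the -mat_mumps_icntl_<n> options map
directly onto MUMPS's ICNTL array):

    mpiexec -np 48 ./ex12 -ksp_converged_reason \
        -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 1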

>
> It still took 4X longer to solve than Intel Pardiso. But after
> re-configuring petsc with --with-debugging=0, it ran faster. Still slower
> than Pardiso, but only 2X slower.
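>
> A configure line along these lines gives such a build (a sketch; the
> --download-* flags fetch and build the packages at configure time, and
> ScaLAPACK is a prerequisite of parallel MUMPS):
>
>     ./configure --with-debugging=0 --download-mumps --download-scalapack \
>         --download-parmetis --download-metis --download-ptscotch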
>

I've seen reports that Intel Pardiso is much faster than MUMPS; e.g., SLEPc
developer Jose sent me the following:
With mumps:

MatSolve              16 1.0 1.0962e+01
MatLUFactorSym         1 1.0 3.1131e+00
MatLUFactorNum         1 1.0 2.6120e+00

With mkl_pardiso:

MatSolve              16 1.0 6.4163e-01
MatLUFactorSym         1 1.0 2.4772e+00
MatLUFactorNum         1 1.0 8.6419e-01

However, PETSc only interfaces with the sequential mkl_pardiso. Did you get
your results in parallel or in sequential runs?
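
For such comparisons the solver package can be switched on the command line
without recompiling (a sketch; -pc_factor_mat_solver_package is the
petsc-3.7 option name, and the mkl_pardiso interface only runs sequentially):

    mpiexec -np 48 ./ex12 -pc_type lu -pc_factor_mat_solver_package mumps
    ./ex12 -pc_type lu -pc_factor_mat_solver_package mkl_pardiso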

Hong

>
>
>
>
> ------------------------------
> *From:* Faraz Hussain <faraz_hussain at yahoo.com>
> *To:* Barry Smith <bsmith at mcs.anl.gov>
> *Cc:* "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>
> *Sent:* Friday, June 10, 2016 5:27 PM
>
> *Subject:* Re: [petsc-users] Receiving DIVERGED_PCSETUP_FAILED
>
> I think the issue is I need to play more with the "parallel" settings
> here.
>
>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MATSOLVERMUMPS.html
>
> The example ex52.c was based on sequential runs, so doing mpiexec -np 48
> was basically just using one processor. I also never installed parmetis,
> metis or ptscotch.
>
> Will install and adjust the MUMPS settings and hopefully will get it to
> converge this weekend!
>
>
> ------------------------------
> *From:* Barry Smith <bsmith at mcs.anl.gov>
> *To:* Faraz Hussain <faraz_hussain at yahoo.com>
> *Cc:* "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>
> *Sent:* Friday, June 10, 2016 4:02 PM
> *Subject:* Re: [petsc-users] Receiving DIVERGED_PCSETUP_FAILED
>
>
> > On Jun 10, 2016, at 3:56 PM, Faraz Hussain <faraz_hussain at yahoo.com>
> > wrote:
> >
> > Thanks for the suggestions. I checked but was not able to find out how
> > to change the mumps row pivot threshold of 1e-06. Maybe I will ask on the
> > mumps user forum.
>
> This might help.
>
>   ierr = PetscOptionsReal("-mat_mumps_cntl_1","CNTL(1): relative pivoting threshold","None",mumps->id.CNTL(1),&mumps->id.CNTL(1),NULL);CHKERRQ(ierr);
>   ierr = PetscOptionsReal("-mat_mumps_cntl_2","CNTL(2): stopping criterion of refinement","None",mumps->id.CNTL(2),&mumps->id.CNTL(2),NULL);CHKERRQ(ierr);
>   ierr = PetscOptionsReal("-mat_mumps_cntl_3","CNTL(3): absolute pivoting threshold","None",mumps->id.CNTL(3),&mumps->id.CNTL(3),NULL);CHKERRQ(ierr);
>   ierr = PetscOptionsReal("-mat_mumps_cntl_4","CNTL(4): value for static pivoting","None",mumps->id.CNTL(4),&mumps->id.CNTL(4),NULL);CHKERRQ(ierr);
>
> Note that I don't know what these mean, so you will need to read the MUMPS
> docs.
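>
> These can be set from the command line (e.g. -mat_mumps_cntl_1 0.01) or in
> code once the factored matrix is available (a sketch, assuming pc comes
> from KSPGetPC and -pc_type lu is in use; MatMumpsSetCntl is the real-valued
> counterpart of the MatMumpsSetIcntl call used above):
>
>   Mat F;
>   ierr = PCFactorGetMatrix(pc,&F);CHKERRQ(ierr);
>   ierr = MatMumpsSetCntl(F,1,0.01);CHKERRQ(ierr);  /* CNTL(1): relative pivoting threshold */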
>
> > Regarding:
> >
> > > You need to look at the condition number just before GMRES reaches
> > > the restart. It has to start all over again at the restart. So what was
> > > the estimated condition number at 999 iterations?
> >
> > I ran again and the condition number at 999 iterations is:
> >
> > 999 KSP preconditioned resid norm 5.921717188418e-02 true resid norm 5.921717188531e-02 ||r(i)||/||b|| 4.187286380279e-03
> > 999 KSP Residual norm 5.921717188418e-02 % max 1.070338898624e+05 min 1.002755075294e-01 max/min 1.067398136390e+06
>
> Ok, so a relatively ill-conditioned matrix, but seemingly not terrible.
>
>   Barry
>
>
> >
> >
> > From: Barry Smith <bsmith at mcs.anl.gov>
> > To: Faraz Hussain <faraz_hussain at yahoo.com>
> > Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>
> > Sent: Thursday, June 9, 2016 5:56 PM
> > Subject: Re: [petsc-users] Receiving DIVERGED_PCSETUP_FAILED
> >
> >
> > > On Jun 9, 2016, at 3:32 PM, Faraz Hussain <faraz_hussain at yahoo.com>
> > > wrote:
> > >
> > > I have been following ex52.c in ksp/ksp/examples/tutorials to use MUMPS
> > > to directly solve Ax=b. My matrix is symmetric and positive definite. I
> > > built a small cantilever beam model with a matrix of size 5000^2. It
> > > solves in 2 seconds and gives the correct answer. But when I use a finer
> > > mesh of the cantilever beam, with a 3.3 million^2 matrix, I get the
> > > following error:
> > >
> > >  Mumps row pivot threshhold = 1e-06
> >
> >  Maybe you can change this to get MUMPS to pivot less aggressively.
> > Doing lots of pivoting will require a lot more memory. In theory, since
> > the matrix is SPD, it should not need to pivot at all.
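> >
> >  For example, setting the relative pivoting threshold to zero turns
> > numerical pivoting off entirely (a sketch; this is safe only if the
> > matrix really is SPD):
> >
> >      -mat_mumps_cntl_1 0.0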
> >
> > >  Mumps determinant = (0., 0.) * 2^0
> > > Linear solve did not converge due to DIVERGED_PCSETUP_FAILED iterations 0
> > >              PCSETUP_FAILED due to FACTOR_OUTMEMORY
> > > Norm of error inf. iterations 0
> > >
> > > It runs for more than an hour before aborting with this message. I am
> > > running it with this command:
> > >
> > > mpiexec -hostfile ./hostfile -np 48 ./ex12 -ksp_converged_reason
> > >
> > > My machines have 24 cpus and 125GB RAM. When I do "top" I see it
> > > correctly spans 48 processes on 2 nodes. The memory usage of each
> > > process is no more than 1-2GB. So I do not understand why it gives
> > > FACTOR_OUTMEMORY?
> > >
> > > The same matrix solves in under 5 minutes in Intel Pardiso using 24
> > > cpus on one host.
> >
> >  Mumps may be (likely is?) using a different matrix ordering than Intel
> > Pardiso. Unfortunately each of these packages has a different way of
> > asking for orderings and different orderings to choose from, so you will
> > need to look at the details for each package.
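> >
> >  On the MUMPS side the sequential ordering is selected with ICNTL(7) (a
> > sketch; the value codes are listed in the MUMPS manual, e.g. 5 selects
> > Metis):
> >
> >      -mat_mumps_icntl_7 5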
> >
> > > I thought maybe mumps thinks it is ill-conditioned? The model does
> > > converge in the iterative solver in 4000 iterations. I also tried
> > > running with these options per the FAQ on
> > >
> > > " How can I determine the condition number of a matrix? ".
> > >
> > > mpiexec -hostfile ./hostfile -np 48 ./ex12 -pc_type none -ksp_type
> > > gmres -ksp_monitor_singular_value -ksp_gmres_restart 1000
> > > -ksp_converged_reason -ksp_monitor_true_residual
> > >
> > > After 1337 iterations I cancelled it, and the output was:
> >
> >  You need to look at the condition number just before GMRES reaches the
> > restart. It has to start all over again at the restart. So what was the
> > estimated condition number at 999 iterations?
> >
> >  It could be that Intel Pardiso produced a low-quality solution if the
> > matrix is ill-conditioned. You can run with -ksp_type gmres -ksp_max_it 5
> > -ksp_monitor_true_residual and -pc_type lu to see how small the residuals
> > are after the "direct" solver.
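> >
> >  Put together (a sketch, assuming the same binary and hostfile as before):
> >
> >      mpiexec -hostfile ./hostfile -np 48 ./ex12 -pc_type lu \
> >          -ksp_type gmres -ksp_max_it 5 -ksp_monitor_true_residual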
> >
> >  Barry
> >
> >
> > >
> > > 1337 KSP preconditioned resid norm 5.647402411074e-02 true resid norm 5.647402411074e-02 ||r(i)||/||b|| 3.993316540960e-03
> > > 1337 KSP Residual norm 5.647402411074e-02 % max 1.070324243277e+05 min 1.220336631740e-01 max/min 8.770729448238e+05
> >
> >
>
>
>
>
>