[petsc-users] MUMPS error and superLU error

Mon Jun 22 15:29:56 CDT 2015

Venkatesh,
You may also test superlu_dist, which may use less memory.
Hong

On Mon, Jun 22, 2015 at 12:43 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   There is nothing we can really do to help on the PETSc side. I do note
> from the output
>
>  REDISTRIB: TOTAL DATA LOCAL/SENT         =   328575589  1437471711
>  GLOBAL TIME FOR MATRIX DISTRIBUTION       =    206.6792
>  ** Memory relaxation parameter ( ICNTL(14)  )            :        35
>  ** Rank of processor needing largest memory in facto     :        30
>  ** Space in MBYTES used by this processor for facto      :     21593
>  ** Avg. Space in MBYTES per working proc during facto    :      7708
>
> some processes (such as rank 30) require nearly three times as much
> memory as the average process (21593 MB versus 7708 MB), so perhaps a
> better load balancing of the matrix during the factorization would help
> with memory usage.
>
>   Barry
>
>
> > On Jun 22, 2015, at 10:57 AM, venkatesh g
> > <venkateshgk.j at gmail.com> wrote:
> >
> > Hi
> > As you suggested, I have restructured my matrix eigenvalue problem by
> > rewriting the governing equations in a different form, to see why B is
> > singular.
> >
> > Now my matrix B is not singular. Both A and B are invertible in
> > Ax = lambda Bx.
> >
> > I still receive an error in MUMPS because it uses a large amount of
> > memory (the error log is attached).
> >
> > I gave the command:
> >
> > aprun -n 240 -N 24 ./ex7 -f1 A100t -f2 B100t -st_type sinvert
> > -eps_target 0.01 -st_ksp_type preonly -st_pc_type lu
> > -st_pc_factor_mat_solver_package mumps -mat_mumps_cntl_1 1e-5
> > -mat_mumps_icntl_4 2 -evecs v100t
> >
> > 60% of the entries of the matrix A are zero.
> >
> > Kindly help me.
> >
> > Venkatesh
> >
> > On Sun, May 31, 2015 at 8:04 PM, Hong <hzhang at mcs.anl.gov> wrote:
> > venkatesh,
> >
> > As we discussed previously, both MUMPS and superlu_dist failed even on
> > smaller problems, although MUMPS gave an "OOM" error in numerical
> > factorization.
> >
> > You acknowledged that B is singular, which may require an additional
> > reformulation of your eigenvalue problem. The option '-st_type sinvert'
> > likely uses B^{-1} (have you read the SLEPc manual?), which could be the
> > source of trouble.
> >
> > Please investigate your model and understand why B is singular, and
> > check whether there is a way to dump the null space, before submitting
> > a large simulation.
> >
> > Hong
> >
> >
> > On Sun, May 31, 2015 at 8:36 AM, Dave May
> > <dave.mayhem23 at gmail.com> wrote:
> > It failed due to a lack of memory. "OOM" stands for "out of memory".
> > "OOM killer terminated your job" means you ran out of memory.
> >
> > On Sunday, 31 May 2015, venkatesh g <venkateshgk.j at gmail.com> wrote:
> > Hi all,
> >
> > I tried to run my generalized eigenproblem on 120 x 24 = 2880 cores.
> > The matrix sizes are A = 20 GB and B = 5 GB.
> >
> > It got killed after 7 hours of run time. Please see the MUMPS error log.
> > Why does it fail?
> > I gave the command:
> >
> > aprun -n 240 -N 24 ./ex7 -f1 a110t -f2 b110t -st_type sinvert
> > -eps_nev 1 -log_summary -st_ksp_type preonly -st_pc_type lu
> > -st_pc_factor_mat_solver_package mumps -mat_mumps_cntl_1 1e-2
> > Kindly let me know.
> >
> > cheers,
> > Venkatesh
> >
> > On Fri, May 29, 2015 at 10:46 PM, venkatesh g
> > <venkateshgk.j at gmail.com> wrote:
> > Hi Matt, users,
> >
> > Thanks for the info. Do you also use PETSc and SLEPc with MUMPS? I get
> > a segmentation fault if I increase my matrix size.
> >
> > Can you suggest other software for a parallel direct QR solver, since
> > LU may not be good for a singular B matrix in Ax = lambda Bx? I am
> > attaching the MUMPS log of the working version.
> >
> > My matrix size here is around 47000x47000. If I am not wrong, the memory
> > usage per core is 272 MB.
> >
> > Can you tell me if I am wrong, or whether it really is light on memory
> > for this matrix?
> >
> > Thanks
> > cheers,
> > Venkatesh
> >
> > On Fri, May 29, 2015 at 4:00 PM, Matt Landreman
> > <matt.landreman at gmail.com> wrote:
> > Dear Venkatesh,
> >
> > As you can see in the error log, you are now getting a segmentation
> > fault, which is almost certainly a separate issue from the INFO(1)=-9
> > memory problem you had previously. Here is one idea which may or may not
> > help. I've used MUMPS on the NERSC Edison system, and I found that I
> > sometimes get segmentation faults when using the default Intel compiler.
> > When I switched to the Cray compiler the problem disappeared. So you
> > could perhaps try a different compiler if one is available on your
> > system.
> >
> > Matt
> >
> > On May 29, 2015 4:04 AM, "venkatesh g" <venkateshgk.j at gmail.com> wrote:
> > Hi Matt,
> >
> > I did what you suggested and read the manual section on the CNTL
> > parameters. I solved with CNTL(1)=1e-4, and it works.
> >
> > But that was a test matrix of size 46000x46000. The actual matrix size
> > is 108900x108900 and will increase in the future.
> >
> > For the actual matrix I get a memory allocation failure. The binary
> > matrix size of A is 20 GB and of B is 5 GB.
> >
> > I have now submitted this on 240 processors with 4 GB RAM each, and also
> > on 128 processors with 512 GB RAM in total.
> >
> > In both cases it fails with the following error, which indicates that
> > memory is not enough. But a 90000x90000 case ran serially in MATLAB with
> > <256 GB RAM.
> >
> > Kindly let me know.
> >
> > Venkatesh
> >
> > On Tue, May 26, 2015 at 8:02 PM, Matt Landreman
> > <matt.landreman at gmail.com> wrote:
> > Hi Venkatesh,
> >
> > I've struggled a bit with MUMPS memory allocation too. I think the
> > behavior of MUMPS is roughly the following. First, in the "analysis
> > step", MUMPS computes a minimum memory required based on the structure
> > of nonzeros in the matrix. Then, when it actually goes to factorize the
> > matrix, if it ever encounters an element smaller than CNTL(1)
> > (default=0.01) in the diagonal of a sub-matrix it is trying to
> > factorize, it modifies the ordering to avoid the small pivot, which
> > increases the fill-in (hence memory needed). ICNTL(14) sets the margin
> > allowed for this unanticipated fill-in. Setting ICNTL(14)=200000 as in
> > your email is not the solution, since this means MUMPS asks for a huge
> > amount of memory at the start. Better would be to lower CNTL(1) or (I
> > think) use static pivoting (CNTL(4)). Read the section in the MUMPS
> > manual about these CNTL parameters. I typically set CNTL(1)=1e-6, which
> > eliminated all the INFO(1)=-9 errors for my problem, without having to
> > modify ICNTL(14).
> >
> > Also, I recommend running with ICNTL(4)=3 to display diagnostics. Look
> > for the line in standard output that says
> > "TOTAL     space in MBYTES for IC factorization". This is the amount of
> > memory that MUMPS is trying to allocate, and for the default ICNTL(14),
> > it should be similar to MATLAB's need.
> >
> > Hope this helps,
> > -Matt Landreman
> > University of Maryland
> >
> > On Tue, May 26, 2015 at 10:03 AM, venkatesh g
> > <venkateshgk.j at gmail.com> wrote:
> > I posted a while ago in the MUMPS forums, but no one has replied.
> >
> > I am solving a large generalized eigenvalue problem.
> >
> > I am getting the attached error after giving the command:
> >
> > /cluster/share/venkatesh/petsc-3.5.3/linux-gnu/bin/mpiexec -np 64
> > -hosts compute-0-4,compute-0-6,compute-0-7,compute-0-8 ./ex7 -f1 a72t
> > -f2 b72t -st_type sinvert -eps_nev 3 -eps_target 0.5
> > -st_ksp_type preonly -st_pc_type lu
> > -st_pc_factor_mat_solver_package mumps -mat_mumps_icntl_14 200000
> >
> > It is impossible to allocate so much memory per processor; it is asking
> > for around 70 GB per processor.
> >
> > A serial job in MATLAB for the same matrices takes < 60 GB.
> >
> > I have also tried superlu_dist, and have attached the error from that
> > run as well (a segmentation fault).
> >
> > Kindly help me.
> >
> > Venkatesh
> >
> > <mumps_error_log.txt>
>
>
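
The memory figures quoted in this thread can be checked with a short
script. What follows is a minimal Python sketch and was not posted by any
of the participants. The helper names (parse_mumps_stats, find_stat,
relaxation_factor) are hypothetical, and the sketch assumes that ICNTL(14)
is read as a percentage increase applied to the working-space estimate
from the analysis phase, as the MUMPS user guide describes it.

```python
import re

# Diagnostic lines copied from the MUMPS output quoted earlier in the thread.
MUMPS_OUTPUT = """\
 ** Memory relaxation parameter ( ICNTL(14)  )            :        35
 ** Rank of processor needing largest memory in facto     :        30
 ** Space in MBYTES used by this processor for facto      :     21593
 ** Avg. Space in MBYTES per working proc during facto    :      7708
"""


def parse_mumps_stats(text):
    """Return {label: integer value} for every '** label : value' line."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*\*\*\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def find_stat(stats, fragment):
    """Look up a value by a fragment of its label (hypothetical helper)."""
    for label, value in stats.items():
        if fragment in label:
            return value
    raise KeyError(fragment)


def relaxation_factor(icntl14):
    """Factor applied to the estimated working space, assuming ICNTL(14)
    is a percentage increase over the analysis-phase estimate."""
    return 1.0 + icntl14 / 100.0


stats = parse_mumps_stats(MUMPS_OUTPUT)
worst_rank = find_stat(stats, "Rank of processor")
peak_mb = find_stat(stats, "used by this processor")
avg_mb = find_stat(stats, "Avg. Space")
logged_icntl14 = find_stat(stats, "ICNTL(14)")

# Ratio of the most loaded rank to the average rank during factorization.
imbalance = peak_mb / avg_mb

print(f"rank {worst_rank}: {peak_mb} MB peak, {avg_mb} MB average")
print(f"imbalance ratio: {imbalance:.2f}")
print(f"ICNTL(14)={logged_icntl14}: x{relaxation_factor(logged_icntl14):.2f}")
print(f"ICNTL(14)=200000: x{relaxation_factor(200000):.0f}")
```

Under these assumptions, the most loaded rank needs about 2.8 times the
average, and ICNTL(14)=200000 scales the estimate by a factor of 2001,
against 1.35 for the value of 35 reported in the log.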