[petsc-users] MUMPS error and superLU error

Barry Smith bsmith at mcs.anl.gov
Mon Jun 22 12:43:13 CDT 2015


  There is nothing we can really do to help on the PETSc side. I do note from the output

 REDISTRIB: TOTAL DATA LOCAL/SENT         =   328575589  1437471711
 GLOBAL TIME FOR MATRIX DISTRIBUTION       =    206.6792
 ** Memory relaxation parameter ( ICNTL(14)  )            :        35
 ** Rank of processor needing largest memory in facto     :        30
 ** Space in MBYTES used by this processor for facto      :     21593
 ** Avg. Space in MBYTES per working proc during facto    :      7708

some processes (such as rank 30) require roughly three times as much memory as the average process, so better load balancing of the matrix during the factorization might help with memory usage.
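If the imbalance comes from the fill-reducing ordering, one untested thing to try (a sketch based on the run command from the message below; ICNTL(28)=2 requests MUMPS's parallel analysis and ICNTL(29)=2 selects ParMETIS for the ordering, which assumes PETSc/MUMPS were built with ParMETIS support):

```shell
# Variant of the original run command with MUMPS parallel analysis
# (ICNTL(28)=2) and ParMETIS ordering (ICNTL(29)=2); a different
# ordering can sometimes even out the per-rank factor memory.
aprun -n 240 -N 24 ./ex7 -f1 A100t -f2 B100t \
  -st_type sinvert -eps_target 0.01 \
  -st_ksp_type preonly -st_pc_type lu \
  -st_pc_factor_mat_solver_package mumps \
  -mat_mumps_icntl_28 2 -mat_mumps_icntl_29 2
```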

  Barry


> On Jun 22, 2015, at 10:57 AM, venkatesh g <venkateshgk.j at gmail.com> wrote:
> 
> Hi 
> I have restructured my matrix eigenvalue problem, as you suggested, by recasting the governing equations in a different form to see why B was singular. 
> 
> Now my matrix B is not singular. Both A and B are invertible in Ax=lambda Bx. 
> 
> I still receive an error from MUMPS because it uses a large amount of memory (the error log is attached).
> 
> I gave the command: aprun -n 240 -N 24 ./ex7 -f1 A100t -f2 B100t -st_type sinvert -eps_target 0.01 -st_ksp_type preonly -st_pc_type lu -st_pc_factor_mat_solver_package mumps -mat_mumps_cntl_1 1e-5 -mat_mumps_icntl_4 2 -evecs v100t
> 
> The matrix A is about 60% zeros.
> 
> Kindly help me.
> 
> Venkatesh 
> 
> On Sun, May 31, 2015 at 8:04 PM, Hong <hzhang at mcs.anl.gov> wrote:
> venkatesh,
> 
> As we discussed previously, even on smaller problems, 
> both mumps and superlu_dist failed, with MUMPS giving an "OOM" error in the numerical factorization.
> 
> You acknowledged that B is singular, which may require additional reformulation of your eigenvalue problem. The option '-st_type sinvert' likely uses B^{-1} (have you read the SLEPc manual?), which could be the source of the trouble. 
> 
> Please investigate your model and understand why B is singular; see whether there is a way to remove the null space before submitting a large simulation.
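As a tiny illustration of the kind of check Hong suggests (a sketch in NumPy on a made-up 3x3 stand-in, not the actual B): the smallest singular value tells you whether a matrix is singular, and the corresponding right singular vector spans its null space.

```python
import numpy as np

# Tiny stand-in for B: its last row is a copy of the first, so it is singular.
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [2.0, 1.0, 0.0]])

# The smallest singular value reveals (near-)singularity far more
# reliably than the determinant on large, badly scaled matrices.
sigma = np.linalg.svd(B, compute_uv=False)
print("smallest singular value:", sigma[-1])

# The right singular vector for that value spans the null space.
_, _, Vt = np.linalg.svd(B)
null_vec = Vt[-1]
print("residual ||B v||:", np.linalg.norm(B @ null_vec))
```

For a large sparse B this exact dense SVD is infeasible, but the same idea (look for near-zero singular values or eigenvalues of B on a reduced model) applies.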
> 
> Hong
> 
> 
> On Sun, May 31, 2015 at 8:36 AM, Dave May <dave.mayhem23 at gmail.com> wrote:
> It failed due to a lack of memory: "OOM" stands for "out of memory", so "OOM killer terminated your job" means you ran out of memory.
> 
> 
> 
> 
> On Sunday, 31 May 2015, venkatesh g <venkateshgk.j at gmail.com> wrote:
> Hi all,
> 
> I tried to run my Generalized Eigenproblem in 120 x 24 = 2880 cores. 
> The matrix size of A = 20GB and B = 5GB. 
> 
> It got killed after 7 hours of run time. Please see the attached MUMPS error log. Why does it fail? 
> I gave the command: 
> 
> aprun -n 240 -N 24 ./ex7 -f1 a110t -f2 b110t -st_type sinvert -eps_nev 1 -log_summary -st_ksp_type preonly -st_pc_type lu -st_pc_factor_mat_solver_package mumps -mat_mumps_cntl_1 1e-2
> 
> Kindly let me know.
> 
> cheers,
> Venkatesh
> 
> On Fri, May 29, 2015 at 10:46 PM, venkatesh g <venkateshgk.j at gmail.com> wrote:
> Hi Matt, users,
> 
> Thanks for the info. Do you also use PETSc and SLEPc with MUMPS? I run into a segmentation error if I increase my matrix size. 
> 
> Can you suggest other software for a parallel direct QR solver, since LU may not be suitable for a singular B matrix in Ax=lambda Bx? I am attaching the MUMPS log from the working run.
> 
> My matrix size here is around 47000x47000. If I am not mistaken, the memory usage per core is 272 MB.
> 
> Can you tell me whether that is correct, or whether that really is light on memory for this matrix?
> 
> Thanks
> cheers,
> Venkatesh
> 
> On Fri, May 29, 2015 at 4:00 PM, Matt Landreman <matt.landreman at gmail.com> wrote:
> Dear Venkatesh,
> 
> As you can see in the error log, you are now getting a segmentation fault, which is almost certainly a separate issue from the info(1)=-9 memory problem you had previously. Here is one idea which may or may not help. I've used mumps on the NERSC Edison system, and I found that I sometimes get segmentation faults when using the default Intel compiler. When I switched to the cray compiler the problem disappeared. So you could perhaps try a different compiler if one is available on your system.
> 
> Matt
> 
> On May 29, 2015 4:04 AM, "venkatesh g" <venkateshgk.j at gmail.com> wrote:
> Hi Matt,
> 
> I did what you suggested and read the manual section on the CNTL parameters. I solved the problem with CNTL(1)=1e-4, and it is working. 
> 
> But that was a test matrix of size 46000x46000. The actual matrix size is 108900x108900 and will increase in the future. 
> 
> I get a "memory allocation failed" error. The binary matrix A is 20 GB and B is 5 GB.
> 
> I submitted this on 240 processors with 4 GB RAM each, and also on 128 processors with 512 GB of total RAM.
> 
> In both cases it fails with the attached error, as if memory were insufficient. Yet a 90000x90000 case ran serially in MATLAB with <256 GB RAM.
> 
> Kindly let me know.
> 
> Venkatesh
> 
> On Tue, May 26, 2015 at 8:02 PM, Matt Landreman <matt.landreman at gmail.com> wrote:
> Hi Venkatesh,
> 
> I've struggled a bit with mumps memory allocation too.  I think the behavior of mumps is roughly the following. First, in the "analysis step", mumps computes a minimum memory required based on the structure of nonzeros in the matrix.  Then when it actually goes to factorize the matrix, if it ever encounters an element smaller than CNTL(1) (default=0.01) in the diagonal of a sub-matrix it is trying to factorize, it modifies the ordering to avoid the small pivot, which increases the fill-in (hence memory needed).  ICNTL(14) sets the margin allowed for this unanticipated fill-in.  Setting ICNTL(14)=200000 as in your email is not the solution, since this means mumps asks for a huge amount of memory at the start. Better would be to lower CNTL(1) or (I think) use static pivoting (CNTL(4)).  Read the section in the mumps manual about these CNTL parameters. I typically set CNTL(1)=1e-6, which eliminated all the INFO(1)=-9 errors for my problem, without having to modify ICNTL(14).
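The effect of ICNTL(14) can be seen with simple arithmetic (a sketch; the 10 GB baseline is a hypothetical analysis-phase estimate, and the MUMPS manual describes ICNTL(14) as a percentage of workspace relaxation, with 20 the usual default):

```python
def relaxed_memory_gb(analysis_estimate_gb, icntl14_percent):
    """MUMPS allocates roughly the analysis-phase estimate plus
    ICNTL(14) percent of extra headroom for unanticipated fill-in."""
    return analysis_estimate_gb * (1.0 + icntl14_percent / 100.0)

base = 10.0  # hypothetical per-process estimate from the analysis step, in GB

print(relaxed_memory_gb(base, 20))      # default ICNTL(14)=20  -> 12.0 GB
print(relaxed_memory_gb(base, 200000))  # ICNTL(14)=200000      -> 20010.0 GB
```

This is why ICNTL(14)=200000 makes MUMPS request an enormous allocation up front, independent of whether the extra fill-in ever materializes.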
> 
> Also, I recommend running with ICNTL(4)=3 to display diagnostics. Look for the line in the standard output that says "TOTAL     space in MBYTES for IC factorization".  This is the amount of memory that mumps is trying to allocate, and for the default ICNTL(14) it should be similar to MATLAB's requirement.
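Through PETSc, that diagnostic level is selected with the corresponding runtime option (a sketch reusing the earlier command; -mat_mumps_icntl_4 is the PETSc passthrough for MUMPS's ICNTL(4)):

```shell
# ICNTL(4)=3 makes MUMPS print its statistics, including the
# estimated factorization memory line mentioned above.
aprun -n 240 -N 24 ./ex7 -f1 a110t -f2 b110t \
  -st_type sinvert -eps_nev 1 \
  -st_ksp_type preonly -st_pc_type lu \
  -st_pc_factor_mat_solver_package mumps \
  -mat_mumps_icntl_4 3
```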
> 
> Hope this helps,
> -Matt Landreman
> University of Maryland
> 
> On Tue, May 26, 2015 at 10:03 AM, venkatesh g <venkateshgk.j at gmail.com> wrote:
> I posted a while ago in the MUMPS forums, but no one seems to reply.
> 
> I am solving a large generalized eigenvalue problem. 
> 
> I am getting the attached error after giving the command:
> 
> /cluster/share/venkatesh/petsc-3.5.3/linux-gnu/bin/mpiexec -np 64 -hosts compute-0-4,compute-0-6,compute-0-7,compute-0-8 ./ex7 -f1 a72t -f2 b72t -st_type sinvert -eps_nev 3 -eps_target 0.5 -st_ksp_type preonly -st_pc_type lu -st_pc_factor_mat_solver_package mumps -mat_mumps_icntl_14 200000
> 
> It is impossible to allocate so much memory per processor: it is asking for around 70 GB per processor. 
> 
> A serial job in MATLAB for the same matrices takes < 60GB. 
> 
> After trying SuperLU_DIST, I have attached its error log as well (segmentation fault).
> 
> Kindly help me. 
> 
> Venkatesh
> 
> <mumps_error_log.txt>



More information about the petsc-users mailing list