[petsc-users] PETSC errors from KSPSolve() with MUMPS

Hong hzhang at mcs.anl.gov
Thu Aug 28 14:13:47 CDT 2014


Evan,
Please comment out your own MUMPS parameters and run the code with the
default icntl and ival values. Does it still crash? If so, please send us the
entire error message. It is common to get a memory error in the numerical
factorization phase of MUMPS; I have rarely seen errors occur in the symbolic
phase.
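
For example, here is a minimal sketch of the setup with all MUMPS controls left at
their defaults (based on the code quoted below, with the MatMumpsSetIcntl() calls
simply omitted); individual ICNTL values can still be tried at run time with
options such as -mat_mumps_icntl_4 3 instead of recompiling:

KSPCreate(PETSC_COMM_WORLD, &ksp);
KSPSetOperators(ksp, A, A);
MatSetOption(A, MAT_SPD, PETSC_TRUE);             /* matrix is symmetric positive definite */
KSPSetType(ksp, KSPPREONLY);                      /* direct solve only, no Krylov iterations */
KSPGetPC(ksp, &pc);
PCSetType(pc, PCCHOLESKY);
PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);  /* Cholesky via MUMPS, default ICNTL/CNTL */
KSPSetFromOptions(ksp);                           /* allow command-line options */
KSPSolve(ksp, b, x);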

Hong

On Wed, Aug 27, 2014 at 4:58 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>    Ok
>
> [11]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>
> This message usually happens because either
>
> 1) the process ran out of memory or
> 2) the process took more time than the batch system allowed
>
> My guess is 1. I don’t know how MUMPS does its symbolic factorization, but my guess is that it may have something in it that is not scalable per node, thus causing it to run out of memory. Hong knows more about this and may have advice on how to proceed.
>
> Have you tried superlu_dist on the same problem?
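>
> For reference, switching the factorization to SuperLU_DIST is a small change; a minimal sketch, assuming PETSc was configured with --download-superlu_dist (note that SuperLU_DIST provides LU, not Cholesky, factorization):
>
> PCSetType(pc, PCLU);                                      /* SuperLU_DIST offers LU only */
> PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST);
>
> or equivalently at run time: -pc_type lu -pc_factor_mat_solver_package superlu_dist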
>
>   Barry
>
>
>
>
> On Aug 27, 2014, at 4:52 PM, Evan Um <evanum at gmail.com> wrote:
>
>> Dear Barry,
>>
>> Attached is the whole error message file. Thanks for your help.
>>
>> Evan
>>
>>
>> On Wed, Aug 27, 2014 at 2:44 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> > MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
>>
>>
>>   Please send ALL the output. In particular, since rank 11 seems to have choked, we need to see all the messages from [11] to see what it thinks has gone wrong.
>>
>>    Barry
>>
>> On Aug 27, 2014, at 4:27 PM, Evan Um <evanum at gmail.com> wrote:
>>
>> > Dear PETSC users,
>> >
>> > I am trying to solve a large problem (about 9,000,000 unknowns) with a large number of processes (about 400 processes and 1 TB of memory). I think this is a reasonable amount of resources for this problem because I was able to solve the same problem using serial MUMPS with 500 GB, although that of course took very long.
>> > The same code was parallelized with PETSc. However, the PETSc version suddenly crashes after KSPSolve() successfully calls MUMPS, as shown below. If the problem comes from MUMPS, I would expect MUMPS to produce an error report (ICNTL(4)=3), but no error report was generated. Has anyone had such an experience with PETSc+MUMPS? Any comments on troubleshooting would be appreciated. Thank you in advance for your help.
>> >
>> > Regards,
>> > Evan
>> >
>> > Codes:
>> >
>> > KSPCreate(PETSC_COMM_WORLD, &ksp);
>> > KSPSetOperators(ksp, A, A);
>> > KSPSetType(ksp, KSPPREONLY);
>> > KSPGetPC(ksp, &pc);
>> > MatSetOption(A, MAT_SPD, PETSC_TRUE);
>> > PCSetType(pc, PCCHOLESKY);
>> > PCFactorSetMatSolverPackage(pc, MATSOLVERMUMPS);
>> > PCFactorSetUpMatSolverPackage(pc);
>> > PCFactorGetMatrix(pc, &F);
>> > KSPSetType(ksp, KSPCG);
>> > MPI_Barrier(MPI_COMM_WORLD);
>> > icntl=29; ival=2; // ParMetis
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > icntl=4; ival=3; // Errors
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > icntl=23; ival=1500; // Max MUMPS working memory per process (MB)
>> > MatMumpsSetIcntl(F, icntl, ival);
>> > KSPSolve(ksp,b,x);
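>> >
>> > For what it is worth, I could also try reading back the MUMPS status after the solve (a sketch, assuming MatMumpsGetInfog() is available in this PETSc version and the factorization returns rather than being killed):
>> >
>> > PetscInt infog1, infog2;
>> > MatMumpsGetInfog(F, 1, &infog1); // INFOG(1): MUMPS return code, 0 on success
>> > MatMumpsGetInfog(F, 2, &infog2); // INFOG(2): additional error information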
>> >
>> >
>> >
>> > Errors:
>> >
>> > Entering DMUMPS driver with JOB, N, NZ =   1     9778426              0
>> >  DMUMPS 4.10.0
>> > L D L^T Solver for symmetric positive definite matrices
>> > Type of parallelism: Working host
>> >  ****** ANALYSIS STEP ********
>> > Using ParMETIS for parallel ordering.
>> > Structual symmetry is:100%
>> > --------------------------------------------------------------------------
>> > WARNING: A process refused to die!
>> > Host: n0000.voltaire0
>> > PID:  28131
>> > This process may still be running and/or consuming resources.
>> > --------------------------------------------------------------------------
>> > [n0000.voltaire0:28047] 1 more process has sent help message help-odls-default.txt / odls-default:could-not-kill
>> > [n0000.voltaire0:28047] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> > --------------------------------------------------------------------------
>> > MPI_ABORT was invoked on rank 11 in communicator MPI_COMM_WORLD
>> > with errorcode 59.
>> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> > You may or may not see output from other processes, depending on
>> > exactly when Open MPI kills them.
>> > --------------------------------------------------------------------------
>> > [1]PETSC ERROR: ------------------------------------------------------------------------
>> > [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>> > [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> > [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> > [1]PETSC ERROR: likely location of problem given in stack below
>> > [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> > [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> > [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>> > [1]PETSC ERROR:       is given.
>> > [1]PETSC ERROR: [1] MatCholeskyFactorSymbolic_MUMPS line 1076 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/impls/aij/mpi/mumps/mumps.c
>> > [1]PETSC ERROR: [1] MatCholeskyFactorSymbolic line 2995 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/mat/interface/matrix.c
>> > [1]PETSC ERROR: [1] PCSetUp_Cholesky line 88 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/pc/impls/factor/cholesky/cholesky.c
>> > [1]PETSC ERROR: [1] KSPSetUp line 219 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
>> > [1]PETSC ERROR: [1] KSPSolve line 381 /clusterfs/voltaire/home/software/source/petsc-3.5.0/src/ksp/ksp/interface/itfunc.c
>> > [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> > [1]PETSC ERROR: Signal received
>> > [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>> > [1]PETSC ERROR: Petsc Release Version 3.5.0, Jun, 30, 2014
>> > [1]PETSC ERROR: fetdem3dp on a arch-linux2-c-debug named n0000.voltaire0 by esum Wed Aug 27 13:48:51 2014
>> > [1]PETSC ERROR: Configure options --prefix=/clusterfs/voltaire/home/software/modules/petsc/3.5.0 --download-fblaslapack=1 --download-mumps=1 --download-parmetis=1 --download-scalapack --download-metis=1 --with-mpi-dir=/global/software/sl-6.x86_64/modules/gcc/4.4.7/openmpi/1.6.5-gcc/
>> > [1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>> > [5]PETSC ERROR: ------------------------------------------------------------------------
>>
>>
>> <slurm-504727.out>
>

