[petsc-users] MPI Iterative solver crash on HPC

Sal Am tempohoper at gmail.com
Fri Jan 11 07:44:22 CST 2019


Using a larger problem set, with ~2.1 billion non-zero elements in a 25M x
25M matrix, I get the following error:
[4]PETSC ERROR:
------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[4]PETSC ERROR: likely location of problem given in stack below
[4]PETSC ERROR: ---------------------  Stack Frames
------------------------------------
[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[4]PETSC ERROR:       INSTEAD the line number of the start of the function
[4]PETSC ERROR:       is given.
[4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
[4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable
line 1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156
/lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult line 9950
/lustre/home/vef002/petsc/src/mat/interface/matrix.c
[4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
[4]PETSC ERROR: [4] PCSetUp_GAMG line 428
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
[4]PETSC ERROR: [4] PCSetUp line 894
/lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
[4]PETSC ERROR: [4] KSPSetUp line 304
/lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
[4]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
[4]PETSC ERROR: Signal received
[4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002
Fri Jan 11 09:13:23 2019
[4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug
--with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
--with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
--with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
--download-parmetis --download-metis --download-ptscotch
--download-superlu_dist --download-mumps --with-scalar-type=complex
--with-debugging=yes --download-scalapack --download-superlu
--download-fblaslapack=1 --download-cmake
[4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
batch system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

Running under Valgrind, the following error was written to only one of the
valgrind log files:

==9053== Invalid read of size 4
==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
(matmatmult.c:790)
==9053==    by 0x5D106F8:
MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
(mpimatmatmult.c:1186)
==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
==9053==
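
Looking at the stack, the failure is triggered inside the preconditioner
setup (KSPSetUp -> PCSetUp_GAMG -> PCGAMGCoarsen_AGG -> MatTransposeMatMult)
rather than during my matrix assembly. For reference, here is a stripped-down
sketch of the solve path that follows the same route; the small tridiagonal
matrix is only a stand-in (my real solveCmplxLinearSys.cpp reads the
25M x 25M system from file instead, and error checking is omitted here):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, rstart, rend, n = 100;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Toy tridiagonal stand-in for the real matrix */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);   /* picks up -ksp_type bcgs -pc_type gamg */
  KSPSetUp(ksp);            /* GAMG coarsening (the failing MatTransposeMatMult) runs in here */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

So the MatTransposeMatMult in the trace is something GAMG does internally
while KSPSetUp builds the coarse grids, not a call I make directly.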


On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com> wrote:

> Thank you Dave,
>
> I reconfigured PETSc with Valgrind and debugging enabled, and ran the code
> again with the following options:
> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
> gamg -log_view
> (as on the petsc website you linked)
>
> It finished solving with the iterative solver, but the resulting
> valgrind.log.%p files (all 8, one per process) are empty. It also took a
> whopping ~15 hours, for what used to take ~10-20 min; maybe that is just the
> Valgrind overhead, I am not sure. Attached is the log_view output.
>
>
> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhem23 at gmail.com> wrote:
>
>>
>>
>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>> I am not sure what exactly is wrong, as the error changes slightly
>>> every time I run it (without changing the parameters).
>>>
>>
>> This likely implies that you have a memory error in your code (a memory
>> leak would not cause this behaviour).
>> I strongly suggest you make sure your code is free of memory errors.
>> You can do this using valgrind. See here
>>
>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>
>> for an explanation of how to use valgrind.
>>
>>
>>> I have attached the first two runs' errors and my code.
>>>
>>> Is there a memory leak somewhere? I have tried running it with -malloc_dump,
>>> but nothing gets printed out. However, with -log_view I see that a Viewer is
>>> created 4 times but destroyed only 3 times. As far as I can tell I destroy
>>> each viewer once I no longer need it, so I am not sure what I am missing.
>>> Could this be the reason it keeps crashing? It crashes as soon as it has read
>>> the matrix, before entering the solve phase (a print statement I placed just
>>> before the solve never prints).
>>>
>>> How I run it in the job script, on 2 nodes with 32 processes, using the
>>> cluster's OpenMPI:
>>>
>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason
>>> -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
>>> -ksp_monitor -malloc_log -ksp_view
>>>
>>> The matrix:
>>> 2 122 821 366 (non-zero elements)
>>> 25 947 279 x 25 947 279
>>>
>>> Thanks and all the best
>>>
>>