[petsc-users] MPI Iterative solver crash on HPC

Mark Adams mfadams at lbl.gov
Mon Jan 14 07:41:05 CST 2019


The memory requested is an insane number. You may need to use 64-bit
integers.
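
For what it's worth, the requested size 18446744066024411136 is 2^64 minus
roughly 7.7e9, i.e. a negative 64-bit value reinterpreted as an unsigned
size, which is consistent with an integer overflow in the symbolic product
rather than a genuine allocation request. With about 2.1e9 nonzeros the
matrix sits right at the 32-bit signed limit of 2 147 483 647, so
intermediate counts in MatTransposeMatMult can easily overflow the default
32-bit PetscInt.

Below is a minimal, untested sketch of a reconfigure with 64-bit indices. It
reuses the options from the error message above; the new PETSC_ARCH name is
arbitrary, and whether every downloaded package builds cleanly with
--with-64-bit-indices should be checked for your versions:

  ./configure PETSC_ARCH=linux-cumulus-debug-64idx --with-64-bit-indices=1 \
    --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc \
    --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort \
    --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx \
    --with-scalar-type=complex --with-debugging=yes \
    --download-parmetis --download-metis --download-ptscotch \
    --download-superlu_dist --download-mumps --download-scalapack \
    --download-superlu --download-fblaslapack=1 --download-cmake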

On Mon, Jan 14, 2019 at 8:06 AM Sal Am via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> I ran it by: mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
> --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs
> -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
> The error:
>
> [6]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [6]PETSC ERROR: Out of memory. This could be due to allocating
> [6]PETSC ERROR: too large an object or bleeding by not properly
> [6]PETSC ERROR: destroying unneeded objects.
> [6]PETSC ERROR: Memory allocated 0 Memory used by process 39398023168
> [6]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [6]PETSC ERROR: Memory requested 18446744066024411136
> [6]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [6]PETSC ERROR: Petsc Release Version 3.10.2, unknown
> [6]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by
> vef002 Mon Jan 14 08:54:45 2019
> [6]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug
> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
> --download-parmetis --download-metis --download-ptscotch
> --download-superlu_dist --download-mumps --with-scalar-type=complex
> --with-debugging=yes --download-scalapack --download-superlu
> --download-fblaslapack=1 --download-cmake
> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989
> in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #2 PetscMallocA() line 397 in
> /lustre/home/vef002/petsc/src/sys/memory/mal.c
> [6]PETSC ERROR: #3 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989
> in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #4 MatTransposeMatMult_MPIAIJ_MPIAIJ() line 1203 in
> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> [6]PETSC ERROR: #5 MatTransposeMatMult() line 9984 in
> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
> [6]PETSC ERROR: #6 PCGAMGCoarsen_AGG() line 882 in
> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
> [6]PETSC ERROR: #7 PCSetUp_GAMG() line 522 in
> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
> [6]PETSC ERROR: #8 PCSetUp() line 932 in
> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
> [6]PETSC ERROR: #9 KSPSetUp() line 391 in
> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
> [6]PETSC ERROR: #10 main() line 68 in
> /home/vef002/debugenv/tests/solveCmplxLinearSys.cpp
> [6]PETSC ERROR: PETSc Option Table entries:
> [6]PETSC ERROR: -ksp_monitor
> [6]PETSC ERROR: -ksp_type bcgs
> [6]PETSC ERROR: -log_view
> [6]PETSC ERROR: -malloc off
> [6]PETSC ERROR: -mattransposematmult_via scalable
> [6]PETSC ERROR: -pc_type gamg
> [6]PETSC ERROR: ----------------End of Error Message -------send entire
> error message to petsc-maint at mcs.anl.gov----------
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 6 in communicator MPI_COMM_WORLD
> with errorcode 55.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> The requested-memory figure seems astronomical though. This was run on a
> machine with 500 GB of memory; at my last check it was using about 30 GB
> per process, and I am not sure whether that increased suddenly. The matrix
> file is still the same 40 GB matrix:
> 2 122 821 366 (non-zero elements)
> 25 947 279 x 25 947 279
>
>
>
> On Fri, Jan 11, 2019 at 5:34 PM Zhang, Hong <hzhang at mcs.anl.gov> wrote:
>
>> Add option '-mattransposematmult_via scalable'
>> Hong
>>
>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>> I saw the following error message in your first email.
>>>
>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>> [0]PETSC ERROR: too large an object or bleeding by not properly
>>> [0]PETSC ERROR: destroying unneeded objects.
>>>
>>> Probably the matrix is too large. You can try with more compute nodes,
>>> for example, use 8 nodes instead of 2, and see what happens.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> Using a larger problem set, with 2B non-zero elements and a 25M x 25M
>>>> matrix, I get the following error:
>>>> [4]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>>> probably memory access out of range
>>>> [4]PETSC ERROR: Try option -start_in_debugger or
>>>> -on_error_attach_debugger
>>>> [4]PETSC ERROR: or see
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>>> OS X to find memory corruption errors
>>>> [4]PETSC ERROR: likely location of problem given in stack below
>>>> [4]PETSC ERROR: ---------------------  Stack Frames
>>>> ------------------------------------
>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>> available,
>>>> [4]PETSC ERROR:       INSTEAD the line number of the start of the
>>>> function
>>>> [4]PETSC ERROR:       is given.
>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747
>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>> [4]PETSC ERROR: [4]
>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256
>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156
>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950
>>>> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428
>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>> [4]PETSC ERROR: [4] PCSetUp line 894
>>>> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>> [4]PETSC ERROR: [4] KSPSetUp line 304
>>>> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>> [4]PETSC ERROR: --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [4]PETSC ERROR: Signal received
>>>> [4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
>>>> for trouble shooting.
>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by
>>>> vef002 Fri Jan 11 09:13:23 2019
>>>> [4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug
>>>> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
>>>> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
>>>> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
>>>> --download-parmetis --download-metis --download-ptscotch
>>>> --download-superlu_dist --download-mumps --with-scalar-type=complex
>>>> --with-debugging=yes --download-scalapack --download-superlu
>>>> --download-fblaslapack=1 --download-cmake
>>>> [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>>>>
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>>> with errorcode 59.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [0]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
>>>> batch system) has told this process to end
>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>> -on_error_attach_debugger
>>>> [0]PETSC ERROR: or see
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>
>>>> Only one of the valgrind log files contained an error; it reported the
>>>> following:
>>>>
>>>> ==9053== Invalid read of size 4
>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
>>>> (matmatmult.c:790)
>>>> ==9053==    by 0x5D106F8:
>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
>>>> (mpimatmatmult.c:1186)
>>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>> ==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently)
>>>> free'd
>>>> ==9053==
>>>>
>>>>
>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>
>>>>> Thank you Dave,
>>>>>
>>>>> I reconfigured PETSc with valgrind and debugging mode and ran the code
>>>>> again with the following options:
>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
>>>>> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
>>>>> gamg -log_view
>>>>> (as on the petsc website you linked)
>>>>>
>>>>> It finished solving with the iterative solver, but the resulting
>>>>> valgrind.log.%p files (one for each of the 8 processes) are all empty.
>>>>> It also took a whopping ~15 hours for what used to take ~10-20 minutes.
>>>>> Maybe this is because of valgrind? I am not sure. Attached is the log_view.
>>>>>
>>>>>
>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhem23 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> I am not sure what exactly is wrong, as the error changes slightly
>>>>>>> every time I run it (without changing the parameters).
>>>>>>>
>>>>>>
>>>>>> This likely implies that you have a memory error in your code (a
>>>>>> memory leak would not cause this behaviour).
>>>>>> I strongly suggest you make sure your code is free of memory errors.
>>>>>> You can do this using valgrind. See here
>>>>>>
>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>
>>>>>> for an explanation of how to use valgrind.
>>>>>>
>>>>>>
>>>>>>> I have attached the errors from the first two runs and my code.
>>>>>>>
>>>>>>> Is there a memory leak somewhere? I have tried running it with
>>>>>>> -malloc_dump, but nothing gets printed out. However, when run with
>>>>>>> -log_view I see that a Viewer is created 4 times but destroyed only 3
>>>>>>> times. As far as I can tell I destroy it wherever I no longer have any
>>>>>>> use for it, so I am not sure whether I am doing something wrong. Could
>>>>>>> this be the reason why it keeps crashing? It crashes as soon as it
>>>>>>> reads the matrix, before entering the solve phase (I have a print
>>>>>>> statement before the solve starts that never prints).
>>>>>>>
>>>>>>> This is how I run it in the job script, on 2 nodes with 32 processes,
>>>>>>> using the cluster's OpenMPI:
>>>>>>>
>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg
>>>>>>> -ksp_converged_reason -ksp_monitor_true_residual -log_view
>>>>>>> -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>
>>>>>>> The matrix:
>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>> 25 947 279 x 25 947 279
>>>>>>>
>>>>>>> Thanks and all the best
>>>>>>>
>>>>>>
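
On the Viewer question in the quoted thread (created 4 times, destroyed 3
times): an off-by-one in that count often comes from a viewer PETSc manages
itself (for example the one -log_view prints through), so it is usually
harmless and unrelated to the crash. The general rule is that every viewer
you create or open yourself should get a matching PetscViewerDestroy() once
you are done with it. A minimal sketch of the usual load pattern, assuming
the Mat A has already been created and the file name is only a placeholder:

  PetscViewer    viewer;
  PetscErrorCode ierr;
  /* open a binary viewer, read the matrix, then release the viewer */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);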