[petsc-users] MPI Iterative solver crash on HPC

Sal Am tempohoper at gmail.com
Wed Jan 16 02:51:07 CST 2019


>
> The memory requested is an insane number. You may need to use 64 bit
> integers.

Thanks Mark, I reconfigured PETSc to use 64-bit integers; however, in the
process configure reported that I can no longer use MUMPS and SuperLU, as
they are not supported with 64-bit indices (even though the MUMPS webpage
says it supports 64-bit integers). Unfortunately, that does not quite solve
the problem.
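
For reference, a sketch of the 64-bit reconfigure (assuming the standard
--with-64-bit-indices flag; the compilers and remaining options from my
earlier configure are elided here):

./configure PETSC_ARCH=linux-cumulus-debug --with-64-bit-indices=1
    --with-scalar-type=complex --with-debugging=yes ...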

This time it crashes at:
> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989
> in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
> ierr      = PetscMalloc1(bi[pn]+1,&bj);
> which allocates local portion of B^T*A.
> You may also try to increase number of cores to reduce local matrix size.
>
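
(As an aside, the scalable algorithm can presumably also be selected from
code instead of the command line; a minimal sketch in C, using
PetscOptionsSetValue() before the KSP is set up:

  /* select the scalable MatTransposeMatMult algorithm programmatically;
     equivalent to passing -mattransposematmult_via scalable */
  ierr = PetscOptionsSetValue(NULL, "-mattransposematmult_via", "scalable");CHKERRQ(ierr);

Below I simply pass it on the command line.)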

So I increased the number of cores to 16 on one node and ran it with:
mpiexec valgrind --tool=memcheck -q --num-callers=20
--log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs
-pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
It crashed after reading in the matrix and before starting to solve. The
error:

[15]PETSC ERROR: [0]PETSC ERROR:
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
batch system) has told this process to end
[0]PETSC ERROR: [1]PETSC ERROR:
------------------------------------------------------------------------
[2]PETSC ERROR:
------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
batch system) has told this process to end
[3]PETSC ERROR:
------------------------------------------------------------------------
[4]PETSC ERROR:
------------------------------------------------------------------------
[4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR:
------------------------------------------------------------------------
[8]PETSC ERROR:
------------------------------------------------------------------------
[12]PETSC ERROR:
------------------------------------------------------------------------
[12]PETSC ERROR: [14]PETSC ERROR:
------------------------------------------------------------------------
[14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
batch system) has told this process to end
--------------------------------------------------------------------------
mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on
signal 9 (Killed).

I was running this under valgrind, as was previously suggested, and all 16
log files it created contain the same type of error:

==25940== Invalid read of size 8
==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:373)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==  Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd
==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:371)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==
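
For context, the relevant part of the code follows the standard PETSc
load-and-solve pattern, roughly as in the sketch below (a simplified sketch,
not the exact source; the binary file name is a placeholder):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            b, x;
  KSP            ksp;
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Open the PETSc binary file holding the system ("system.dat" is a placeholder) */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);

  /* Load the matrix, then the right-hand side, from the same viewer */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
  ierr = VecSetFromOptions(b);CHKERRQ(ierr);
  ierr = VecLoad(b, viewer);CHKERRQ(ierr);   /* the call the valgrind trace points at */
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* Solve A x = b with whatever -ksp_type/-pc_type is given on the command line */
  ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}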


On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzhang at mcs.anl.gov> wrote:

> Fande:
>
>> According to this PR
>> https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff
>>
>> Should we set the scalable algorithm as default?
>>
> Sure, we can. But I feel we need to do more tests comparing the scalable
> and non-scalable algorithms.
> In theory, for small to medium matrices, the non-scalable matmatmult()
> algorithm enables more efficient data access. Andreas optimized the
> scalable implementation. Our non-scalable implementation might still have
> room to be further optimized.
> Hong
>
>>
>> On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <
>> petsc-users at mcs.anl.gov> wrote:
>>
>>> Add option '-mattransposematmult_via scalable'
>>> Hong
>>>
>>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> I saw the following error message in your first email.
>>>>
>>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>>> [0]PETSC ERROR: too large an object or bleeding by not properly
>>>> [0]PETSC ERROR: destroying unneeded objects.
>>>>
>>>> Probably the matrix is too large. You can try with more compute nodes,
>>>> for example, use 8 nodes instead of 2, and see what happens.
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <
>>>> petsc-users at mcs.anl.gov> wrote:
>>>>
>>>>> Using a larger problem set with 2B non-zero elements and a matrix of
>>>>> 25M x 25M I get the following error:
>>>>> [4]PETSC ERROR:
>>>>> ------------------------------------------------------------------------
>>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>>>> probably memory access out of range
>>>>> [4]PETSC ERROR: Try option -start_in_debugger or
>>>>> -on_error_attach_debugger
>>>>> [4]PETSC ERROR: or see
>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>>>>> OS X to find memory corruption errors
>>>>> [4]PETSC ERROR: likely location of problem given in stack below
>>>>> [4]PETSC ERROR: ---------------------  Stack Frames
>>>>> ------------------------------------
>>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>>> available,
>>>>> [4]PETSC ERROR:       INSTEAD the line number of the start of the
>>>>> function
>>>>> [4]PETSC ERROR:       is given.
>>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747
>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>>> [4]PETSC ERROR: [4]
>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256
>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156
>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950
>>>>> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428
>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>>> [4]PETSC ERROR: [4] PCSetUp line 894
>>>>> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>>> [4]PETSC ERROR: [4] KSPSetUp line 304
>>>>> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>>> [4]PETSC ERROR: --------------------- Error Message
>>>>> --------------------------------------------------------------
>>>>> [4]PETSC ERROR: Signal received
>>>>> [4]PETSC ERROR: See
>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>> shooting.
>>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by
>>>>> vef002 Fri Jan 11 09:13:23 2019
>>>>> [4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug
>>>>> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
>>>>> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
>>>>> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
>>>>> --download-parmetis --download-metis --download-ptscotch
>>>>> --download-superlu_dist --download-mumps --with-scalar-type=complex
>>>>> --with-debugging=yes --download-scalapack --download-superlu
>>>>> --download-fblaslapack=1 --download-cmake
>>>>> [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>>>> with errorcode 59.
>>>>>
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [0]PETSC ERROR:
>>>>> ------------------------------------------------------------------------
>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
>>>>> the batch system) has told this process to end
>>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>>> -on_error_attach_debugger
>>>>> [0]PETSC ERROR: or see
>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>
>>>>> Using Valgrind on only one of the valgrind files the following error
>>>>> was written:
>>>>>
>>>>> ==9053== Invalid read of size 4
>>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
>>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
>>>>> (matmatmult.c:790)
>>>>> ==9053==    by 0x5D106F8:
>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
>>>>> (mpimatmatmult.c:1186)
>>>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>>> ==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently)
>>>>> free'd
>>>>> ==9053==
>>>>>
>>>>>
>>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>>
>>>>>> Thank you Dave,
>>>>>>
>>>>>> I reconfigured PETSc with valgrind and debugging mode and ran the code
>>>>>> again with the following options:
>>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
>>>>>> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
>>>>>> gamg -log_view
>>>>>> (as on the petsc website you linked)
>>>>>>
>>>>>> It finished solving using the iterative solver, but the resulting
>>>>>> valgrind.log.%p files (all 8, one per process) are all empty. And it
>>>>>> took a whopping ~15 hours for what used to take ~10-20 min. Maybe this
>>>>>> is because of valgrind? I am not sure. Attached is the log_view.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>> I am not sure what exactly is wrong, as the error changes
>>>>>>>> slightly every time I run it (without changing the parameters).
>>>>>>>>
>>>>>>>
>>>>>>> This likely implies that you have a memory error in your code (a
>>>>>>> memory leak would not cause this behaviour).
>>>>>>> I strongly suggest you make sure your code is free of memory errors.
>>>>>>> You can do this using valgrind. See here
>>>>>>>
>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>
>>>>>>> for an explanation of how to use valgrind.
>>>>>>>
>>>>>>>
>>>>>>>> I have attached the first two run's errors and my code.
>>>>>>>>
>>>>>>>> Is there a memory leak somewhere? I have tried running it with
>>>>>>>> -malloc_dump, but nothing gets printed out. However, when run with
>>>>>>>> -log_view I see that a Viewer is created 4 times but destroyed only 3
>>>>>>>> times. As far as I can tell, I destroy each viewer once I no longer
>>>>>>>> have use for it, so I am not sure whether I am doing something wrong.
>>>>>>>> Could this be the reason why it keeps crashing? It crashes as soon as
>>>>>>>> it has read the matrix, before entering the solve phase (I have a
>>>>>>>> print statement just before the solve starts that never prints).
>>>>>>>>
>>>>>>>> This is how I run it in the job script, on 2 nodes with 32 processes,
>>>>>>>> using the cluster's OpenMPI:
>>>>>>>>
>>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg
>>>>>>>> -ksp_converged_reason -ksp_monitor_true_residual -log_view
>>>>>>>> -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>>
>>>>>>>> the matrix:
>>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>>> 25 947 279 x 25 947 279
>>>>>>>>
>>>>>>>> Thanks and all the best
>>>>>>>>
>>>>>>>