[petsc-users] MPI Iterative solver crash on HPC

Matthew Knepley knepley at gmail.com
Wed Jan 16 07:12:13 CST 2019


On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users <
petsc-users at mcs.anl.gov> wrote:

>> The memory requested is an insane number. You may need to use 64-bit
>> integers.
>
> Thanks Mark, I reconfigured it to use 64-bit integers; however, in the process
> it says I can no longer use MUMPS and SuperLU, as they are not supported (the
> MUMPS webpage says it supports 64-bit integers). In any case, that does not by
> itself solve the problem.
>

SuperLU_dist supports 64-bit ints. Are you not running in parallel?
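For reference, a configure along these lines (just a sketch, based on the
configure options shown later in this thread, not something tested on your
machine) keeps SuperLU_dist while switching to 64-bit indices; MUMPS and
sequential SuperLU are dropped because your configure reported they are not
supported in that mode:

  ./configure PETSC_ARCH=linux-cumulus-debug \
      --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc \
      --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx \
      --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort \
      --with-scalar-type=complex --with-debugging=yes \
      --with-64-bit-indices=1 \
      --download-superlu_dist --download-parmetis --download-metis \
      --download-ptscotch --download-fblaslapack=1 --download-cmake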


> This time, it crashes at
>> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989
>> in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>> ierr      = PetscMalloc1(bi[pn]+1,&bj);
>> which allocates local portion of B^T*A.
>> You may also try to increase number of cores to reduce local matrix size.
>>
>
> So I increased the number of cores to 16 on one node and ran it with:
> mpiexec valgrind --tool=memcheck -q --num-callers=20
> --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs
> -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
> It crashed after reading in the matrix and before starting to solve. The
> error:
>
> [15]PETSC ERROR: [0]PETSC ERROR:
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [0]PETSC ERROR: [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [2]PETSC ERROR:
> ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [3]PETSC ERROR:
> ------------------------------------------------------------------------
> [4]PETSC ERROR:
> ------------------------------------------------------------------------
> [4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR:
> ------------------------------------------------------------------------
> [8]PETSC ERROR:
> ------------------------------------------------------------------------
> [12]PETSC ERROR:
> ------------------------------------------------------------------------
> [12]PETSC ERROR: [14]PETSC ERROR:
> ------------------------------------------------------------------------
> [14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on
> signal 9 (Killed).
>
> Now, I ran this with valgrind, as someone had previously suggested, and the 16
> files created all contain the same type of error:
>

Okay, it's possible that there are bugs in the MPI implementation. So:

1) Try using    -build_twosided allreduce    on this run (a sketch of the full
command is below).

2) Is it possible to get something that fails like this but that we can run
ourselves? None of our tests show this problem.
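For 1), that is your previous command with the option added, e.g. (a sketch
reusing the exact options from your run above):

  mpiexec valgrind --tool=memcheck -q --num-callers=20 \
      --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs \
      -pc_type gamg -mattransposematmult_via scalable \
      -build_twosided allreduce -ksp_monitor -log_view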

  Thanks,

     Matt


> ==25940== Invalid read of size 8
> ==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
> ==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
> ==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier
> (mpits.c:373)
> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
> ==25940==  Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd
> ==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
> ==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
> ==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
> ==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier
> (mpits.c:371)
> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
> ==25940==
>
>
> On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzhang at mcs.anl.gov> wrote:
>
>> Fande:
>>
>>> According to this PR
>>> https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff
>>>
>>> Should we set the scalable algorithm as default?
>>>
>> Sure, we can. But I feel we need to do more tests comparing the scalable and
>> non-scalable algorithms.
>> In theory, for small to medium matrices the non-scalable matmatmult()
>> algorithm allows more efficient data access. Andreas optimized the scalable
>> implementation; our non-scalable implementation might still have room for
>> further optimization.
>> Hong
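One simple way to compare the two on a given problem is to run the same job
twice, once with the current (non-scalable) default and once with the scalable
variant, and compare the MatTransposeMatMult timings in -log_view. A sketch,
reusing the solver options from this thread:

  mpiexec -n 16 ./solveCSys -ksp_type bcgs -pc_type gamg -log_view
  mpiexec -n 16 ./solveCSys -ksp_type bcgs -pc_type gamg -log_view \
      -mattransposematmult_via scalable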
>>
>>>
>>> On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> Add option '-mattransposematmult_via scalable'
>>>> Hong
>>>>
>>>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <
>>>> petsc-users at mcs.anl.gov> wrote:
>>>>
>>>>> I saw the following error message in your first email.
>>>>>
>>>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>>>> [0]PETSC ERROR: too large an object or bleeding by not properly
>>>>> [0]PETSC ERROR: destroying unneeded objects.
>>>>>
>>>>> The matrix is probably too large. You can try using more compute nodes,
>>>>> for example 8 nodes instead of 2, and see what happens.
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <
>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>
>>>>>> Using a larger problem set, with 2B non-zero elements and a matrix of
>>>>>> 25M x 25M, I get the following error:
>>>>>> [4]PETSC ERROR:
>>>>>> ------------------------------------------------------------------------
>>>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>>>>> probably memory access out of range
>>>>>> [4]PETSC ERROR: Try option -start_in_debugger or
>>>>>> -on_error_attach_debugger
>>>>>> [4]PETSC ERROR: or see
>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple
>>>>>> Mac OS X to find memory corruption errors
>>>>>> [4]PETSC ERROR: likely location of problem given in stack below
>>>>>> [4]PETSC ERROR: ---------------------  Stack Frames
>>>>>> ------------------------------------
>>>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>>>> available,
>>>>>> [4]PETSC ERROR:       INSTEAD the line number of the start of the
>>>>>> function
>>>>>> [4]PETSC ERROR:       is given.
>>>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747
>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>>>> [4]PETSC ERROR: [4]
>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256
>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156
>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950
>>>>>> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428
>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>>>> [4]PETSC ERROR: [4] PCSetUp line 894
>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>>>> [4]PETSC ERROR: [4] KSPSetUp line 304
>>>>>> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [4]PETSC ERROR: --------------------- Error Message
>>>>>> --------------------------------------------------------------
>>>>>> [4]PETSC ERROR: Signal received
>>>>>> [4]PETSC ERROR: See
>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>>> shooting.
>>>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by
>>>>>> vef002 Fri Jan 11 09:13:23 2019
>>>>>> [4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug
>>>>>> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
>>>>>> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
>>>>>> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
>>>>>> --download-parmetis --download-metis --download-ptscotch
>>>>>> --download-superlu_dist --download-mumps --with-scalar-type=complex
>>>>>> --with-debugging=yes --download-scalapack --download-superlu
>>>>>> --download-fblaslapack=1 --download-cmake
>>>>>> [4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>>>>> with errorcode 59.
>>>>>>
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [0]PETSC ERROR:
>>>>>> ------------------------------------------------------------------------
>>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
>>>>>> the batch system) has told this process to end
>>>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>>>> -on_error_attach_debugger
>>>>>> [0]PETSC ERROR: or see
>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>
>>>>>> Using Valgrind on only one of the valgrind files the following error
>>>>>> was written:
>>>>>>
>>>>>> ==9053== Invalid read of size 4
>>>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
>>>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
>>>>>> (matmatmult.c:790)
>>>>>> ==9053==    by 0x5D106F8:
>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
>>>>>> (mpimatmatmult.c:1186)
>>>>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>>>> ==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently)
>>>>>> free'd
>>>>>> ==9053==
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>>>
>>>>>>> Thank you Dave,
>>>>>>>
>>>>>>> I reconfigured PETSc with valgrind and debugging mode, and ran the
>>>>>>> code again with the following options:
>>>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
>>>>>>> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
>>>>>>> gamg -log_view
>>>>>>> (as on the petsc website you linked)
>>>>>>>
>>>>>>> It finished solving using the iterative solver, but the resulting
>>>>>>> valgrind.log.%p files (all 8, one per process) are all empty. It also took
>>>>>>> a whopping ~15 hours, for what used to take ~10-20 minutes. Maybe this is
>>>>>>> because of valgrind? I am not sure. Attached is the log_view output.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhem23 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> I am not sure exactly what is wrong, as the error changes
>>>>>>>>> slightly every time I run it (without changing the parameters).
>>>>>>>>>
>>>>>>>>
>>>>>>>> This likely implies that you have a memory error in your code (a
>>>>>>>> memory leak would not cause this behaviour).
>>>>>>>> I strongly suggest you make sure your code is free of memory errors.
>>>>>>>> You can do this using valgrind. See here
>>>>>>>>
>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>
>>>>>>>> for an explanation of how to use valgrind.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I have attached the first two run's errors and my code.
>>>>>>>>>
>>>>>>>>> Is there a memory leak somewhere? I have tried running it with
>>>>>>>>> -malloc_dump, but nothing gets printed out. However, when run with
>>>>>>>>> -log_view I see that a Viewer is created 4 times but destroyed only 3
>>>>>>>>> times. As far as I can tell, I have destroyed it wherever I no longer
>>>>>>>>> have any use for it, so I am not sure whether I am wrong about that.
>>>>>>>>> Could this be the reason it keeps crashing? It crashes as soon as it
>>>>>>>>> reads the matrix, before entering the solve phase (I have a print
>>>>>>>>> statement before solving starts that never prints).
>>>>>>>>>
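For reference, a minimal sketch of the usual binary-load pattern (the file name
and variable names are made up, not taken from the attached code; it assumes
petscvec.h or petscksp.h is already included, as in the existing program). The
PetscViewerDestroy() at the end is what makes the Viewer creation/destruction
counts in -log_view match:

  Vec            b;
  PetscViewer    viewer;
  PetscErrorCode ierr;
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"rhs.dat",FILE_MODE_READ,&viewer);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD,&b);CHKERRQ(ierr);
  ierr = VecSetFromOptions(b);CHKERRQ(ierr);
  ierr = VecLoad(b,viewer);CHKERRQ(ierr);           /* reads layout and values from the file */
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr); /* without this, one Viewer creation is never matched */

(A Viewer that is never destroyed is a leak, but by itself it would not
normally cause a crash like this.)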
>>>>>>>>> This is how I run it in the job script, on 2 nodes with 32 processes,
>>>>>>>>> using the cluster's OpenMPI:
>>>>>>>>>
>>>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg
>>>>>>>>> -ksp_converged_reason -ksp_monitor_true_residual -log_view
>>>>>>>>> -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>>>
>>>>>>>>> the matrix:
>>>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>>>> 25 947 279 x 25 947 279
>>>>>>>>>
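As a rough back-of-the-envelope estimate (assuming complex double precision and
AIJ storage, not exact PETSc internals): 2 122 821 366 nonzeros need about
2.1e9 x 16 bytes ~ 34 GB for the matrix values alone, plus roughly 8.5 GB of
32-bit column indices (about 17 GB with 64-bit indices), before GAMG forms any
coarse operators or the B^T*A product. Note also that this nonzero count is
already within about 1% of the 32-bit integer limit of 2 147 483 647, so any
product matrix with more nonzeros overflows 32-bit indices, which is presumably
why 64-bit integers were suggested.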
>>>>>>>>> Thanks and all the best
>>>>>>>>>
>>>>>>>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/