[petsc-users] MPI Iterative solver crash on HPC

Matthew Knepley knepley at gmail.com
Tue Jan 29 08:16:14 CST 2019


On Tue, Jan 29, 2019 at 9:07 AM Sal Am via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Thank you Zhang for that. I am a bit confused by the terminology maybe,
> but when you solved it on 16 nodes (each node 128GB) with 576 ranks, that
> should give 2048GB (~2TB) of memory in total, right? Does this mean MPI
> does not work for SuperLU_dist?
>

No, he is telling you that SuperLU is running out of memory in parallel
(there is memory overhead beyond what it takes on
a single node, and this exhausts the extra). Direct solvers are not memory
scalable, and you have hit the limit.


> Unfortunately I cannot run the code on one node, as our nodes are limited
> to 500GB memory, hence I was hoping I could utilise several nodes and more
> memory.
>
> If anyone else has a solution for an iterative solver, please do recommend
> one, as I am a bit stuck: the combinations I have tried so far did not
> work.
>

The first step is to reduce your problem to the important linear algebraic
contributions, and then look in the literature
for a solver that worked in that case. Reproducing it in PETSc should
hopefully be easy.

 Thanks,

    Matt


>
> On Fri, Jan 25, 2019 at 5:32 PM Zhang, Junchao <jczhang at mcs.anl.gov>
> wrote:
>
>> Hi, Sal Am,
>>  I did some tests with your matrix and vector. It is a complex matrix
>> with N=4.7M and nnz=417M. First, I tested on a machine with 36 cores and
>> 128GB of memory on each compute node. I tried a direct solver and an
>> iterative solver, but both failed. For example, with 36 ranks on one
>> compute node, I got
>>
>> [9]PETSC ERROR: [9] SuperLU_DIST:pzgssvx line 465
>> /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>> [9]PETSC ERROR: [9] MatLUFactorNumeric_SuperLU_DIST line 314
>> /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>
>>   With 16 nodes, 576 ranks. I got
>>
>> SUPERLU_MALLOC fails for GAstore->rnzval[] at line 240 in file
>> /blues/gpfs/home/jczhang/petsc/bdw-dbg-complex/externalpackages/git.superlu_dist/SRC/pzutil.c
>>
>>   Next, I moved to another single-node machine with 1.5TB of memory. It
>> did not fail this time. It ran overnight and is still doing superlu. Using
>> the top command, I found that at peak it consumed almost all of the
>> memory. In the stable period, with 36 ranks, each rank consumed about 20GB
>> of memory. When I changed to an iterative solver with -ksp_type bcgs
>> -pc_type gamg -mattransposematmult_via scalable, I did not hit the errors
>> seen on the smaller-memory machine, but the residual did not converge.
>>   So, I think the errors you met were simply out-of-memory errors, either
>> in superlu or in petsc. If you have machines with large memory, you can
>> try them. Otherwise, I will let other petsc developers suggest better
>> iterative solvers to you.
>>   Thanks.
>>
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Jan 23, 2019 at 2:52 AM Sal Am <tempohoper at gmail.com> wrote:
>>
>>> Sorry it took long; I had to see if I could shrink the problem files
>>> down from 50GB to something smaller (now ~10GB).
>>>
>>>> Can you compress your matrix and upload it to google drive, so we can
>>>> try to reproduce the error?
>>>>
>>>
>>> How I ran the problem: mpiexec valgrind --tool=memcheck
>>> --suppressions=$HOME/valgrind/valgrind-openmpi.supp -q --num-callers=20
>>> --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres
>>> -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1
>>> -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
>>>
>>> here is the link to matrix A and vector b:
>>> https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing
>>>
>>> I redid the problem (twice) by trying to solve a 1M finite-element
>>> problem, corresponding to N ~ 4M and 417M nnz matrix elements, on the
>>> login node, which has ~550GB of memory, but it failed. The first time it
>>> failed with a bus error, the second time it said killed. I have attached
>>> the valgrind files from both runs.
>>>
>>> OpenMPI is not my favorite. You need to use a suppressions file to get
>>>> rid of all of that noise. Here is one:
>>>>
>>>
>>> Thanks, I have been using it, but sometimes I still see the same amount
>>> of errors.
>>>
>>>
>>>
>>> On Fri, Jan 18, 2019 at 3:12 AM Zhang, Junchao <jczhang at mcs.anl.gov>
>>> wrote:
>>>
>>>> Usually when I meet a SEGV error, I will run it again with a parallel
>>>> debugger like DDT and wait for it to segfault, and then examine the stack
>>>> trace to see what is wrong.
>>>> Can you compress your matrix and upload it to google drive, so we can
>>>> try to reproduce the error?
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users <
>>>> petsc-users at mcs.anl.gov> wrote:
>>>>
>>>>> I did two runs, one with SuperLU_dist and one with bcgs using jacobi;
>>>>> attached are the valgrind reports from one random processor (out of
>>>>> the 128 files).
>>>>>
>>>>> DS = direct solver
>>>>> IS = iterative solver
>>>>>
>>>>> There are an awful lot of errors.
>>>>>
>>>>> How I initiated the two runs:
>>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>> --log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs
>>>>> -pc_type jacobi -mattransposematmult_via scalable -build_twosided allreduce
>>>>> -ksp_monitor -log_view
>>>>>
>>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>> --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres
>>>>> -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1
>>>>> -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
>>>>>
>>>>> Thank you
>>>>>
>>>>> On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley <knepley at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Jan 17, 2019 at 9:18 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>>>
>>>>>>> 1) Running out of memory
>>>>>>>>
>>>>>>>> 2) You passed an invalid array
>>>>>>>>
>>>>>>> I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script,
>>>>>>> i.e. using 300GB/node, a total of 1200GB of memory, using 4 nodes and
>>>>>>> 32 processors per node (128 processors in total).
>>>>>>> I am not sure what would constitute an invalid array or how I can
>>>>>>> check that. I am using the same procedure as when dealing with the
>>>>>>> smaller matrix, i.e. generate matrix A and vector b using FEM
>>>>>>> software, convert the matrix and vector to PETSc binary format with a
>>>>>>> python script, then read them into PETSc and solve.
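A minimal sketch of such a driver is shown below (this is not the poster's
solveCSys; the file names A.dat and b.dat are hypothetical and error
checking is omitted). It loads a PETSc binary matrix and right-hand side
and solves with whatever -ksp_type/-pc_type is chosen on the command line:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         b, x;
  KSP         ksp;
  PetscViewer viewer;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Read the matrix from a PETSc binary file. */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetType(A, MATAIJ);
  MatLoad(A, viewer);
  PetscViewerDestroy(&viewer);

  /* Read the right-hand side from a PETSc binary file. */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "b.dat", FILE_MODE_READ, &viewer);
  VecCreate(PETSC_COMM_WORLD, &b);
  VecLoad(b, viewer);
  PetscViewerDestroy(&viewer);

  VecDuplicate(b, &x);

  /* The solver and preconditioner are picked up from the command line
     (-ksp_type, -pc_type, ...) by KSPSetFromOptions. */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

For the complex-valued system discussed in this thread, PETSc has to be
configured with --with-scalar-type=complex, as in the configure line shown
further down.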
>>>>>>>
>>>>>>> Are you running with 64-bit ints here?
>>>>>>>>
>>>>>>> Yes, I have configured petsc with --with-64-bit-indices and
>>>>>>> debugging mode, which is what this was run with.
>>>>>>>
>>>>>>
>>>>>> It sounds like you have enough memory, but the fact that it runs for
>>>>>> smaller problems makes me suspicious. It
>>>>>> could still be a memory overwrite. Can you either
>>>>>>
>>>>>> a) Run under valgrind
>>>>>>
>>>>>> or
>>>>>>
>>>>>> b) Run under the debugger and get a stack trace
>>>>>>
>>>>>>  ?
>>>>>>
>>>>>>    Thanks,
>>>>>>
>>>>>>     Matt
>>>>>>
>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 1:59 PM Matthew Knepley <knepley at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu, Jan 17, 2019 at 8:16 AM Sal Am <tempohoper at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> SuperLU_dist supports 64-bit ints. Are you not running in parallel?
>>>>>>>>>>
>>>>>>>>> I will try that, although I think solving the real problem (later
>>>>>>>>> on, if I can get this to work) with 30 million finite elements might
>>>>>>>>> be a problem for SuperLU_dist, so it is better to get an iterative
>>>>>>>>> solver working first.
>>>>>>>>>
>>>>>>>>> 1) Try using    -build_twosided allreduce    on this run
>>>>>>>>>>
>>>>>>>>> How I ran it: mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>>>>>> --log-file=valgrind.log-osaii.%p ./solveCSys -malloc off -ksp_type bcgs
>>>>>>>>> -pc_type gamg -mattransposematmult_via scalable -build_twosided allreduce
>>>>>>>>> -ksp_monitor -log_view
>>>>>>>>> I have attached the full error output.
>>>>>>>>>
>>>>>>>>
>>>>>>>> You are getting an SEGV on MatSetValues(), so it's either
>>>>>>>>
>>>>>>>> 1) Running out of memory
>>>>>>>>
>>>>>>>> 2) You passed an invalid array
>>>>>>>>
>>>>>>>>
>>>>>>>>>> 2) Is it possible to get something that fails here but that we
>>>>>>>>>> can run? None of our tests show this problem.
>>>>>>>>>>
>>>>>>>>> I am not sure how I can do that, but I have added my code, which
>>>>>>>>> is quite short and should only read and solve the system. The
>>>>>>>>> problem arises for larger matrices; for example, the current test
>>>>>>>>> case has 6 million finite elements (~2B non-zeros and a 25M x 25M
>>>>>>>>> matrix).
>>>>>>>>>
>>>>>>>>
>>>>>>>> Are you running with 64-bit ints here?
>>>>>>>>
>>>>>>>>    Matt
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2019 at 1:12 PM Matthew Knepley <knepley at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users <
>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>>> The memory requested is an insane number. You may need to use 64
>>>>>>>>>>>> bit integers.
>>>>>>>>>>>
>>>>>>>>>>> Thanks Mark, I reconfigured it to use 64-bit integers; however,
>>>>>>>>>>> in the process it says I can no longer use MUMPS and SuperLU as
>>>>>>>>>>> they are not supported (I see on the MUMPS webpage that it
>>>>>>>>>>> supports 64-bit integers). However, it does not exactly solve the problem.
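A quick way to confirm that the reconfigured build really uses 64-bit
indices is to print sizeof(PetscInt); a minimal sketch (it should report 8
bytes for a --with-64-bit-indices build):

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);
  /* PetscInt is 8 bytes when PETSc was configured with --with-64-bit-indices. */
  PetscPrintf(PETSC_COMM_WORLD, "sizeof(PetscInt) = %d bytes\n",
              (int)sizeof(PetscInt));
  PetscFinalize();
  return 0;
}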
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> SuperLU_dist supports 64-bit ints. Are you not running in
>>>>>>>>>> parallel?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> This time, it crashes at
>>>>>>>>>>>> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ()
>>>>>>>>>>>> line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>>> ierr      = PetscMalloc1(bi[pn]+1,&bj);
>>>>>>>>>>>> which allocates the local portion of B^T*A.
>>>>>>>>>>>> You may also try to increase the number of cores to reduce the
>>>>>>>>>>>> local matrix size.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So I increased the number of cores to 16 on one node and ran it
>>>>>>>>>>> with:
>>>>>>>>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>>>>>>>> --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs
>>>>>>>>>>> -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
>>>>>>>>>>> It crashed after reading in the matrix and before starting to
>>>>>>>>>>> solve. The error:
>>>>>>>>>>>
>>>>>>>>>>> [15]PETSC ERROR: [0]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>>> [0]PETSC ERROR: [1]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [2]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [2]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>>> [3]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [4]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [8]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [12]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [12]PETSC ERROR: [14]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [14]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>> mpiexec noticed that process rank 10 with PID 0 on node r03n01
>>>>>>>>>>> exited on signal 9 (Killed).
>>>>>>>>>>>
>>>>>>>>>>> Now I was running this with valgrind as someone had previously
>>>>>>>>>>> suggested and the 16 files created all contain the same type of error:
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Okay, it's possible that there are bugs in the MPI implementation.
>>>>>>>>>> So
>>>>>>>>>>
>>>>>>>>>> 1) Try using    -build_twosided allreduce    on this run
>>>>>>>>>>
>>>>>>>>>> 2) Is it possible to get something that fails here but that we
>>>>>>>>>> can run? None of our tests show this problem.
>>>>>>>>>>
>>>>>>>>>>   Thanks,
>>>>>>>>>>
>>>>>>>>>>      Matt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> ==25940== Invalid read of size 8
>>>>>>>>>>> ==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
>>>>>>>>>>> ==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
>>>>>>>>>>> ==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier
>>>>>>>>>>> (mpits.c:373)
>>>>>>>>>>> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq
>>>>>>>>>>> (mpits.c:572)
>>>>>>>>>>> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
>>>>>>>>>>> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
>>>>>>>>>>> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
>>>>>>>>>>> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
>>>>>>>>>>> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
>>>>>>>>>>> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
>>>>>>>>>>> ==25940==  Address 0x19f807fc is 12 bytes inside a block of size
>>>>>>>>>>> 16 alloc'd
>>>>>>>>>>> ==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
>>>>>>>>>>> ==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
>>>>>>>>>>> ==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
>>>>>>>>>>> ==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier
>>>>>>>>>>> (mpits.c:371)
>>>>>>>>>>> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq
>>>>>>>>>>> (mpits.c:572)
>>>>>>>>>>> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
>>>>>>>>>>> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
>>>>>>>>>>> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
>>>>>>>>>>> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
>>>>>>>>>>> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
>>>>>>>>>>> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
>>>>>>>>>>> ==25940==
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzhang at mcs.anl.gov>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Fande:
>>>>>>>>>>>>
>>>>>>>>>>>>> According to this PR
>>>>>>>>>>>>> https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff
>>>>>>>>>>>>>
>>>>>>>>>>>>> Should we set the scalable algorithm as default?
>>>>>>>>>>>>>
>>>>>>>>>>>> Sure, we can. But I feel we need to do more tests to compare the
>>>>>>>>>>>> scalable and non-scalable algorithms.
>>>>>>>>>>>> In theory, for small to medium matrices, the non-scalable
>>>>>>>>>>>> matmatmult() algorithm enables more efficient data access.
>>>>>>>>>>>> Andreas optimized the scalable implementation. Our non-scalable
>>>>>>>>>>>> implementation might have room to be further optimized.
>>>>>>>>>>>> Hong
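For reference, a minimal sketch of requesting the scalable algorithm from
the code itself rather than from the command line (equivalent to passing
-mattransposematmult_via scalable; the rest of the driver is assumed to
load and solve the system as usual):

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Same effect as -mattransposematmult_via scalable on the command line;
     it must be set before the solver setup that performs the product. */
  PetscOptionsSetValue(NULL, "-mattransposematmult_via", "scalable");

  /* ... load or assemble the system and solve as usual ... */

  PetscFinalize();
  return 0;
}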
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <
>>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Add option '-mattransposematmult_via scalable'
>>>>>>>>>>>>>> Hong
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via
>>>>>>>>>>>>>> petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I saw the following error message in your first email.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [0]PETSC ERROR: Out of memory. This could be due to
>>>>>>>>>>>>>>> allocating
>>>>>>>>>>>>>>> [0]PETSC ERROR: too large an object or bleeding by not
>>>>>>>>>>>>>>> properly
>>>>>>>>>>>>>>> [0]PETSC ERROR: destroying unneeded objects.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Probably the matrix is too large. You can try with more
>>>>>>>>>>>>>>> compute nodes, for example, use 8 nodes instead of 2, and see what happens.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <
>>>>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Using a larger problem set with 2B non-zero elements and a
>>>>>>>>>>>>>>>> matrix of 25M x 25M I get the following error:
>>>>>>>>>>>>>>>> [4]PETSC ERROR:
>>>>>>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>>>>>>>>>>>> Violation, probably memory access out of range
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>>>>>>> [4]PETSC ERROR: or see
>>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux
>>>>>>>>>>>>>>>> and Apple Mac OS X to find memory corruption errors
>>>>>>>>>>>>>>>> [4]PETSC ERROR: likely location of problem given in stack
>>>>>>>>>>>>>>>> below
>>>>>>>>>>>>>>>> [4]PETSC ERROR: ---------------------  Stack Frames
>>>>>>>>>>>>>>>> ------------------------------------
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack
>>>>>>>>>>>>>>>> are not available,
>>>>>>>>>>>>>>>> [4]PETSC ERROR:       INSTEAD the line number of the start
>>>>>>>>>>>>>>>> of the function
>>>>>>>>>>>>>>>> [4]PETSC ERROR:       is given.
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line
>>>>>>>>>>>>>>>> 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4]
>>>>>>>>>>>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line
>>>>>>>>>>>>>>>> 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCSetUp line 894
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] KSPSetUp line 304
>>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>>>>>>> [4]PETSC ERROR: --------------------- Error Message
>>>>>>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Signal received
>>>>>>>>>>>>>>>> [4]PETSC ERROR: See
>>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html for
>>>>>>>>>>>>>>>> trouble shooting.
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>>>>>>>>>>>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named
>>>>>>>>>>>>>>>> r02g03 by vef002 Fri Jan 11 09:13:23 2019
>>>>>>>>>>>>>>>> [4]PETSC ERROR: Configure options
>>>>>>>>>>>>>>>> PETSC_ARCH=linux-cumulus-debug
>>>>>>>>>>>>>>>> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
>>>>>>>>>>>>>>>> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
>>>>>>>>>>>>>>>> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
>>>>>>>>>>>>>>>> --download-parmetis --download-metis --download-ptscotch
>>>>>>>>>>>>>>>> --download-superlu_dist --download-mumps --with-scalar-type=complex
>>>>>>>>>>>>>>>> --with-debugging=yes --download-scalapack --download-superlu
>>>>>>>>>>>>>>>> --download-fblaslapack=1 --download-cmake
>>>>>>>>>>>>>>>> [4]PETSC ERROR: #1 User provided function() line 0 in
>>>>>>>>>>>>>>>> unknown file
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>> MPI_ABORT was invoked on rank 4 in communicator
>>>>>>>>>>>>>>>> MPI_COMM_WORLD
>>>>>>>>>>>>>>>> with errorcode 59.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI
>>>>>>>>>>>>>>>> processes.
>>>>>>>>>>>>>>>> You may or may not see output from other processes,
>>>>>>>>>>>>>>>> depending on
>>>>>>>>>>>>>>>> exactly when Open MPI kills them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>> [0]PETSC ERROR:
>>>>>>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some
>>>>>>>>>>>>>>>> process (or the batch system) has told this process to end
>>>>>>>>>>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>>>>>>> [0]PETSC ERROR: or see
>>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking at just one of the valgrind log files, the
>>>>>>>>>>>>>>>> following error was reported:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ==9053== Invalid read of size 4
>>>>>>>>>>>>>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays
>>>>>>>>>>>>>>>> (aij.c:4445)
>>>>>>>>>>>>>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
>>>>>>>>>>>>>>>> (matmatmult.c:790)
>>>>>>>>>>>>>>>> ==9053==    by 0x5D106F8:
>>>>>>>>>>>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>>>>>>>>>>>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
>>>>>>>>>>>>>>>> (mpimatmatmult.c:1186)
>>>>>>>>>>>>>>>> ==9053==    by 0x5457C57: MatTransposeMatMult
>>>>>>>>>>>>>>>> (matrix.c:9984)
>>>>>>>>>>>>>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>>>>>>>>>>>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>>>>>>>>>>>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>>>>>>>>>>>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>>>>>>>>>>>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>>>>>>>>>>>>>> ==9053==  Address 0x8386997f4 is not stack'd, malloc'd or
>>>>>>>>>>>>>>>> (recently) free'd
>>>>>>>>>>>>>>>> ==9053==
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <
>>>>>>>>>>>>>>>> tempohoper at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you Dave,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I reconfigured PETSc with valgrind and debugging mode and
>>>>>>>>>>>>>>>>> ran the code again with the following options:
>>>>>>>>>>>>>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
>>>>>>>>>>>>>>>>> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
>>>>>>>>>>>>>>>>> gamg -log_view
>>>>>>>>>>>>>>>>> (as on the petsc website you linked)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It finished solving using the iterative solver, but the
>>>>>>>>>>>>>>>>> resulting valgrind.log.%p files (all 8, one per processor)
>>>>>>>>>>>>>>>>> are all empty. And it took a whopping ~15 hours for what
>>>>>>>>>>>>>>>>> used to take ~10-20 min. Maybe this is because of valgrind?
>>>>>>>>>>>>>>>>> I am not sure. Attached is the log_view.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <
>>>>>>>>>>>>>>>>> dave.mayhem23 at gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>>>>>>>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am not sure what exactly is wrong, as the error
>>>>>>>>>>>>>>>>>>> changes slightly every time I run it (without changing the parameters).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This likely implies that you have a memory error in your
>>>>>>>>>>>>>>>>>> code (a memory leak would not cause this behaviour).
>>>>>>>>>>>>>>>>>> I strongly suggest you make sure your code is free of
>>>>>>>>>>>>>>>>>> memory errors.
>>>>>>>>>>>>>>>>>> You can do this using valgrind. See here
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> for an explanation of how to use valgrind.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have attached the errors from the first two runs and my code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there a memory leak somewhere? I have tried running
>>>>>>>>>>>>>>>>>>> it with -malloc_dump, but nothing gets printed out; however, when run
>>>>>>>>>>>>>>>>>>> with -log_view I see that a Viewer is created 4 times but destroyed only
>>>>>>>>>>>>>>>>>>> 3 times. The way I see it, I have destroyed each viewer where I no longer
>>>>>>>>>>>>>>>>>>> have use for it, so I am not sure if I am wrong. Could this be the reason
>>>>>>>>>>>>>>>>>>> why it keeps crashing? It crashes as soon as it reads the matrix, before
>>>>>>>>>>>>>>>>>>> entering the solving stage (I have a print statement before solving
>>>>>>>>>>>>>>>>>>> starts that never prints).
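For reference, the pattern that keeps the created/destroyed counts for
viewers balanced in -log_view is one PetscViewerDestroy per
PetscViewerBinaryOpen; a minimal sketch (hypothetical helper name, error
checking omitted):

#include <petscmat.h>

/* Load a matrix from a PETSc binary file and destroy the viewer that was
   opened for the load. */
static PetscErrorCode LoadSystemMatrix(MPI_Comm comm, const char *file, Mat *A)
{
  PetscViewer viewer;

  PetscViewerBinaryOpen(comm, file, FILE_MODE_READ, &viewer);
  MatCreate(comm, A);
  MatLoad(*A, viewer);
  PetscViewerDestroy(&viewer);  /* pairs with the open above */
  return 0;
}

Note that one leftover viewer in the -log_view summary is often harmless
(for example, the viewer that -log_view itself prints with is only
destroyed at the very end), and an undestroyed viewer would not by itself
cause a crash during MatLoad.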
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> How I run it in the job script on 2 nodes with 32
>>>>>>>>>>>>>>>>>>> processors using the cluster's OpenMPI:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg
>>>>>>>>>>>>>>>>>>> -ksp_converged_reason -ksp_monitor_true_residual -log_view
>>>>>>>>>>>>>>>>>>> -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the matrix:
>>>>>>>>>>>>>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>>>>>>>>>>>>>> 25 947 279 x 25 947 279
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks and all the best
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>>> experiments lead.
>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>
>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>> experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>> experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>
>>>>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>