[petsc-users] MPI Iterative solver crash on HPC

Tue Jan 29 08:05:48 CST 2019

Thank you Zhang for that, I am a bit confused by the terminology maybe, but
when you solved it on 16nodes (each node 128GB) with 576ranks that should
give 2048GB (~2TB) of memory in total right? Does this mean the MPI does
not work for SuperLU_dist? Unfortunately I cannot run the code on one node,
as our nodes are limited to 500GB memory, hence I was hoping I could
utilise several nodes and more memory.

If anyone else has a solution for an iterative solver please do recommend
as I am a bit stuck since so far and the combinations I have tried did not
work.

On Fri, Jan 25, 2019 at 5:32 PM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:

> Hi, Sal Am,
>  I did some testes with your matrix and vector. It is a complex matrix
> with N=4.7M and nnz=417M. Firstly, I tested on a machine with 36 cores and
> 128GB memory on each compute node. I tried with direct solver and iterative
> solver but both failed. For example, with 36 ranks on one compute node, I
> got
>
> [9]PETSC ERROR: [9] SuperLU_DIST:pzgssvx line 465
> /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> [9]PETSC ERROR: [9] MatLUFactorNumeric_SuperLU_DIST line 314
> /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>
>   With 16 nodes, 576 ranks. I got
>
> SUPERLU_MALLOC fails for GAstore->rnzval[] at line 240 in file
> /blues/gpfs/home/jczhang/petsc/bdw-dbg-complex/externalpackages/git.superlu_dist/SRC/pzutil.c
>
>   Next, I moved to another single-node machine with 1.5TB memory. It did
> not fail this time. It ran overnight and is still doing superlu. Using the
> top command, I found at peak, it consumed almost all memory. At stable
> period, with 36 ranks, each rank consumed about 20GB memory. When I changed
> to iterative solvers with -ksp_type bcgs -pc_type gamg
> -mattransposematmult_via scalable, I did not meet errors seen on the
> smaller memory machine. But the residual did not converge.
>   So, I think the errors you met were simply out of memory error either in
> superlu or in petsc.  If you have machines with large memory, you can try
> it on. Otherwise, I let other petsc developers suggest better iterative
> solvers to you.
>   Thanks.
>
>
> --Junchao Zhang
>
>
> On Wed, Jan 23, 2019 at 2:52 AM Sal Am <tempohoper at gmail.com> wrote:
>
>> Sorry it took long had to see if I could shrink down the problem files
>> from 50GB to something smaller (now ~10GB).
>>
>>> Can you compress your matrix and upload it to google drive, so we can
>>> try to reproduce the error.
>>>
>>
>> How I ran the problem: mpiexec valgrind --tool=memcheck
>> --suppressions=$HOME/valgrind/valgrind-openmpi.supp -q --num-callers=20
>> --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres
>> -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1
>> -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
>>
>> here is the link to matrix A and vector b:
>> https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing
>>
>> I redid the problem (twice) by trying to solve a 1M finite elements
>> problem corresponding to ~ 4M n and 417M nnz matrix elements on the login
>> shell which has ~550GB mem, but it failed. First time it failed because of
>> bus error, second time it said killed. I have attached valgrind file from
>> both runs.
>>
>> OpenMPI is not my favorite. You need to use a suppressions file to get
>>> rid of all of that noise. Here is one:
>>>
>>
>> Thanks I have been using it, but sometimes I still see same amount of
>> errors.
>>
>>
>>
>> On Fri, Jan 18, 2019 at 3:12 AM Zhang, Junchao <jczhang at mcs.anl.gov>
>> wrote:
>>
>>> Usually when I meet a SEGV error, I will run it again with a parallel
>>> debugger like DDT and wait for it to segfault, and then examine the stack
>>> trace to see what is wrong.
>>> Can you compress your matrix and upload it to google drive, so we can
>>> try to reproduce the error.
>>> --Junchao Zhang
>>>
>>>
>>> On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> I did two runs, one with the SuperLU_dist and one with bcgs using
>>>> jacobi, attached are the results of one of the reports from valgrind on one
>>>> random processor (out of the 128 files).
>>>>
>>>> DS = direct solver
>>>> IS = iterative  solver
>>>>
>>>> There is an awful lot of errors.
>>>>
>>>> how I initiated the two runs:
>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>> --log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs
>>>> -pc_type jacobi -mattransposematmult_via scalable -build_twosided allreduce
>>>> -ksp_monitor -log_view
>>>>
>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>> --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres
>>>> -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1
>>>> -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
>>>>
>>>> Thank you
>>>>
>>>> On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu, Jan 17, 2019 at 9:18 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>>
>>>>>> 1) Running out of memory
>>>>>>>
>>>>>>> 2) You passed an invalid array
>>>>>>>
>>>>>> I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script,
>>>>>> i.e. using 300GB/node, a total of 1200GB memory, using 4 nodes and 32
>>>>>> processors per node (128 processors in total).
>>>>>> I am not sure what would constitute an invalid array or how I can
>>>>>> check that. I am using the same procedure as when dealing with the smaller
>>>>>> matrix. i.e. Generate matrix A and vector b using FEM software then convert
>>>>>> the matrix and vector using a python script ready for petsc. read in petsc
>>>>>> and calculate.
>>>>>>
>>>>>> Are you running with 64-bit ints here?
>>>>>>>
>>>>>> Yes I have it configured petsc with  --with-64-bit-indices and
>>>>>> debugging mode, which this was run on.
>>>>>>
>>>>>
>>>>> It sounds like you have enough memory, but the fact that is runs for
>>>>> smaller problems makes me suspicious. It
>>>>> could still be a memory overwrite. Can you either
>>>>>
>>>>> a) Run under valgrind
>>>>>
>>>>> or
>>>>>
>>>>> b) Run under the debugger and get a stack trace
>>>>>
>>>>>  ?
>>>>>
>>>>>    Thanks,
>>>>>
>>>>>     Matt
>>>>>
>>>>>
>>>>>> On Thu, Jan 17, 2019 at 1:59 PM Matthew Knepley <knepley at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 8:16 AM Sal Am <tempohoper at gmail.com> wrote:
>>>>>>>
>>>>>>>> SuperLU_dist supports 64-bit ints. Are you not running in parallel?
>>>>>>>>>
>>>>>>>> I will try that, although I think solving the real problem (later
>>>>>>>> on if I can get this to work) with 30 million finite elements might be a
>>>>>>>> problem for SuperLU_dist. so it is better to get an iterative solver to
>>>>>>>> work with first.
>>>>>>>>
>>>>>>>> 1) Try using    -build_twosided allreduce    on this run
>>>>>>>>>
>>>>>>>> How I ran it: mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>>>>> --log-file=valgrind.log-osaii.%p ./solveCSys -malloc off -ksp_type bcgs
>>>>>>>> -pc_type gamg -mattransposematmult_via scalable -build_twosided allreduce
>>>>>>>> -ksp_monitor -log_view
>>>>>>>> I have attached the full error output.
>>>>>>>>
>>>>>>>
>>>>>>> You are getting an SEGV on MatSetValues(), so its either
>>>>>>>
>>>>>>> 1) Running out of memory
>>>>>>>
>>>>>>> 2) You passed an invalid array
>>>>>>>
>>>>>>>
>>>>>>>> 2) Is it possible to get something that fails here but we can run.
>>>>>>>>> None of our tests show this problem.
>>>>>>>>>
>>>>>>>> I am not how I can do that, but i have added my code which is quite
>>>>>>>> short and should only read and solve the system, the problem arises at
>>>>>>>> larger matrices for example current test case has 6 million finite elements
>>>>>>>> (~2B non-zero numbers and 25M x 25M matrix).
>>>>>>>>
>>>>>>>
>>>>>>> Are you running with 64-bit ints here?
>>>>>>>
>>>>>>>    Matt
>>>>>>>
>>>>>>>
>>>>>>>> On Wed, Jan 16, 2019 at 1:12 PM Matthew Knepley <knepley at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users <
>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> The memory requested is an insane number. You may need to use 64
>>>>>>>>>>> bit integers.
>>>>>>>>>>
>>>>>>>>>> Thanks Mark, I reconfigured it to use 64bit, however in the
>>>>>>>>>> process it says I can no longer use MUMPS and SuperLU as they are not
>>>>>>>>>> supported (I see on MUMPS webpage it supports 64int). However it does not
>>>>>>>>>> exactly solve the problem.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SuperLU_dist supports 64-bit ints. Are you not running in parallel?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> This time, it crashes at
>>>>>>>>>>> [6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ()
>>>>>>>>>>> line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>> ierr      = PetscMalloc1(bi[pn]+1,&bj);
>>>>>>>>>>> which allocates local portion of B^T*A.
>>>>>>>>>>> You may also try to increase number of cores to reduce local
>>>>>>>>>>> matrix size.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So I increased the number of cores to 16 on one node and ran it
>>>>>>>>>> by :
>>>>>>>>>> mpiexec valgrind --tool=memcheck -q --num-callers=20
>>>>>>>>>> --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs
>>>>>>>>>> -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
>>>>>>>>>> It crashed after reading in the matrix and before starting to
>>>>>>>>>> solve. The error:
>>>>>>>>>>
>>>>>>>>>> [15]PETSC ERROR: [0]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>> [0]PETSC ERROR: [1]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [2]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [2]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>> [3]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [4]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [8]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [12]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [12]PETSC ERROR: [14]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [14]PETSC ERROR: Caught signal number 15 Terminate: Some process
>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpiexec noticed that process rank 10 with PID 0 on node r03n01
>>>>>>>>>> exited on signal 9 (Killed).
>>>>>>>>>>
>>>>>>>>>> Now I was running this with valgrind as someone had previously
>>>>>>>>>> suggested and the 16 files created all contain the same type of error:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Okay, its possible that there are bugs in the MPI implementation.
>>>>>>>>> So
>>>>>>>>>
>>>>>>>>> 1) Try using    -build_twosided allreduce    on this run
>>>>>>>>>
>>>>>>>>> 2) Is it possible to get something that fails here but we can run.
>>>>>>>>> None of our tests show this problem.
>>>>>>>>>
>>>>>>>>>   Thanks,
>>>>>>>>>
>>>>>>>>>      Matt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ==25940== Invalid read of size 8
>>>>>>>>>> ==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
>>>>>>>>>> ==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
>>>>>>>>>> ==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier
>>>>>>>>>> (mpits.c:373)
>>>>>>>>>> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq
>>>>>>>>>> (mpits.c:572)
>>>>>>>>>> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
>>>>>>>>>> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
>>>>>>>>>> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
>>>>>>>>>> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
>>>>>>>>>> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
>>>>>>>>>> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
>>>>>>>>>> ==25940==  Address 0x19f807fc is 12 bytes inside a block of size
>>>>>>>>>> 16 alloc'd
>>>>>>>>>> ==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
>>>>>>>>>> ==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
>>>>>>>>>> ==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
>>>>>>>>>> ==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier
>>>>>>>>>> (mpits.c:371)
>>>>>>>>>> ==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq
>>>>>>>>>> (mpits.c:572)
>>>>>>>>>> ==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
>>>>>>>>>> ==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
>>>>>>>>>> ==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
>>>>>>>>>> ==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
>>>>>>>>>> ==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
>>>>>>>>>> ==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
>>>>>>>>>> ==25940==
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzhang at mcs.anl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Fande:
>>>>>>>>>>>
>>>>>>>>>>>> According to this PR
>>>>>>>>>>>> https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff
>>>>>>>>>>>>
>>>>>>>>>>>> Should we set the scalable algorithm as default?
>>>>>>>>>>>>
>>>>>>>>>>> Sure, we can. But I feel we need do more tests to compare
>>>>>>>>>>> scalable and non-scalable algorithms.
>>>>>>>>>>> On theory, for small to medium matrices, non-scalable
>>>>>>>>>>> matmatmult() algorithm enables more efficient
>>>>>>>>>>> data accessing. Andreas optimized scalable implementation. Our
>>>>>>>>>>> non-scalable implementation might have room to be further optimized.
>>>>>>>>>>> Hong
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <
>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Add option '-mattransposematmult_via scalable'
>>>>>>>>>>>>> Hong
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users
>>>>>>>>>>>>> <petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw the following error message in your first email.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0]PETSC ERROR: Out of memory. This could be due to allocating
>>>>>>>>>>>>>> [0]PETSC ERROR: too large an object or bleeding by not
>>>>>>>>>>>>>> properly
>>>>>>>>>>>>>> [0]PETSC ERROR: destroying unneeded objects.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Probably the matrix is too large. You can try with more
>>>>>>>>>>>>>> compute nodes, for example, use 8 nodes instead of 2, and see what happens.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <
>>>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Using a larger problem set with 2B non-zero elements and a
>>>>>>>>>>>>>>> matrix of 25M x 25M I get the following error:
>>>>>>>>>>>>>>> [4]PETSC ERROR:
>>>>>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>>>>> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>>>>>>>>>>> Violation, probably memory access out of range
>>>>>>>>>>>>>>> [4]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>>>>>> [4]PETSC ERROR: or see
>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and
>>>>>>>>>>>>>>> Apple Mac OS X to find memory corruption errors
>>>>>>>>>>>>>>> [4]PETSC ERROR: likely location of problem given in stack
>>>>>>>>>>>>>>> below
>>>>>>>>>>>>>>> [4]PETSC ERROR: ---------------------  Stack Frames
>>>>>>>>>>>>>>> ------------------------------------
>>>>>>>>>>>>>>> [4]PETSC ERROR: Note: The EXACT line numbers in the stack
>>>>>>>>>>>>>>> are not available,
>>>>>>>>>>>>>>> [4]PETSC ERROR:       INSTEAD the line number of the start
>>>>>>>>>>>>>>> of the function
>>>>>>>>>>>>>>> [4]PETSC ERROR:       is given.
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line
>>>>>>>>>>>>>>> 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4]
>>>>>>>>>>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line
>>>>>>>>>>>>>>> 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] MatTransposeMatMult line 9950
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/mat/interface/matrix.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCSetUp_GAMG line 428
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] PCSetUp line 894
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: [4] KSPSetUp line 304
>>>>>>>>>>>>>>> /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>>>>>> [4]PETSC ERROR: --------------------- Error Message
>>>>>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>>>>>> [4]PETSC ERROR: Signal received
>>>>>>>>>>>>>>> [4]PETSC ERROR: See
>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html for
>>>>>>>>>>>>>>> trouble shooting.
>>>>>>>>>>>>>>> [4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
>>>>>>>>>>>>>>> [4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named
>>>>>>>>>>>>>>> r02g03 by vef002 Fri Jan 11 09:13:23 2019
>>>>>>>>>>>>>>> [4]PETSC ERROR: Configure options
>>>>>>>>>>>>>>> PETSC_ARCH=linux-cumulus-debug
>>>>>>>>>>>>>>> --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc
>>>>>>>>>>>>>>> --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort
>>>>>>>>>>>>>>> --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx
>>>>>>>>>>>>>>> --download-parmetis --download-metis --download-ptscotch
>>>>>>>>>>>>>>> --download-superlu_dist --download-mumps --with-scalar-type=complex
>>>>>>>>>>>>>>> --with-debugging=yes --download-scalapack --download-superlu
>>>>>>>>>>>>>>> --download-fblaslapack=1 --download-cmake
>>>>>>>>>>>>>>> [4]PETSC ERROR: #1 User provided function() line 0 in
>>>>>>>>>>>>>>> unknown file
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>> MPI_ABORT was invoked on rank 4 in communicator
>>>>>>>>>>>>>>> MPI_COMM_WORLD
>>>>>>>>>>>>>>> with errorcode 59.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI
>>>>>>>>>>>>>>> processes.
>>>>>>>>>>>>>>> You may or may not see output from other processes,
>>>>>>>>>>>>>>> depending on
>>>>>>>>>>>>>>> exactly when Open MPI kills them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>> [0]PETSC ERROR:
>>>>>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>>>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some
>>>>>>>>>>>>>>> process (or the batch system) has told this process to end
>>>>>>>>>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>>>>>> [0]PETSC ERROR: or see
>>>>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Using Valgrind on only one of the valgrind files the
>>>>>>>>>>>>>>> following error was written:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ==9053== Invalid read of size 4
>>>>>>>>>>>>>>> ==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays
>>>>>>>>>>>>>>> (aij.c:4445)
>>>>>>>>>>>>>>> ==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ
>>>>>>>>>>>>>>> (matmatmult.c:790)
>>>>>>>>>>>>>>> ==9053==    by 0x5D106F8:
>>>>>>>>>>>>>>> MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
>>>>>>>>>>>>>>> ==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ
>>>>>>>>>>>>>>> (mpimatmatmult.c:1186)
>>>>>>>>>>>>>>> ==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
>>>>>>>>>>>>>>> ==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
>>>>>>>>>>>>>>> ==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
>>>>>>>>>>>>>>> ==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
>>>>>>>>>>>>>>> ==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
>>>>>>>>>>>>>>> ==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
>>>>>>>>>>>>>>> ==9053==  Address 0x8386997f4 is not stack'd, malloc'd or
>>>>>>>>>>>>>>> (recently) free'd
>>>>>>>>>>>>>>> ==9053==
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you Dave,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I reconfigured PETSc with valgrind and debugging mode, I
>>>>>>>>>>>>>>>> ran the code again with the following options:
>>>>>>>>>>>>>>>> mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20
>>>>>>>>>>>>>>>> --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type
>>>>>>>>>>>>>>>> gamg -log_view
>>>>>>>>>>>>>>>> (as on the petsc website you linked)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It finished solving using the iterative solver, but the
>>>>>>>>>>>>>>>> resulting valgrind.log.%p files (all 8 corresponding to each processor) are
>>>>>>>>>>>>>>>> all empty. And it took a whooping ~15hours, for what used to take
>>>>>>>>>>>>>>>> ~10-20min. Maybe this is because of valgrind? I am not sure. Attached is
>>>>>>>>>>>>>>>> the log_view.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Jan 10, 2019 at 8:59 AM Dave May <
>>>>>>>>>>>>>>>> dave.mayhem23 at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <
>>>>>>>>>>>>>>>>> petsc-users at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am not sure what is exactly is wrong as the error
>>>>>>>>>>>>>>>>>> changes slightly every time I run it (without changing the parameters).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This likely implies that you have a memory error in your
>>>>>>>>>>>>>>>>> code (a memory leak would not cause this behaviour).
>>>>>>>>>>>>>>>>> I strongly suggest you make sure your code is free of
>>>>>>>>>>>>>>>>> memory errors.
>>>>>>>>>>>>>>>>> You can do this using valgrind. See here
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> for an explanation of how to use valgrind.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have attached the first two run's errors and my code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there a memory leak somewhere? I have tried running it
>>>>>>>>>>>>>>>>>> with -malloc_dump, but not getting anything printed out, however, when run
>>>>>>>>>>>>>>>>>> with -log_view I see that Viewer is created 4 times, but destroyed 3 times.
>>>>>>>>>>>>>>>>>> The way I see it, I have destroyed it where I see I no longer have use for
>>>>>>>>>>>>>>>>>> it so not sure if I am wrong. Could this be the reason why it keeps
>>>>>>>>>>>>>>>>>> crashing? It crashes as soon as it reads the matrix, before entering the
>>>>>>>>>>>>>>>>>> solving mode (I have a print statement before solving starts that never
>>>>>>>>>>>>>>>>>> prints).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> how I run it in the job script on 2 node with 32
>>>>>>>>>>>>>>>>>> processors using the clusters OpenMPI.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg
>>>>>>>>>>>>>>>>>> -ksp_converged_reason -ksp_monitor_true_residual -log_view
>>>>>>>>>>>>>>>>>> -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> the matrix:
>>>>>>>>>>>>>>>>>> 2 122 821 366 (non-zero elements)
>>>>>>>>>>>>>>>>>> 25 947 279 x 25 947 279
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks and all the best
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>>>> experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>>
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>> experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which their
>>>>> experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190129/6b0a1245/attachment-0001.html>