[petsc-users] MPI Iterative solver crash on HPC

Zhang, Junchao jczhang at mcs.anl.gov
Fri Jan 25 11:32:10 CST 2019


Hi, Sal Am,
 I did some tests with your matrix and vector. It is a complex matrix with N=4.7M and nnz=417M. First, I tested on a machine with 36 cores and 128GB of memory on each compute node. I tried both a direct solver and an iterative solver, but both failed. For example, with 36 ranks on one compute node, I got
[9]PETSC ERROR: [9] SuperLU_DIST:pzgssvx line 465 /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[9]PETSC ERROR: [9] MatLUFactorNumeric_SuperLU_DIST line 314 /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
  With 16 nodes (576 ranks), I got
SUPERLU_MALLOC fails for GAstore->rnzval[] at line 240 in file /blues/gpfs/home/jczhang/petsc/bdw-dbg-complex/externalpackages/git.superlu_dist/SRC/pzutil.c

  Next, I moved to another single-node machine with 1.5TB of memory. It did not fail this time; it ran overnight and is still doing the SuperLU factorization. Using the top command, I found that at peak it consumed almost all of the memory. During the stable period, with 36 ranks, each rank consumed about 20GB of memory. When I changed to an iterative solver with -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable, I did not hit the errors seen on the smaller-memory machine, but the residual did not converge.
  So I think the errors you met were simply out-of-memory errors, either in SuperLU_DIST or in PETSc. If you have machines with large memory, you can try them. Otherwise, I will let other PETSc developers suggest better iterative solvers to you.
  Thanks.

--Junchao Zhang
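
A rough back-of-the-envelope check of the out-of-memory conclusion above (assuming complex double-precision values at 16 bytes each and 64-bit column indices at 8 bytes each; row pointers and the MPI off-diagonal structure are left out):

  storage for A in AIJ format  ~ 417e6 nonzeros x (16 + 8) bytes ~ 10 GB
  observed peak                ~ 20 GB/rank x 36 ranks           ~ 720 GB

So the original matrix alone is already sizable, and the GAMG setup products (e.g. B^T*A) and especially the SuperLU_DIST LU factors, which fill in far beyond the original nonzeros, can plausibly account for the rest.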


On Wed, Jan 23, 2019 at 2:52 AM Sal Am <tempohoper at gmail.com> wrote:
Sorry it took so long; I had to see if I could shrink the problem files from 50GB down to something smaller (now ~10GB).
Can you compress your matrix and upload it to Google Drive, so we can try to reproduce the error?

How I ran the problem: mpiexec valgrind --tool=memcheck --suppressions=$HOME/valgrind/valgrind-openmpi.supp -q --num-callers=20 --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged

here is the link to matrix A and vector b: https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing

I redid the problem (twice), trying to solve a 1M-finite-element problem corresponding to a matrix with N ~ 4M and 417M nnz, on the login node (which has ~550GB of memory), but it failed. The first time it failed with a bus error; the second time it said killed. I have attached the valgrind files from both runs.

OpenMPI is not my favorite. You need to use a suppressions file to get rid of all of that noise. Here is one:

Thanks, I have been using it, but sometimes I still see the same amount of errors.



On Fri, Jan 18, 2019 at 3:12 AM Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
Usually when I meet a SEGV error, I will run it again with a parallel debugger like DDT and wait for it to segfault, and then examine the stack trace to see what is wrong.
Can you compress your matrix and upload it to Google Drive, so we can try to reproduce the error?
--Junchao Zhang


On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users <petsc-users at mcs.anl.gov> wrote:
I did two runs, one with SuperLU_dist and one with bcgs using jacobi; attached are the valgrind reports from one random processor (out of the 128 files).

DS = direct solver
IS = iterative solver

There are an awful lot of errors.

How I initiated the two runs:
mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type jacobi -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor -log_view

mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged

Thank you

On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley <knepley at gmail.com> wrote:
On Thu, Jan 17, 2019 at 9:18 AM Sal Am <tempohoper at gmail.com> wrote:
1) Running out of memory

2) You passed an invalid array
I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script, i.e. 300GB per node and 1200GB of memory in total, with 4 nodes and 32 processors per node (128 processors in total).
I am not sure what would constitute an invalid array or how I can check that. I am using the same procedure as when dealing with the smaller matrix, i.e. generate matrix A and vector b using the FEM software, convert the matrix and vector to PETSc format using a Python script, then read them in PETSc and solve.
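
For reference, the read-and-solve step amounts to roughly the following (a minimal sketch, not the actual attached source; the file names A.dat and b.dat are illustrative, and the binary files are assumed to have been written in PETSc's binary format by the conversion script):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            b, x;
  KSP            ksp;
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* Load the matrix written by the conversion script */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* Load the right-hand side */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "b.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
  ierr = VecSetFromOptions(b);CHKERRQ(ierr);
  ierr = VecLoad(b, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = VecDuplicate(b, &x);CHKERRQ(ierr);

  /* Solver and preconditioner are picked at run time, e.g. -ksp_type bcgs -pc_type gamg */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}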

Are you running with 64-bit ints here?
Yes, I have configured PETSc with --with-64-bit-indices and debugging mode, which this was run with.

It sounds like you have enough memory, but the fact that it runs for smaller problems makes me suspicious. It
could still be a memory overwrite. Can you either

a) Run under valgrind

or

b) Run under the debugger and get a stack trace

 ?

   Thanks,

    Matt

On Thu, Jan 17, 2019 at 1:59 PM Matthew Knepley <knepley at gmail.com> wrote:
On Thu, Jan 17, 2019 at 8:16 AM Sal Am <tempohoper at gmail.com> wrote:
SuperLU_dist supports 64-bit ints. Are you not running in parallel?
I will try that, although I think solving the real problem (later on, if I can get this to work) with 30 million finite elements might be a problem for SuperLU_dist, so it is better to get an iterative solver to work first.

1) Try using    -build_twosided allreduce    on this run
How I ran it: mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-osaii.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor -log_view
I have attached the full error output.

You are getting an SEGV on MatSetValues(), so it's either

1) Running out of memory

2) You passed an invalid array

2) Is it possible to get something that fails here but that we can run? None of our tests show this problem.
I am not sure how I can do that, but I have added my code, which is quite short and should only read and solve the system. The problem arises for larger matrices; for example, the current test case has 6 million finite elements (a ~25M x 25M matrix with ~2B non-zeros).

Are you running with 64-bit ints here?

   Matt

On Wed, Jan 16, 2019 at 1:12 PM Matthew Knepley <knepley at gmail.com> wrote:
On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users <petsc-users at mcs.anl.gov> wrote:
The memory requested is an insane number. You may need to use 64 bit integers.
Thanks Mark, I reconfigured it to use 64-bit integers; however, in the process it says I can no longer use MUMPS and SuperLU, as they are not supported (I see on the MUMPS webpage that it supports 64-bit ints). However, it does not exactly solve the problem.

SuperLU_dist supports 64-bit ints. Are you not running in parallel?

This time, it crashes at
[6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
ierr      = PetscMalloc1(bi[pn]+1,&bj);
which allocates the local portion of B^T*A.
You may also try to increase the number of cores to reduce the local matrix size.

So I increased the number of cores to 16 on one node and ran it by:
mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
It crashed after reading in the matrix and before starting to solve. The error:

[15]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[0]PETSC ERROR: [1]PETSC ERROR: ------------------------------------------------------------------------
[2]PETSC ERROR: ------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[3]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR: ------------------------------------------------------------------------
[8]PETSC ERROR: ------------------------------------------------------------------------
[12]PETSC ERROR: ------------------------------------------------------------------------
[12]PETSC ERROR: [14]PETSC ERROR: ------------------------------------------------------------------------
[14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
--------------------------------------------------------------------------
mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on signal 9 (Killed).

Now I was running this with valgrind, as someone had previously suggested, and the 16 files created all contain the same type of error:

Okay, it's possible that there are bugs in the MPI implementation. So

1) Try using    -build_twosided allreduce    on this run

2) Is it possible to get something that fails here but that we can run? None of our tests show this problem.

  Thanks,

     Matt

==25940== Invalid read of size 8
==25940==    at 0x5103326: PetscCheckPointer (checkptr.c:81)
==25940==    by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
==25940==    by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:373)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==  Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd
==25940==    at 0x4C2A603: memalign (vg_replace_malloc.c:899)
==25940==    by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
==25940==    by 0x4FD23E7: PetscMallocA (mal.c:397)
==25940==    by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:371)
==25940==    by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==    by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==    by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==    by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==    by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==    by 0x52E0BAB: VecLoad (vector.c:933)
==25940==    by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==


On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzhang at mcs.anl.gov> wrote:
Fande:
According to this PR https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff

Should we set the scalable algorithm as default?
Sure, we can. But I feel we need to do more tests to compare the scalable and non-scalable algorithms.
In theory, for small to medium matrices, the non-scalable matmatmult() algorithm enables more efficient
data access. Andreas optimized the scalable implementation; our non-scalable implementation might have room for further optimization.
Hong

On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <petsc-users at mcs.anl.gov> wrote:
Add option '-mattransposematmult_via scalable'
Hong
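
As an aside, the same choice can also be made from the application code early in the program, before the solve, rather than on the command line; a minimal sketch (inside an already initialized PETSc program, using the default options database):

  /* equivalent to passing -mattransposematmult_via scalable on the command line */
  ierr = PetscOptionsSetValue(NULL, "-mattransposematmult_via", "scalable");CHKERRQ(ierr);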

On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <petsc-users at mcs.anl.gov> wrote:
I saw the following error message in your first email.
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
Probably the matrix is too large. You can try with more compute nodes, for example, use 8 nodes instead of 2, and see what happens.

--Junchao Zhang


On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <petsc-users at mcs.anl.gov> wrote:
Using a larger problem set, with 2B non-zero elements and a 25M x 25M matrix, I get the following error:
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[4]PETSC ERROR: likely location of problem given in stack below
[4]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[4]PETSC ERROR:       INSTEAD the line number of the start of the function
[4]PETSC ERROR:       is given.
[4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
[4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult line 9950 /lustre/home/vef002/petsc/src/mat/interface/matrix.c
[4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
[4]PETSC ERROR: [4] PCSetUp_GAMG line 428 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
[4]PETSC ERROR: [4] PCSetUp line 894 /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
[4]PETSC ERROR: [4] KSPSetUp line 304 /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
[4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[4]PETSC ERROR: Signal received
[4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri Jan 11 09:13:23 2019
[4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx --download-parmetis --download-metis --download-ptscotch --download-superlu_dist --download-mumps --with-scalar-type=complex --with-debugging=yes --download-scalapack --download-superlu --download-fblaslapack=1 --download-cmake
[4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

Looking at just one of the valgrind log files, the following error was written:

==9053== Invalid read of size 4
==9053==    at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
==9053==    by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)
==9053==    by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)
==9053==    by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:1186)
==9053==    by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
==9053==    by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
==9053==    by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
==9053==    by 0x6592AA0: PCSetUp (precon.c:932)
==9053==    by 0x66B1267: KSPSetUp (itfunc.c:391)
==9053==    by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
==9053==


On Fri, Jan 11, 2019 at 8:41 AM Sal Am <tempohoper at gmail.com> wrote:
Thank you Dave,

I reconfigured PETSc with valgrind and debugging mode, and ran the code again with the following options:
mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -log_view
(as on the petsc website you linked)

It finished solving using the iterative solver, but the resulting valgrind.log.%p files (all 8, one per processor) are all empty. And it took a whopping ~15 hours for what used to take ~10-20 min. Maybe this is because of valgrind? I am not sure. Attached is the log_view output.


On Thu, Jan 10, 2019 at 8:59 AM Dave May <dave.mayhem23 at gmail.com> wrote:


On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <petsc-users at mcs.anl.gov> wrote:
I am not sure what exactly is wrong, as the error changes slightly every time I run it (without changing the parameters).

This likely implies that you have a memory error in your code (a memory leak would not cause this behaviour).
I strongly suggest you make sure your code is free of memory errors.
You can do this using valgrind. See here

https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

for an explanation of how to use valgrind.

I have attached the errors from the first two runs and my code.

Is there a memory leak somewhere? I have tried running it with -malloc_dump but am not getting anything printed out; however, when run with -log_view I see that Viewer is created 4 times but destroyed only 3 times. The way I see it, I have destroyed it wherever I no longer have use for it, so I am not sure if I am wrong. Could this be the reason why it keeps crashing? It crashes as soon as it reads the matrix, before entering the solving stage (I have a print statement before solving starts that never prints).
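
A minimal sketch of a balanced load pattern, in case it helps spot the unbalanced viewer (inside an already initialized PETSc program; the file name is illustrative and A is assumed to have been created with MatCreate() beforehand). Note also that a viewer created internally by options such as -log_view or -ksp_view may show up in the object counts without being a leak in the user code:

  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);   /* destroy as soon as the load is done */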

How I run it in the job script on 2 nodes with 32 processors, using the cluster's OpenMPI:

mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view

The matrix: 25 947 279 x 25 947 279, with 2 122 821 366 non-zero elements.
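
As an aside on the size: 2 122 821 366 nonzeros is already close to the 32-bit signed-integer limit,

  2^31 - 1 = 2 147 483 647,

and intermediate quantities formed during the GAMG setup (for example the nonzero count of B^T*A) can exceed it, which is in line with the suggestion quoted above to use 64-bit integers.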

Thanks and all the best


--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/