<div dir="ltr"><div>Thank you Zhang for that. I may be confused by the terminology, but when you ran it on 16 nodes (128GB each) with 576 ranks, that should give 2048GB (~2TB) of memory in total, right? Does this mean that MPI does not work for SuperLU_dist? Unfortunately I cannot run the code on a single node, as our nodes are limited to 500GB of memory, so I was hoping to use several nodes to get more memory.</div><div><br> </div><div>If anyone else has a suggestion for an iterative solver, please do recommend it; I am a bit stuck, since none of the combinations I have tried so far has worked.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 25, 2019 at 5:32 PM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov">jczhang@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">Hi, Sal Am,
<div> I did some tests with your matrix and vector. It is a complex matrix with N=4.7M and nnz=417M. First, I tested on a machine with 36 cores and 128GB of memory per compute node. I tried both a direct solver and an iterative solver, but both failed. For example,
with 36 ranks on one compute node, I got</div>
</div>
</div>
<blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px">
<div dir="ltr">
<div dir="ltr">
<div>[9]PETSC ERROR: [9] SuperLU_DIST:pzgssvx line 465 /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c</div>
<div>[9]PETSC ERROR: [9] MatLUFactorNumeric_SuperLU_DIST line 314 /blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c</div>
</div>
</div>
</blockquote>
<div dir="ltr">
<div dir="ltr">
<div> With 16 nodes and 576 ranks, I got </div>
</div>
</div>
</div>
</div>
<blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>SUPERLU_MALLOC fails for GAstore->rnzval[] at line 240 in file /blues/gpfs/home/jczhang/petsc/bdw-dbg-complex/externalpackages/git.superlu_dist/SRC/pzutil.c</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
Next, I moved to a single-node machine with 1.5TB of memory. It did not fail this time: it ran overnight and is still running in SuperLU. Using the top command, I found that at peak it consumed almost all of the memory. In the stable period, with 36 ranks, each rank consumed
about 20GB of memory. When I switched to an iterative solver with -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable, I did not hit the errors seen on the smaller-memory machine, but the residual did not converge. </div>
<div dir="ltr"> So I think the errors you met were simply out-of-memory errors, either in SuperLU or in PETSc. If you have machines with large memory, you can try it there. Otherwise, I will let other PETSc developers suggest better iterative solvers to you.</div>
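<div>As a rough illustration of why 2TB in aggregate may still not be enough, the sketch below works out the average memory budget per rank from the figures in this message. The last step is only an assumption: it supposes that every rank holds its own full copy of the array of nonzero values, which is suggested by the name GAstore in the error message but is not confirmed anywhere in this thread.</div>

```python
# Back-of-envelope memory budget for the runs described above.
# All figures come from this message; GB means 10**9 bytes here.

GB = 10**9


def per_rank_budget_gb(nodes, mem_per_node_gb, ranks):
    """Average memory available to each MPI rank, in GB."""
    return nodes * mem_per_node_gb / ranks


nnz = 417_000_000      # nonzeros reported for the test matrix
bytes_per_value = 16   # one double-precision complex number

# Size of a single array holding every nonzero value of the matrix.
values_array_gb = nnz * bytes_per_value / GB

# 16 nodes x 128 GB with 576 ranks, i.e. 36 ranks per node.
total_gb = 16 * 128
budget_gb = per_rank_budget_gb(16, 128, 576)
ranks_per_node = 576 // 16

# ASSUMPTION (not confirmed in the thread): each rank on a node allocates
# its own full copy of the values array.
replicated_per_node_gb = ranks_per_node * values_array_gb

# Observed on the 1.5TB machine: 36 ranks at about 20 GB each.
observed_steady_gb = 36 * 20

print(f"aggregate memory:            {total_gb} GB")
print(f"average budget per rank:     {budget_gb:.2f} GB")
print(f"one full values array:       {values_array_gb:.2f} GB")
print(f"replicated, per node:        {replicated_per_node_gb:.0f} GB")
print(f"observed steady-state usage: {observed_steady_gb} GB")
```

<div>Under that assumption, a single copy of the values array is already larger than the average per-rank budget, so adding nodes at 36 ranks per node would not help; running fewer ranks per node would be the variable to change.</div>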
<div> Thanks. </div>
<div dir="ltr">
<blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>
<div>
<div dir="ltr" class="gmail-m_-7631701108731935058m_-5476810978110225058gmail-m_-8651593192116187479gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail_attr">
On Wed, Jan 23, 2019 at 2:52 AM Sal Am <<a href="mailto:tempohoper@gmail.com" target="_blank">tempohoper@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Sorry this took so long; I had to see whether I could shrink the problem files from 50GB to something smaller (now ~10GB).</div>
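<div>For reference, the two file sizes are roughly what a compressed-row (AIJ) layout would predict. This is only a sketch: it assumes 16-byte complex values and 8-byte indices (a 64-bit-index build), and the helper name aij_storage_gb is made up for this example. The row count of 4.7M for the smaller matrix is the figure reported in the reply above.</div>

```python
def aij_storage_gb(nnz, nrows, value_bytes=16, index_bytes=8):
    """Approximate size in GB (10**9 bytes) of a compressed-row matrix.

    Counts the nonzero values, one column index per nonzero,
    and nrows + 1 row pointers. Hypothetical helper, for illustration only.
    """
    total = nnz * (value_bytes + index_bytes) + (nrows + 1) * index_bytes
    return total / 10**9


# Smaller test case: about 417M nonzeros, about 4.7M rows.
small_gb = aij_storage_gb(417_000_000, 4_700_000)

# Original case: 2 122 821 366 nonzeros, 25 947 279 rows.
large_gb = aij_storage_gb(2_122_821_366, 25_947_279)

print(f"smaller matrix: about {small_gb:.1f} GB")
print(f"larger matrix:  about {large_gb:.1f} GB")
```

<div>This is the size of the matrix alone; a direct factorization or the multigrid setup needs considerably more than that.</div>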
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Can you compress your matrix and upload it to google drive, so we can try to reproduce the error.
</div>
</blockquote>
<div> </div>
<div>How I ran the problem: mpiexec valgrind --tool=memcheck --suppressions=$HOME/valgrind/valgrind-openmpi.supp -q --num-callers=20 --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it
1 -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged</div>
<div><br>
</div>
<div>here is the link to matrix A and vector b: <a href="https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing" target="_blank">
https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing</a></div>
<div><br>
</div>
<div>I redid the problem (twice), trying to solve a problem with 1M finite elements, corresponding to a matrix with n ~ 4M and 417M nonzeros, on the login shell, which has ~550GB of memory, but it failed. The first time it failed with a bus error; the second time it just said "killed".
I have attached the valgrind files from both runs.<br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>OpenMPI is not my favorite. You need to use a suppressions file to get rid of all of that noise. Here is one:
</div>
</blockquote>
<div><br>
</div>
<div>Thanks, I have been using it, but sometimes I still see the same number of errors.<br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail_attr">
On Fri, Jan 18, 2019 at 3:12 AM Zhang, Junchao <<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">Usually when I meet a SEGV error, I will run it again with a parallel debugger like DDT and wait for it to segfault, and then examine the stack trace to see what is wrong.
<div>Can you compress your matrix and upload it to google drive, so we can try to reproduce the error.</div>
<div>
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail_attr">
On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div>I did two runs, one with the SuperLU_dist and one with bcgs using jacobi, attached are the results of one of the reports from valgrind on one random processor (out of the 128 files).</div>
<div><br>
</div>
<div>DS = direct solver <br>
</div>
<div>IS = iterative solver <br>
</div>
<div><br>
</div>
<div>There are an awful lot of errors. <br>
</div>
<div><br>
</div>
<div>how I initiated the two runs: <br>
</div>
<div>mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type jacobi -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor -log_view<br>
</div>
<div><br>
</div>
<div>mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged
<br>
</div>
<div><br>
</div>
<div>Thank you <br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail_attr">
On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">On Thu, Jan 17, 2019 at 9:18 AM Sal Am <<a href="mailto:tempohoper@gmail.com" target="_blank">tempohoper@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>1) Running out of memory</div>
<div><br>
</div>
<div>2) You passed an invalid array</div>
</blockquote>
<div>I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script, i.e. using 300GB/node, a total of 1200GB memory, using 4 nodes and 32 processors per node (128 processors in total).
<br>
</div>
<div>I am not sure what would constitute an invalid array or how I can check that. I am using the same procedure as when dealing with the smaller matrix, i.e. generate matrix A and vector b using FEM software, convert them with a Python
script into a format ready for PETSc, then read them into PETSc and solve. <br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>Are you running with 64-bit ints here?</div>
</div>
</blockquote>
<div>Yes, I have configured PETSc with --with-64-bit-indices and in debugging mode, and that is the build this was run on. </div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>It sounds like you have enough memory, but the fact that it runs for smaller problems makes me suspicious. It</div>
<div>could still be a memory overwrite. Can you either</div>
<div><br>
</div>
<div>a) Run under valgrind</div>
<div><br>
</div>
<div>or</div>
<div><br>
</div>
<div>b) Run under the debugger and get a stack trace</div>
<div><br>
</div>
<div> ?</div>
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail-m_8896819920196221123gmail-m_-209344416874752882gmail_attr">
On Thu, Jan 17, 2019 at 1:59 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">On Thu, Jan 17, 2019 at 8:16 AM Sal Am <<a href="mailto:tempohoper@gmail.com" target="_blank">tempohoper@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>SuperLU_dist supports 64-bit ints. Are you not running in parallel?</div>
</blockquote>
<div>I will try that, although I think solving the real problem (later on, if I can get this to work) with 30 million finite elements might be too much for SuperLU_dist, so it is better to get an iterative solver working first.<br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>1) Try using -build_twosided allreduce on this run</div>
</div>
</blockquote>
<div>How I ran it: mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-osaii.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor -log_view</div>
<div>I have attached the full error output.</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>You are getting an SEGV on MatSetValues(), so it's either</div>
<div><br>
</div>
<div>1) Running out of memory</div>
<div><br>
</div>
<div>2) You passed an invalid array</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>2) Is it possible to get something that fails here but we can run. None of our tests show this problem.</div>
</div>
</blockquote>
<div>I am not sure how I can do that, but I have added my code, which is quite short and should only read and solve the system. The problem arises with larger matrices; for example, the current test case has 6 million finite elements (~2B nonzeros and a 25M x 25M matrix). </div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Are you running with 64-bit ints here?</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class="gmail_quote">
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail-m_8896819920196221123gmail-m_-209344416874752882gmail-m_-2293910778617404818gmail-m_-2927092768489309897gmail_attr">
On Wed, Jan 16, 2019 at 1:12 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">On Wed, Jan 16, 2019 at 3:52 AM Sal Am via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
The memory requested is an insane number. You may need to use 64 bit integers. </blockquote>
<div>Thanks Mark, I reconfigured it to use 64-bit integers; however, in the process it says I can no longer use MUMPS and SuperLU as they are not supported (though I see on the MUMPS webpage that it supports 64-bit integers). However, it does not exactly solve the problem.</div>
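<div>One way such an "insane" request can arise, sketched here with the nonzero count of the 25M x 25M matrix: the count itself still fits in a signed 32-bit integer, but a byte count derived from it does not. The wraparound shown is the typical two's-complement behaviour, emulated with ctypes; whether this is what actually happened inside PETSc is an assumption, not something established in this thread.</div>

```python
import ctypes

INT32_MAX = 2**31 - 1


def as_int32(value):
    """Return value wrapped into a signed 32-bit integer (two's complement)."""
    return ctypes.c_int32(value).value


nnz = 2_122_821_366   # nonzeros of the 25M x 25M matrix
value_bytes = 16      # one double-precision complex number

# The nonzero count itself is still representable, with little headroom.
print("nnz fits in int32:", as_int32(nnz) == nnz)
print("headroom left:    ", INT32_MAX - nnz)

# A byte count computed from it is not representable in 32 bits.
byte_count = nnz * value_bytes
print("bytes for the values array:   ", byte_count)
print("same number as a 32-bit value:", as_int32(byte_count))
```

<div>With so little headroom, any intermediate count larger than the matrix itself, for example the nonzeros of a matrix product, would also exceed the 32-bit range.</div>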
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>SuperLU_dist supports 64-bit ints. Are you not running in parallel?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>This time, it crashes at <span class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail-m_8896819920196221123gmail-m_-209344416874752882gmail-m_-2293910778617404818gmail-m_-2927092768489309897gmail-m_6683578266348445832gmail-m_-3814543233888417572gmail-im">
<div>[6]PETSC ERROR: #1 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() line 1989 in /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c<br>
</div>
</span>
<div>ierr = PetscMalloc1(bi[pn]+1,&bj);<br>
</div>
<div>which allocates local portion of B^T*A. </div>
<div>You may also try to increase number of cores to reduce local matrix size.</div>
</div>
</blockquote>
<div><br>
</div>
<div>So I increased the number of cores to 16 on one node and ran it by :</div>
<div>mpiexec valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log-osa.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -mattransposematmult_via scalable -ksp_monitor -log_view
<br>
</div>
<div>It crashed after reading in the matrix and before starting to solve. The error:
<br>
</div>
<div><br>
</div>
<div>[15]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------<br>
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end<br>
[0]PETSC ERROR: [1]PETSC ERROR: ------------------------------------------------------------------------<br>
[2]PETSC ERROR: ------------------------------------------------------------------------<br>
[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end<br>
[3]PETSC ERROR: ------------------------------------------------------------------------<br>
[4]PETSC ERROR: ------------------------------------------------------------------------<br>
[4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR: ------------------------------------------------------------------------<br>
[8]PETSC ERROR: ------------------------------------------------------------------------<br>
[12]PETSC ERROR: ------------------------------------------------------------------------<br>
[12]PETSC ERROR: [14]PETSC ERROR: ------------------------------------------------------------------------<br>
[14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end<br>
--------------------------------------------------------------------------<br>
mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on signal 9 (Killed).</div>
<div><br>
</div>
<div>Now I was running this with valgrind as someone had previously suggested and the 16 files created all contain the same type of error:</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Okay, it's possible that there are bugs in the MPI implementation. So</div>
<div><br>
</div>
<div>1) Try using -build_twosided allreduce on this run</div>
<div><br>
</div>
<div>2) Is it possible to get something that fails here but we can run. None of our tests show this problem.</div>
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>==25940== Invalid read of size 8<br>
==25940== at 0x5103326: PetscCheckPointer (checkptr.c:81)<br>
==25940== by 0x4F42058: PetscCommGetNewTag (tagm.c:77)<br>
==25940== by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:373)<br>
==25940== by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)<br>
==25940== by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)<br>
==25940== by 0x52D6B42: VecAssemblyBegin (vector.c:140)<br>
==25940== by 0x5328C97: VecLoad_Binary (vecio.c:141)<br>
==25940== by 0x5329051: VecLoad_Default (vecio.c:516)<br>
==25940== by 0x52E0BAB: VecLoad (vector.c:933)<br>
==25940== by 0x4013D5: main (solveCmplxLinearSys.cpp:31)<br>
==25940== Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd<br>
==25940== at 0x4C2A603: memalign (vg_replace_malloc.c:899)<br>
==25940== by 0x4FD0B0E: PetscMallocAlign (mal.c:41)<br>
==25940== by 0x4FD23E7: PetscMallocA (mal.c:397)<br>
==25940== by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:371)<br>
==25940== by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)<br>
==25940== by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)<br>
==25940== by 0x52D6B42: VecAssemblyBegin (vector.c:140)<br>
==25940== by 0x5328C97: VecLoad_Binary (vecio.c:141)<br>
==25940== by 0x5329051: VecLoad_Default (vecio.c:516)<br>
==25940== by 0x52E0BAB: VecLoad (vector.c:933)<br>
==25940== by 0x4013D5: main (solveCmplxLinearSys.cpp:31)<br>
==25940==<br>
<br>
</div>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <<a href="mailto:hzhang@mcs.anl.gov" target="_blank">hzhang@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr">Fande:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div>According to this PR <a href="https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff" target="_blank">https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff</a><br>
</div>
<div><br>
</div>
<div>Should we set the scalable algorithm as default?</div>
</div>
</div>
</blockquote>
<div>Sure, we can. But I feel we need to do more tests to compare the scalable and non-scalable algorithms. </div>
<div>In theory, for small to medium matrices, the non-scalable matmatmult() algorithm enables more efficient </div>
<div>data access. Andreas optimized the scalable implementation. Our non-scalable implementation might have room for further optimization. </div>
<div>Hong</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Add option '-mattransposematmult_via scalable'</div>
<div>Hong</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">I saw the following error message in your first email.</div>
</div>
<blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px">
<div>
<div>
<div>
<div>[0]PETSC ERROR: Out of memory. This could be due to allocating</div>
</div>
</div>
</div>
<div>
<div>
<div>
<div>[0]PETSC ERROR: too large an object or bleeding by not properly</div>
</div>
</div>
</div>
<div>
<div>
<div>
<div>[0]PETSC ERROR: destroying unneeded objects.</div>
</div>
</div>
</div>
</blockquote>
<div dir="ltr">
<div dir="ltr">
<div>Probably the matrix is too large. You can try with more compute nodes, for example, use 8 nodes instead of 2, and see what happens.</div>
<div><br clear="all">
<div>
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail-m_8896819920196221123gmail-m_-209344416874752882gmail-m_-2293910778617404818gmail-m_-2927092768489309897gmail-m_6683578266348445832gmail-m_-3814543233888417572gmail-m_3659884561732865489gmail-m_3948349922729883348gmail-m_-4238683586953860770gmail-m_-3160957232454552718gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Using a larger problem set with 2B non-zero elements and a matrix of 25M x 25M I get the following error:</div>
<div>[4]PETSC ERROR: ------------------------------------------------------------------------<br>
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br>
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>
[4]PETSC ERROR: or see <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
[4]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br>
[4]PETSC ERROR: likely location of problem given in stack below<br>
[4]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>
[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br>
[4]PETSC ERROR: INSTEAD the line number of the start of the function<br>
[4]PETSC ERROR: is given.<br>
[4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c<br>
[4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 /lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c<br>
[4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c<br>
[4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c<br>
[4]PETSC ERROR: [4] MatTransposeMatMult line 9950 /lustre/home/vef002/petsc/src/mat/interface/matrix.c<br>
[4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c<br>
[4]PETSC ERROR: [4] PCSetUp_GAMG line 428 /lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c<br>
[4]PETSC ERROR: [4] PCSetUp line 894 /lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c<br>
[4]PETSC ERROR: [4] KSPSetUp line 304 /lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c<br>
[4]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------<br>
[4]PETSC ERROR: Signal received<br>
[4]PETSC ERROR: See <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html" target="_blank">
http://www.mcs.anl.gov/petsc/documentation/faq.html</a> for trouble shooting.<br>
[4]PETSC ERROR: Petsc Release Version 3.10.2, unknown <br>
[4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri Jan 11 09:13:23 2019<br>
[4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug --with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc --with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort --with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx --download-parmetis
--download-metis --download-ptscotch --download-superlu_dist --download-mumps --with-scalar-type=complex --with-debugging=yes --download-scalapack --download-superlu --download-fblaslapack=1 --download-cmake<br>
[4]PETSC ERROR: #1 User provided function() line 0 in unknown file<br>
--------------------------------------------------------------------------<br>
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD<br>
with errorcode 59.<br>
<br>
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.<br>
You may or may not see output from other processes, depending on<br>
exactly when Open MPI kills them.<br>
--------------------------------------------------------------------------<br>
[0]PETSC ERROR: ------------------------------------------------------------------------<br>
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end<br>
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>
[0]PETSC ERROR: or see <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
</div>
<div><br>
</div>
<div>Looking at just one of the valgrind log files, the following error was written:</div>
<div><br>
</div>
<div>==9053== Invalid read of size 4<br>
==9053== at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)<br>
==9053== by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)<br>
==9053== by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable (mpimatmatmult.c:1337)<br>
==9053== by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:1186)<br>
==9053== by 0x5457C57: MatTransposeMatMult (matrix.c:9984)<br>
==9053== by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)<br>
==9053== by 0x64C7527: PCSetUp_GAMG (gamg.c:522)<br>
==9053== by 0x6592AA0: PCSetUp (precon.c:932)<br>
==9053== by 0x66B1267: KSPSetUp (itfunc.c:391)<br>
==9053== by 0x4019A2: main (solveCmplxLinearSys.cpp:68)<br>
==9053== Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd<br>
==9053==<br>
<br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Jan 11, 2019 at 8:41 AM Sal Am <<a href="mailto:tempohoper@gmail.com" target="_blank">tempohoper@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div>Thank you Dave,</div>
<div><br>
</div>
<div>I reconfigured PETSc with valgrind and debugging mode, I ran the code again with the following options:</div>
<div>mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type gamg -log_view</div>
<div>(as on the petsc website you linked)</div>
<div><br>
</div>
<div>It finished solving using the iterative solver, but the resulting valgrind.log.%p files (all 8, one per process) are empty. And it took a whopping ~15 hours for what used to take ~10-20 min. Maybe this is because of valgrind? I am not
sure. Attached is the log_view.<br>
</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Thu, Jan 10, 2019 at 8:59 AM Dave May <<a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Thu, 10 Jan 2019 at 08:55, Sal Am via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>I am not sure what exactly is wrong, as the error changes slightly every time I run it (without changing the parameters).</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>This likely implies that you have a memory error in your code (a memory leak would not cause this behaviour).</div>
<div>I strongly suggest you make sure your code is free of memory errors.</div>
<div>You can do this using valgrind. See here </div>
<div><br>
</div>
<div><a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
</div>
<div><br>
</div>
<div>for an explanation of how to use valgrind.</div>
<div> <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>I have attached the errors from the first two runs and my code. <br>
</div>
<div><br>
</div>
<div>Is there a memory leak somewhere? I have tried running it with -malloc_dump, but nothing gets printed out. However, when run with -log_view I see that Viewer is created 4 times but destroyed only 3 times. As far as I can tell, I destroy it at the point where
I no longer need it, so I am not sure what I am missing. Could this be the reason why it keeps crashing? It crashes as soon as it reads the matrix, before entering the solve (I have a print statement before the solve starts that never prints).<br>
</div>
<div><br>
</div>
<div>How I run it in the job script on 2 nodes with 32 processors using the cluster's OpenMPI:
<br>
</div>
<div><br>
</div>
<div>mpiexec ./solveCSys -ksp_type bcgs -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual -log_view -ksp_error_if_not_converged -ksp_monitor -malloc_log -ksp_view</div>
<div><br>
</div>
<div>the matrix:</div>
<div>2 122 821 366 (non-zero elements)<br>
</div>
<div>25 947 279 x 25 947 279<br>
</div>
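<div>For scale, here is a rough estimate of what the matrix alone occupies in compressed-row storage. It is only a sketch: it assumes complex double-precision values (16 bytes each) and 32-bit indices, neither of which is stated above, and it ignores the preconditioner hierarchy and the solver's work vectors, which come on top.</div>
<pre>
```python
# Figures quoted in the message above.
N_ROWS = 25_947_279
NNZ = 2_122_821_366

# Assumptions (not stated in the thread): complex double scalars, 32-bit indices.
BYTES_PER_VALUE = 16
BYTES_PER_INDEX = 4


def csr_bytes(n_rows, nnz, value_bytes=BYTES_PER_VALUE, index_bytes=BYTES_PER_INDEX):
    """Bytes for compressed-row storage: values, column indices, row pointers."""
    return nnz * value_bytes + nnz * index_bytes + (n_rows + 1) * index_bytes


total = csr_bytes(N_ROWS, NNZ)
print(f"average nonzeros per row: {NNZ / N_ROWS:.1f}")
print(f"matrix storage: {total / 2**30:.1f} GiB")
# Even split over the 32 processes used in the job script (an idealisation).
print(f"per process on 32 processes: {total / 32 / 2**30:.2f} GiB")
```
</pre>
<div>Under these assumptions the assembled matrix is in the tens of GiB, so any extra copy made while loading or preconditioning is significant on its own.</div>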
<div><br>
</div>
<div>Thanks and all the best<br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr" class="gmail-m_-7631701108731935058gmail-m_-5476810978110225058gmail-m_-8651593192116187479gmail-m_2271295961197387541gmail-m_9078405251732283825gmail-m_8401383685220485438gmail-m_8896819920196221123gmail-m_-209344416874752882gmail-m_-2293910778617404818gmail-m_-2927092768489309897gmail-m_6683578266348445832gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener</div>
<div><br>
</div>
<div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote></div>