[petsc-users] Tough to reproduce petsctablefind error
Junchao Zhang
junchao.zhang at gmail.com
Thu Sep 24 15:31:06 CDT 2020
That error stack was from Fande Kong. We should wait for Chris's.
--Junchao Zhang
On Thu, Sep 24, 2020 at 2:42 PM Barry Smith <bsmith at petsc.dev> wrote:
>
> The stack is listed below. It crashes inside MatPtAP().
>
> It is possible there is some subtle bug in the rather complex PETSc code
> for MatPtAP(), but I am inclined to blame MPI again.
>
> I think we should add some simple, low-overhead, always-on communication
> error-detection code to PetscSF, where checksums are also communicated
> at the highest level of PetscSF.
>
> I don't know exactly how, but perhaps when the data is packed per destination
> rank a checksum is computed, and when it is unpacked the checksum is compared,
> using extra space at the end of the communicated packed array to store and
> send the checksum. Yes, it is kind of odd for a high-level library like
> PETSc not to trust the communication channel, but MPI implementations have
> proven themselves untrustworthy, and adding this to PetscSF is not
> intrusive to the PETSc API or to users. Plus it gives a definitive yes or no as
> to whether the problem is an error in the communication.
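>
> A rough sketch of the idea, with hypothetical helper names rather than the
> actual PetscSF pack/unpack internals: one extra slot at the end of the packed
> buffer for each destination rank carries a simple checksum that the receiver
> recomputes and compares on unpack.
>
> #include <petscsys.h>
> #include <string.h>
>
> /* Hypothetical helpers, not the actual PetscSF code: a simple additive byte
>    checksum stored in one extra slot at the end of the packed array. */
> static unsigned long long BufChecksum(const void *buf,size_t nbytes)
> {
>   const unsigned char *p = (const unsigned char*)buf;
>   unsigned long long   s = 0;
>   size_t               i;
>   for (i=0; i<nbytes; i++) s += p[i];
>   return s;
> }
>
> /* Sender side: copy the payload and append its checksum */
> static PetscErrorCode PackWithChecksum(size_t nbytes,const void *data,unsigned char *buf)
> {
>   unsigned long long sum = BufChecksum(data,nbytes);
>
>   PetscFunctionBegin;
>   memcpy(buf,data,nbytes);
>   memcpy(buf+nbytes,&sum,sizeof(sum));
>   PetscFunctionReturn(0);
> }
>
> /* Receiver side: recompute and compare; a mismatch means the message was
>    corrupted somewhere in the communication */
> static PetscErrorCode UnpackWithChecksum(size_t nbytes,const unsigned char *buf,void *data)
> {
>   unsigned long long sent,recomputed = BufChecksum(buf,nbytes);
>
>   PetscFunctionBegin;
>   memcpy(data,buf,nbytes);
>   memcpy(&sent,buf+nbytes,sizeof(sent));
>   if (recomputed != sent) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"Checksum mismatch: packed message corrupted in communication");
>   PetscFunctionReturn(0);
> }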
>
> Barry
>
> On Sep 24, 2020, at 12:35 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
> On Thu, Sep 24, 2020 at 1:22 PM Chris Hewson <chris at resfrac.com> wrote:
>
>> Hi Guys,
>>
>> Thanks for all of the prompt responses, very helpful and appreciated.
>>
>> By "when debugging", did you mean when you configure petsc with
>> --with-debugging=1 COPTFLAGS="-O0 -g" etc., or when you attach a debugger?
>> - Both. I have run with a debugger attached and detached, all compiled
>> with the following flags: "--with-debugging=1 COPTFLAGS=-O0 -g"
>>
>> 1) Try OpenMPI (probably won't help, but worth trying)
>> - Worth a try for sure
>>
>> 2) Find which part of the simulation makes it non-deterministic. Is it
>> the mesh partitioning (parmetis)? Then try to make it deterministic.
>> - Good tip, it is the mesh partitioning and along the lines of a question
>> from Barry, the matrix size is changing. I will make this deterministic and
>> give it a try
>>
>> 3) Dump matrices, vectors, etc and see when it fails, you can quickly
>> reproduce the error by reading in the intermediate data.
>> - Also a great suggestion, will give it a try
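>>
>> (A minimal sketch of that dump/reload, using PETSc's binary viewer; the file
>> and function names below are just illustrative:)
>>
>> #include <petscmat.h>
>>
>> /* Dump the operator right before the step that sometimes fails */
>> PetscErrorCode DumpMat(Mat A)
>> {
>>   PetscViewer    viewer;
>>   PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"failing_mat.bin",FILE_MODE_WRITE,&viewer);CHKERRQ(ierr);
>>   ierr = MatView(A,viewer);CHKERRQ(ierr);
>>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> }
>>
>> /* In a small standalone driver, read it back and retry the failing operation */
>> PetscErrorCode LoadMat(Mat *A)
>> {
>>   PetscViewer    viewer;
>>   PetscErrorCode ierr;
>>
>>   PetscFunctionBegin;
>>   ierr = MatCreate(PETSC_COMM_WORLD,A);CHKERRQ(ierr);
>>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"failing_mat.bin",FILE_MODE_READ,&viewer);CHKERRQ(ierr);
>>   ierr = MatLoad(*A,viewer);CHKERRQ(ierr);
>>   ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> }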
>>
>> The full stack would be really useful here. I am guessing this happens on
>> MatMult(), but I do not know.
>> - Agreed. I am currently running it so that the full stack will be
>> produced, but am waiting for it to fail. It had been compiled with all available
>> optimizations on; the downside, of course, is less useful information if there
>> is a failure.
>> As a general question, roughly what is the performance impact on PETSc
>> of -O1/-O2/-O0 as opposed to -O3? And of --with-debugging=1?
>> Obviously it is problem/machine dependent; I am wondering about rough guidance
>> more than anything.
>>
>> Is the nonzero structure of your matrices changing or is it fixed for the
>> entire simulation?
>> The non-zero structure is changing, although the structures are rebuilt
>> when this happens, and this has happened thousands of times before the failure
>> occurred.
>>
>
> Okay, this is the most likely spot for a bug. How are you changing the
> matrix? It should be impossible to put in an invalid column index when
> using MatSetValues()
> because we check all the inputs. However, I do not think we check when you
> just yank out the arrays.
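>
> (Purely for illustration, with placeholder variables: MatSetValues() goes
> through the checked insertion path, while writing directly into the internal
> CSR arrays bypasses those checks.)
>
> #include <petscmat.h>
>
> /* Checked path: per the note above, the inputs to MatSetValues() are checked,
>    so an invalid column index should be caught at insertion time. */
> PetscErrorCode InsertChecked(Mat A,PetscInt row,PetscInt col,PetscScalar v)
> {
>   PetscErrorCode ierr;
>
>   PetscFunctionBegin;
>   ierr = MatSetValues(A,1,&row,1,&col,&v,INSERT_VALUES);CHKERRQ(ierr);
>   ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   /* By contrast, modifying the matrix's internal index arrays directly is not
>      validated, so a bad column index would only surface later, e.g. during
>      MatSetUpMultiply_MPIAIJ(). */
>   PetscFunctionReturn(0);
> }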
>
> Thanks,
>
> Matt
>
>
>> Does this particular run always crash at the same place? Similar place?
>> Doesn't always crash?
>> Doesn't always crash, but other similar runs have crashed in different
>> spots, which makes it difficult to resolve. I am going to try out a few of
>> the strategies suggested above and will let you know what comes of that.
>>
>> *Chris Hewson*
>> Senior Reservoir Simulation Engineer
>> ResFrac
>> +1.587.575.9792
>>
>>
>> On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>> Chris,
>>>
>>> We realize how frustrating this type of problem is to deal with.
>>>
>>> Here is the code:
>>>
>>>   ierr = PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>>>   for (i=0; i<aij->B->rmap->n; i++) {
>>>     for (j=0; j<B->ilen[i]; j++) {
>>>       PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>>>       ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>>>       if (!data) {
>>>         /* one based table */
>>>         ierr = PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>>>       }
>>>     }
>>>   }
>>>
>>> It is simply looping over the rows of the sparse matrix putting the
>>> columns it finds into the hash table.
>>>
>>> aj[B->i[i] + j] are the column entries; the number of columns in the
>>> matrix is mat->cmap->N, so the column entries should always be
>>> less than the number of columns. The code is crashing when it hits column
>>> entry 24443, which is larger than the number of columns, 23988.
>>>
>>> So either the aj[B->i[i] + j] + 1 are incorrect or mat->cmap->N is
>>> incorrect.
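>>>
>>> (A sketch of the kind of extra check that could be dropped in just before the
>>> table lookup, reusing the variables from the loop above -- this is not code
>>> that is currently in PETSc:)
>>>
>>>   PetscInt i,j;   /* same loop indices as in the snippet above */
>>>   for (i=0; i<aij->B->rmap->n; i++) {
>>>     for (j=0; j<B->ilen[i]; j++) {
>>>       PetscInt gid1 = aj[B->i[i] + j] + 1;   /* one-based global column id */
>>>       if (gid1 < 1 || gid1 > mat->cmap->N+1) {
>>>         SETERRQ4(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Row %D entry %D: column id %D exceeds number of columns %D",i,j,gid1-1,mat->cmap->N);
>>>       }
>>>     }
>>>   }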
>>>
>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in
>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>
>>>   if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>>>     ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>>>   }
>>>
>>> This seems to indicate it is setting up a new multiply because either it is
>>> the first time into the algorithm or the nonzero structure changed on some
>>> rank, requiring a new assembly process.
>>>
>>> Is the nonzero structure of your matrices changing or is it fixed
>>> for the entire simulation?
>>>
>>> Since the code has been running for a very long time already I have to
>>> conclude that this is not the first time through and so something has
>>> changed in the matrix?
>>>
>>> I think we have to put more diagnostics into the library to provide more
>>> information before or at the time of the error detection.
>>>
>>> Does this particular run always crash at the same place? Similar
>>> place? Doesn't always crash?
>>>
>>> Barry
>>>
>>>
>>>
>>>
>>> On Sep 24, 2020, at 8:46 AM, Chris Hewson <chris at resfrac.com> wrote:
>>>
>>> After about a month of not having this issue pop up, it has come up again
>>>
>>> We have been struggling with a similar PETSc error for a while now; the
>>> error message is as follows:
>>>
>>> [7]PETSC ERROR: PetscTableFind() line 132 in
>>> /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than
>>> largest key allowed 23988
>>>
>>> It is a particularly nasty bug, as it doesn't reproduce itself when
>>> debugging and doesn't always happen with the same inputs either. The
>>> problem occurs after a long runtime of the code (12-40 hours), and we are
>>> using a KSP solve with KSPBCGS.
>>>
>>> The PETSc compilation options that are used are:
>>>
>>> --download-metis
>>> --download-mpich
>>> --download-mumps
>>> --download-parmetis
>>> --download-ptscotch
>>> --download-scalapack
>>> --download-suitesparse
>>> --prefix=/opt/anl/petsc-3.13.3
>>> --with-debugging=0
>>> --with-mpi=1
>>> COPTFLAGS=-O3 -march=haswell -mtune=haswell
>>> CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>> FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>
>>> This is being run across 8 processes and is failing consistently on the
>>> rank 7 process. We also use OpenMP outside of PETSc and the linear solve
>>> portion of the code. The rank 0 process is always computing; during this,
>>> the slave processes use an MPI_Wait call to wait for the collective
>>> parts of the code to be called again. All PETSc calls are done across all
>>> of the processes.
>>>
>>> We are using MPICH version 3.3.2, downloaded with the PETSc 3.13.3
>>> package.
>>>
>>> At every PETSc call we check the error return from the function
>>> collectively, to ensure that no errors have been returned from PETSc.
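>>>
>>> (For context, a minimal sketch of that kind of collective check; the helper
>>> name is illustrative, not our actual code:)
>>>
>>> #include <petscsys.h>
>>>
>>> /* Every rank contributes its local PETSc return code; if any rank failed,
>>>    all ranks find out together instead of hanging in the next collective. */
>>> PetscErrorCode CheckErrCollectively(MPI_Comm comm,PetscErrorCode ierr)
>>> {
>>>   PetscMPIInt local = (PetscMPIInt)ierr,global = 0;
>>>
>>>   MPI_Allreduce(&local,&global,1,MPI_INT,MPI_MAX,comm);
>>>   if (global) SETERRQ1(comm,PETSC_ERR_LIB,"A PETSc call failed on at least one rank (error code %d)",(int)global);
>>>   return 0;
>>> }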
>>>
>>> Some possible causes that I can think of are as follows:
>>> 1. A memory leak causing corruption either in our program, in PETSc, or
>>> in one of the PETSc objects. This seems unlikely, as we have checked runs
>>> with the -malloc_dump option for PETSc and with valgrind.
>>>
>>> 2. Optimization flags set for petsc compilation are causing variables
>>> that go out of scope to be optimized out.
>>>
>>> 3. We are giving the wrong number of elements for a process or the value
>>> is changing during the simulation. This seems unlikely as there is nothing
>>> overly unique about these simulations and it's not reproducing itself.
>>>
>>> 4. An MPI channel or socket error causing an error in the collective
>>> values for PETSc.
>>>
>>> Any input on this issue would be greatly appreciated.
>>>
>>> *Chris Hewson*
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>>
>>>
>>> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>> That is a great idea. I'll figure it out.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Junchao,
>>>>>
>>>>> Is there any way in the PETSc configure to warn that an MPICH version is "bad"
>>>>> or "untrustworthy", or even the vaguer "we have suspicions about this version
>>>>> and recommend avoiding it"? A lot of time could be saved if others don't have
>>>>> to deal with the same problem.
>>>>>
>>>>> Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc., and
>>>>> always check against that list and print a boxed warning at configure time?
>>>>> Better yet, you could somehow generalize it and put it in package.py for use by
>>>>> all packages; then any package could include lists of "suspect" versions.
>>>>> (There are definitely HDF5 versions that should be avoided :-)).
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <chris at resfrac.com>
>>>>> wrote:
>>>>>
>>>>>> Just as an update to this, I can confirm that using the mpich version
>>>>>> (3.3.2) downloaded with the petsc download solved this issue on my end.
>>>>>>
>>>>>> *Chris Hewson*
>>>>>> Senior Reservoir Simulation Engineer
>>>>>> ResFrac
>>>>>> +1.587.575.9792
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <
>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>
>>>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsmith at petsc.dev>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Is there a comprehensive MPI test suite (perhaps from MPICH)?
>>>>>>>> Is there any way to run this full test suite under the problematic MPI and
>>>>>>>> see if it detects any problems?
>>>>>>>>
>>>>>>>> Is so, could someone add it to the FAQ in the debugging
>>>>>>>> section?
>>>>>>>>
>>>>>>> MPICH does have a test suite. It is in the subdir test/mpi of the
>>>>>>> downloaded mpich
>>>>>>> <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>.
>>>>>>> It annoyed me since it is not user-friendly. It might be helpful in
>>>>>>> catching bugs at very small scale, but if, say, I want to test allreduce on
>>>>>>> 1024 ranks with 100 doubles, I have to hack the test suite.
>>>>>>> Anyway, here are the instructions.
>>>>>>>
>>>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>>>> $./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi
>>>>>>> --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast
>>>>>>> --disable-spawn --disable-cxx --disable-ft-tests // It is weird I disabled
>>>>>>> cxx but I had to set CXX!
>>>>>>> $make -k -j8 // -k is to keep going and ignore compilation errors,
>>>>>>> e.g., when building tests for MPICH extensions not in MPI standard, but
>>>>>>> your MPI is OpenMPI.
>>>>>>> $ // edit testlist, remove lines mpi_t, rma, f77, impls. Those are
>>>>>>> sub-dirs containing tests for MPI routines Petsc does not rely on.
>>>>>>> $ make testings or directly './runtests -tests=testlist'
>>>>>>>
>>>>>>> On a batch system,
>>>>>>> $export MPITEST_BATCHDIR=`pwd`/btest // specify a batch dir,
>>>>>>> say btest,
>>>>>>> $./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist // Use
>>>>>>> 1024 ranks if a test does not specify the number of processes.
>>>>>>> $ // It copies test binaries to the batch dir and generates a
>>>>>>> script runtests.batch there. Edit the script to fit your batch system and
>>>>>>> then submit a job and wait for its finish.
>>>>>>> $ cd btest && ../checktests --ignorebogus
>>>>>>>
>>>>>>>
>>>>>>> PS: Fande, the fact that changing the MPI fixed your problem does not
>>>>>>> necessarily mean the old MPI has bugs. It is complicated. It could be a
>>>>>>> PETSc bug. You need to provide us with a code that reproduces your error. It
>>>>>>> does not matter if the code is big.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong.jd at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Trace could look like this:
>>>>>>>>
>>>>>>>> [640]PETSC ERROR: --------------------- Error Message
>>>>>>>> --------------------------------------------------------------
>>>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed
>>>>>>>> 740521
>>>>>>>> [640]PETSC ERROR: See
>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>>>>> shooting.
>>>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18
>>>>>>>> by wangy2 Sun Jul 19 17:14:28 2020
>>>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1
>>>>>>>> --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1
>>>>>>>> --download-metis=1 --download-ptscotch=1 --download-parmetis=1
>>>>>>>> --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1
>>>>>>>> --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11
>>>>>>>> --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices
>>>>>>>> --download-mumps=0
>>>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line
>>>>>>>> 901 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line
>>>>>>>> 3180 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in
>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>>>>>
>>>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am not entirely sure what is happening, but we encountered
>>>>>>>>> similar issues recently. It was not reproducible. It might occur at
>>>>>>>>> different stages, and the errors could be weird ones other than the "ctable
>>>>>>>>> stuff." Our code was Valgrind clean, since every PR in MOOSE needs to go
>>>>>>>>> through rigorous Valgrind checks before it reaches the devel branch. The
>>>>>>>>> errors happened when we used mvapich.
>>>>>>>>>
>>>>>>>>> We changed to HPE MPT (a vendor-installed MPI), and then everything
>>>>>>>>> was smooth. Could you try a different MPI? It is better to try one
>>>>>>>>> provided by the system.
>>>>>>>>>
>>>>>>>>> We have not gotten to the bottom of this problem yet, but we at least
>>>>>>>>> know it is somehow MPI-related.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Fande,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <chris at resfrac.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am hitting a bug in PETSc that produces the following error
>>>>>>>>>> message:
>>>>>>>>>>
>>>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in
>>>>>>>>>> /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than
>>>>>>>>>> largest key allowed 5693
>>>>>>>>>>
>>>>>>>>>> This is with petsc-3.13.2, compiled and run with MPICH, with -O3,
>>>>>>>>>> debugging turned off, and tuned to the Haswell architecture. The error
>>>>>>>>>> occurs either before or during a KSPBCGS solve/setup, or during a MUMPS
>>>>>>>>>> factorization solve (I haven't been able to replicate the issue with the
>>>>>>>>>> same set of instructions, etc.).
>>>>>>>>>>
>>>>>>>>>> This is a terrible way to ask a question, I know, and not very
>>>>>>>>>> helpful from your side, but this is what I have from a user's run and can't
>>>>>>>>>> reproduce on my end (either with the optimization compilation or with
>>>>>>>>>> debugging turned on). This happens when the code has run for quite some
>>>>>>>>>> time and is happening somewhat rarely.
>>>>>>>>>>
>>>>>>>>>> More than likely I am using a static variable (code is written in
>>>>>>>>>> c++) that I'm not updating when the matrix size is changing or something
>>>>>>>>>> silly like that, but any help or guidance on this would be appreciated.
>>>>>>>>>>
>>>>>>>>>> *Chris Hewson*
>>>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>>>> ResFrac
>>>>>>>>>> +1.587.575.9792
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
>
>