[petsc-users] Tough to reproduce petsctablefind error

Matthew Knepley knepley at gmail.com
Thu Sep 24 11:32:30 CDT 2020


On Thu, Sep 24, 2020 at 9:47 AM Chris Hewson <chris at resfrac.com> wrote:

> After about a month of not having this issue pop up, it has come up again
>
> We have been struggling with a similar PETSc Error for awhile now, the
> error message is as follows:
>
> [7]PETSC ERROR: PetscTableFind() line 132 in
> /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than
> largest key allowed 23988
>

The full stack would be really useful here. I am guessing this happens on
MatMult(), but I do not know.

  Thanks,

     Matt


> It is a particularly nasty bug as it doesn't reproduce itself when
> debugging and doesn't happen every time with the same inputs either. The
> problem occurs after a long runtime of the code (12-40 hours), and we are
> using a KSP solve with KSPBCGS.
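>
> The solver setup itself is nothing exotic; a minimal sketch of the calls
> involved (error checking omitted; A, b, x are assembled elsewhere):
>
>   #include <petscksp.h>
>
>   KSP ksp;
>   KSPCreate(PETSC_COMM_WORLD, &ksp);
>   KSPSetType(ksp, KSPBCGS);
>   KSPSetOperators(ksp, A, A);    /* A is the assembled parallel matrix */
>   KSPSetFromOptions(ksp);
>   KSPSolve(ksp, b, x);           /* b, x: right-hand side and solution */
>   KSPDestroy(&ksp);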
>
> The PETSc compilation options that are used are:
>
> --download-metis
>     --download-mpich
>     --download-mumps
>     --download-parmetis
>     --download-ptscotch
>     --download-scalapack
>     --download-suitesparse
>     --prefix=/opt/anl/petsc-3.13.3
>     --with-debugging=0
>     --with-mpi=1
>     COPTFLAGS=-O3 -march=haswell -mtune=haswell
>     CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
>     FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>
> This is being run across 8 processes and fails consistently on the rank 7
> process. We also use OpenMP outside of PETSc and the linear solve portion
> of the code. The rank 0 process is always computing; during this, the
> remaining processes sit in an MPI_Wait call, waiting for the collective
> parts of the code to be called again. All PETSc calls are made across all
> of the processes.
>
> We are using MPICH version 3.3.2, downloaded with the PETSc 3.13.3 package.
>
> At every PETSc call we check the error return from the function
> collectively, to ensure that no errors have been returned from PETSc.
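>
> The check is roughly the following sketch (the helper and its name are
> illustrative, not our exact code and not a PETSc routine):
>
>   #include <petscsys.h>
>
>   /* combine the local PETSc return code across the communicator so that
>      every rank agrees on whether any rank reported an error */
>   static int CheckErrCollective(MPI_Comm comm, PetscErrorCode ierr)
>   {
>     int local = (ierr != 0), global = 0;
>     MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MAX, comm);
>     return global;   /* nonzero on every rank if any rank failed */
>   }
>
>   /* usage:
>        ierr = KSPSolve(ksp, b, x);
>        if (CheckErrCollective(PETSC_COMM_WORLD, ierr)) { shut down cleanly }  */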
>
> Some possible causes that I can think of are as follows:
> 1. A memory leak causing a corruption, either in our program, in PETSc, or
> in one of the PETSc objects. This seems unlikely, as we have checked runs
> with the -malloc_dump option for PETSc and with valgrind.
>
> 2. Optimization flags set for the PETSc compilation causing variables that
> go out of scope to be optimized out.
>
> 3. We are giving the wrong number of elements for a process, or that value
> is changing during the simulation (see the sketch after this list for where
> those sizes enter PETSc). This seems unlikely, as there is nothing overly
> unique about these simulations and the problem doesn't reproduce itself.
>
> 4. An MPI channel or socket error causing an error in the collective
> values for PETSc.
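>
> As a reference point for 3., the per-process counts enter PETSc through the
> matrix (and vector) creation calls; a minimal sketch, with made-up sizes
> and preallocation numbers:
>
>   #include <petscmat.h>
>
>   Mat      A;
>   PetscInt nlocal = 1000;   /* rows/columns owned by this rank (made up) */
>   MatCreate(PETSC_COMM_WORLD, &A);
>   MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);
>   MatSetType(A, MATMPIAIJ);
>   MatMPIAIJSetPreallocation(A, 30, NULL, 10, NULL);   /* made-up nnz/row */
>   /* if nlocal ever changes between solves, the Mat (and the matching Vecs)
>      have to be destroyed and recreated with the new local sizes */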
>
> Any input on this issue would be greatly appreciated.
>
> *Chris Hewson*
> Senior Reservoir Simulation Engineer
> ResFrac
> +1.587.575.9792
>
>
> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> That is a great idea. I'll figure it out.
>> --Junchao Zhang
>>
>>
>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>   Junchao,
>>>
>>>     Is there any way in the PETSc configure to warn that an MPICH version
>>> is "bad" or "untrustworthy", or even the vaguer "we have suspicions about
>>> this version and recommend avoiding it"? A lot of time could be saved if
>>> others don't have to deal with the same problem.
>>>
>>>     Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc., always
>>> check against that list, and print a boxed warning at configure time?
>>> Better, you could somehow generalize it and put it in package.py for use by
>>> all packages; then any package could include lists of "suspect" versions.
>>> (There are definitely HDF5 versions that should be avoided :-)).
>>>
>>>     Barry
>>>
>>>
>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>> --Junchao Zhang
>>>
>>>
>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <chris at resfrac.com> wrote:
>>>
>>>> Just as an update to this, I can confirm that using the MPICH version
>>>> (3.3.2) downloaded with the PETSc download solved this issue on my end.
>>>>
>>>> *Chris Hewson*
>>>> Senior Reservoir Simulation Engineer
>>>> ResFrac
>>>> +1.587.575.9792
>>>>
>>>>
>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>>     Is there a comprehensive MPI test suite (perhaps from MPICH)? Is
>>>>>> there any way to run this full test suite under the problematic MPI and
>>>>>> see if it detects any problems?
>>>>>>
>>>>>>      If so, could someone add it to the FAQ in the debugging section?
>>>>>>
>>>>> MPICH does have a test suite. It is in the subdirectory test/mpi of the
>>>>> downloaded mpich
>>>>> <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It
>>>>> annoyed me since it is not user-friendly. It might be helpful in catching
>>>>> bugs at very small scale, but if I want to test, say, allreduce on 1024
>>>>> ranks with 100 doubles, I have to hack the test suite.
>>>>> Anyway, the instructions are below.
>>>>>
>>>>> For the purposes of PETSc, under test/mpi one can configure it with
>>>>>
>>>>> $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi \
>>>>>     --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast \
>>>>>     --disable-spawn --disable-cxx --disable-ft-tests
>>>>>   // It is weird that I disabled cxx but still had to set CXX!
>>>>> $ make -k -j8
>>>>>   // -k keeps going and ignores compilation errors, e.g., when building
>>>>>   // tests for MPICH extensions that are not in the MPI standard but your
>>>>>   // MPI is OpenMPI.
>>>>> $ // edit testlist: remove the lines mpi_t, rma, f77, impls. Those are
>>>>>   // sub-dirs containing tests for MPI routines PETSc does not rely on.
>>>>> $ make testings    // or directly './runtests -tests=testlist'
>>>>>
>>>>> On a batch system,
>>>>>
>>>>> $ export MPITEST_BATCHDIR=`pwd`/btest
>>>>>   // specify a batch dir, say btest
>>>>> $ ./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist
>>>>>   // uses 1024 ranks if a test does not specify the number of processes
>>>>> $ // This copies the test binaries to the batch dir and generates a script
>>>>>   // runtests.batch there. Edit the script to fit your batch system, then
>>>>>   // submit a job and wait for it to finish.
>>>>> $ cd btest && ../checktests --ignorebogus
>>>>>
>>>>>
>>>>> PS: Fande, the fact that changing the MPI fixed your problem does not
>>>>> necessarily mean the old MPI has bugs. It is complicated; it could be a
>>>>> PETSc bug. You need to provide us with a code that reproduces your error.
>>>>> It does not matter if the code is big.
>>>>>
>>>>>
>>>>>>     Thanks
>>>>>>
>>>>>>       Barry
>>>>>>
>>>>>>
>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>>>
>>>>>> Trace could look like this:
>>>>>>
>>>>>> [640]PETSC ERROR: --------------------- Error Message
>>>>>> --------------------------------------------------------------
>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed
>>>>>> 740521
>>>>>> [640]PETSC ERROR: See
>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>>>>>> shooting.
>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by
>>>>>> wangy2 Sun Jul 19 17:14:28 2020
>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1
>>>>>> --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1
>>>>>> --download-metis=1 --download-ptscotch=1 --download-parmetis=1
>>>>>> --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1
>>>>>> --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11
>>>>>> --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices
>>>>>> --download-mumps=0
>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line
>>>>>> 901 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line
>>>>>> 3180 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in
>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>>>
>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not entirely sure what is happening, but we encountered similar
>>>>>>> issues recently. They were not reproducible, might occur at different
>>>>>>> stages, and the errors could be weirder things than the "ctable stuff."
>>>>>>> Our code was Valgrind clean, since every PR in MOOSE needs to go through
>>>>>>> rigorous Valgrind checks before it reaches the devel branch. The errors
>>>>>>> happened when we used MVAPICH.
>>>>>>>
>>>>>>> We changed to HPE MPT (a vendor-installed MPI), and then everything
>>>>>>> was smooth. Could you try a different MPI? It is better to try one
>>>>>>> provided by the system.
>>>>>>>
>>>>>>> We have not gotten to the bottom of this problem yet, but we at least
>>>>>>> know it is somehow MPI-related.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Fande,
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <chris at resfrac.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am hitting a bug in PETSc that produces the following error message:
>>>>>>>>
>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in
>>>>>>>> /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than
>>>>>>>> largest key allowed 5693
>>>>>>>>
>>>>>>>> This is with petsc-3.13.2, compiled and run using MPICH, with -O3,
>>>>>>>> debugging turned off, and tuned to the Haswell architecture. The error
>>>>>>>> occurs either before or during a KSPBCGS solve/setup, or during a MUMPS
>>>>>>>> factorization solve (I haven't been able to replicate the issue with the
>>>>>>>> same set of instructions, etc.).
>>>>>>>>
>>>>>>>> This is a terrible way to ask a question, I know, and not very
>>>>>>>> helpful from your side, but this is what I have from a user's run, and I
>>>>>>>> can't reproduce it on my end (either with the optimized build or with
>>>>>>>> debugging turned on). It happens only after the code has run for quite
>>>>>>>> some time, and it happens somewhat rarely.
>>>>>>>>
>>>>>>>> More than likely I am using a static variable (the code is written in
>>>>>>>> C++) that I'm not updating when the matrix size changes, or something
>>>>>>>> silly like that, but any help or guidance on this would be appreciated.
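>>>>>>>>
>>>>>>>> To be concrete about the kind of mistake I suspect on my side, it would
>>>>>>>> be something like this (purely illustrative, not our actual code):
>>>>>>>>
>>>>>>>>   #include <petscmat.h>
>>>>>>>>
>>>>>>>>   /* cached the first time the matrix is built */
>>>>>>>>   static PetscInt cachedLocalRows = 0;
>>>>>>>>
>>>>>>>>   void assembleRows(Mat A, PetscInt nlocal)
>>>>>>>>   {
>>>>>>>>     if (cachedLocalRows == 0) cachedLocalRows = nlocal;
>>>>>>>>     /* bug: if nlocal changes on a later rebuild, the stale value of
>>>>>>>>        cachedLocalRows is still used to generate row indices, so
>>>>>>>>        MatSetValues() can be handed indices outside the new ownership
>>>>>>>>        range */
>>>>>>>>     for (PetscInt i = 0; i < cachedLocalRows; i++) {
>>>>>>>>       /* ... MatSetValues(A, ...) using the stale count ... */
>>>>>>>>     }
>>>>>>>>   }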
>>>>>>>>
>>>>>>>> *Chris Hewson*
>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>> ResFrac
>>>>>>>> +1.587.575.9792
>>>>>>>>
>>>>>>>
>>>>>>
>>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/