[petsc-users] Tough to reproduce petsctablefind error

Matthew Knepley knepley at gmail.com
Fri Sep 25 12:37:58 CDT 2020


On Fri, Sep 25, 2020 at 1:29 PM Mark McClure <mark at resfrac.com> wrote:

> Hello,
>
> I work with Chris, and have a few comments, hopefully helpful. Thank you
> all, for your help.
>
> Our program is unfortunately behaving a little bit nondeterministically. I
> am not sure why, because I test the OpenMP parts for race conditions with
> Intel Inspector and see none. We are using Metis, not Parmetis. Does PETSc
> or Metis have any behavior that is nondeterministic? We will continue to
> investigate the cause of the nondeterminism, but it makes reproducing this
> problem a challenge: we can run the same simulation 10 times and not get
> precisely the same result each time. In fact, for this bug, Chris did run
> it 10 times and did not reproduce it.
>
> Every month, 1000s of hours of this simulator are being run on the
> Azure cluster. This crash has been occurring for months, but at infrequent
> intervals, and we have never been able to reproduce it. As such, we can't
> generate an error dump unless we gave all users a PETSc build with no
> optimization and waited weeks until a problem cropped up.
>
> Here's what we'll do - we'll internally make a build with debug and then
> run a large number of simulations until the problem reproduces and we get
> the error dump. That will take us some time, but we'll do it.
>
> Here is a bit more background that might be helpful. At first, we used
> OpenMPI 2.1.1-8 with PETSc 3.13.2. With that combo, we observed a memory
> leak, and simulations failed when the node ran out of RAM. Then we switched
> to MPICH 3.3a2 and PETSc 3.13.3. The memory leak went away. That's when we
> started seeing this "greater than largest key allowed" bug. It was
> unreproducible, but happening relatively more often than it is now (I
> think) - I was getting a couple of user reports a week. Then we updated to
> MPICH 3.3.2 and the same PETSc version (3.13.3). The problem seems to be
> less common - it hadn't happened for the past month, but then it happened
> four times this week.
>
> Other background to note - our linear system very frequently changes
> size/shape. There will be 10,000s of PETSc solves with different matrices
> (different positions of nonzero values) over the course of a simulation. As
> far as we can tell, the crash occurs only after the simulator has run
> for a long time, 10+ hours. Having said that, it does not appear to be a
> memory leak - at least, the node has plenty of remaining RAM when these
> crashes are occurring.
>

If it helps, I believe you can configure an optimized build to add -g,

  COPTFLAGS="-g" CPPOPTFLAGS="-g"

so that symbols are kept in the optimized libraries. That way we should be
able to get a stack on a run that performs well.
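
For example, one sketch that keeps the existing optimization flags and simply
adds -g (reusing the -O3/haswell options from Chris's configure below, so
treat it as illustrative) would be

  --with-debugging=0 COPTFLAGS="-O3 -march=haswell -mtune=haswell -g"
  CXXOPTFLAGS="-O3 -march=haswell -mtune=haswell -g"
  FOPTFLAGS="-O3 -march=haswell -mtune=haswell -g"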

  Thanks,

    Matt


> Mark
>
>
> On Thu, Sep 24, 2020 at 4:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> You might add code here like:
>>
>> if (ierr) { /* PetscTableFind() failed: dump the remaining column indices before erroring out */
>>   for (; i<aij->B->rmap->n; i++, j=0) {
>>     for (; j<B->ilen[i]; j++) {
>>       PetscInt gid1 = aj[B->i[i] + j] + 1; /* one-based global column id; don't overwrite ierr here */
>>       printf("[%d] gid1 %d\n", rank, (int)gid1); /* rank obtained earlier via MPI_Comm_rank() */
>>     }
>>   }
>> }
>> CHKERRQ(ierr);
>>
>> I am guessing that somehow you have a table that is bad and too small. It
>> failed on an index not much bigger than the largest key allowed. Maybe just
>> compute the max and see if it goes much larger than the largest key allowed.
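>>
>> A minimal sketch of that check (hypothetical, reusing the variables from
>> the PetscTableCreate() loop quoted further down) might be:
>>
>>   PetscInt maxgid = 0;
>>   for (i=0; i<aij->B->rmap->n; i++) {
>>     for (j=0; j<B->ilen[i]; j++) {
>>       PetscInt gid1 = aj[B->i[i] + j] + 1;  /* one-based global column id */
>>       if (gid1 > maxgid) maxgid = gid1;
>>     }
>>   }
>>   /* the table is created with maxkey = mat->cmap->N+1, so compare against that */
>>   ierr = PetscPrintf(PETSC_COMM_SELF,"max gid1 %D, mat->cmap->N+1 %D\n",maxgid,mat->cmap->N+1);CHKERRQ(ierr);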
>>
>> If your mesh just changed, do you know if it got bigger or smaller...
>>
>> Anyway just some thoughts,
>> Mark
>>
>>
>>
>> On Thu, Sep 24, 2020 at 4:18 PM Barry Smith <bsmith at petsc.dev> wrote:
>>
>>>
>>>
>>> On Sep 24, 2020, at 2:47 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>> On Thu, Sep 24, 2020 at 3:42 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>>   The stack is listed below. It crashes inside MatPtAP().
>>>>
>>>
>>> What about just checking that the column indices that PtAP receives are
>>> valid? Are we not doing that?
>>>
>>>
>>>   The code that checks for a column index that is too large in
>>> MatSetValues_MPIXXAIJ() is turned off for optimized builds; I am making an
>>> MR to always have it on. But I doubt this is the problem - other, harsher
>>> crashes are likely if the column index is too large.
>>>
>>>   This is difficult to debug because all we get is a stack trace. It
>>> would be good if we produced some information about the current state of
>>> the objects when the error is detected. We should think about what
>>> light-weight information we could report when errors are detected.
>>>
>>>
>>> Barry
>>>
>>>
>>>    Matt
>>>
>>>
>>>>   It is possible there is some subtle bug in the rather complex PETSc
>>>> code for MatPtAP(), but I am inclined to blame MPI again.
>>>>
>>>>   I think we should add some simple, low-overhead, always-on
>>>> communication error detection code to PetscSF, where checksums are
>>>> also communicated at the highest level of PetscSF().
>>>>
>>>>    I don't know exactly how, but perhaps when the data is packed per
>>>> destination rank a checksum is computed, and when it is unpacked the
>>>> checksum is compared, using extra space at the end of the communicated
>>>> packed array to store and send the checksum. Yes, it is kind of odd for a
>>>> high-level library like PETSc not to trust the communication channel, but
>>>> MPI implementations have proven themselves untrustworthy, and adding this
>>>> to PetscSF is not intrusive to the PETSc API or the user. Plus it gives a
>>>> definitive yes or no as to whether the problem comes from an error in the
>>>> communication.
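>>>>
>>>>    Purely as an illustration (this is not existing PetscSF code, and the
>>>> helper name is made up), the per-buffer check could be as simple as
>>>>
>>>>   /* hypothetical helper: fold the packed bytes for one destination rank
>>>>      into a checksum that is appended to the buffer before it is sent and
>>>>      recomputed/compared after it is received */
>>>>   static PetscInt64 SFPackChecksum(const unsigned char *buf,size_t nbytes)
>>>>   {
>>>>     PetscInt64 sum = 0;
>>>>     size_t     k;
>>>>     for (k=0; k<nbytes; k++) sum = sum*31 + buf[k];
>>>>     return sum;
>>>>   }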
>>>>
>>>>   Barry
>>>>
>>>> On Sep 24, 2020, at 12:35 PM, Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>> On Thu, Sep 24, 2020 at 1:22 PM Chris Hewson <chris at resfrac.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> Thanks for all of the prompt responses, very helpful and appreciated.
>>>>>
>>>>> By "when debugging", did you mean when configure
>>>>> petsc --with-debugging=1 COPTFLAGS=-O0 -g etc or when you attached a
>>>>> debugger?
>>>>> - Both, I have run with a debugger attached and detached, all compiled
>>>>> with the following flags "--with-debugging=1 COPTFLAGS=-O0 -g"
>>>>>
>>>>> 1) Try OpenMPI (probably won't help, but worth trying)
>>>>> - Worth a try for sure
>>>>>
>>>>> 2) Find which part of the simulation makes it non-deterministic. Is it
>>>>> the mesh partitioning (parmetis)? Then try to make it deterministic.
>>>>> - Good tip, it is the mesh partitioning and along the lines of a
>>>>> question from Barry, the matrix size is changing. I will make this
>>>>> deterministic and give it a try
>>>>>
>>>>> 3) Dump matrices, vectors, etc and see when it fails, you can quickly
>>>>> reproduce the error by reading in the intermediate data.
>>>>> - Also a great suggestion, will give it a try
>>>>>
>>>>> The full stack would be really useful here. I am guessing this happens
>>>>> on MatMult(), but I do not know.
>>>>> - Agreed, I am currently running it so that the full stack will be
>>>>> produced, but I am waiting for it to fail. I had compiled with all
>>>>> available optimizations on, but the downside, of course, is if there is a
>>>>> failure.
>>>>> As a general question, roughly what is the performance impact on PETSc
>>>>> with -O1/-O2/-O0 as opposed to -O3? And what is the performance impact of
>>>>> --with-debugging=1?
>>>>> It is obviously problem/machine dependent; I am wondering more about
>>>>> general guidance than anything.
>>>>>
>>>>> Is the nonzero structure of your matrices changing or is it fixed for
>>>>> the entire simulation?
>>>>> The non-zero structure is changing, although the structures are
>>>>> reformed when this happens, and this happens thousands of times before
>>>>> the failure occurs.
>>>>>
>>>>
>>>> Okay, this is the most likely spot for a bug. How are you changing the
>>>> matrix? It should be impossible to put in an invalid column index when
>>>> using MatSetValues()
>>>> because we check all the inputs. However, I do not think we check when
>>>> you just yank out the arrays.
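>>>>
>>>> As a cheap sanity check on your side (hypothetical names; A, row, ncols,
>>>> cols, vals, and Ncols stand in for whatever you actually use), you could
>>>> guard each insertion with something like
>>>>
>>>>   PetscInt k;
>>>>   for (k=0; k<ncols; k++) {
>>>>     if (cols[k] < 0 || cols[k] >= Ncols) SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_OUTOFRANGE,"column %D out of range [0,%D)",cols[k],Ncols);
>>>>   }
>>>>   ierr = MatSetValues(A,1,&row,ncols,cols,vals,INSERT_VALUES);CHKERRQ(ierr);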
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>
>>>>> Does this particular run always crash at the same place? Similar
>>>>> place? Doesn't always crash?
>>>>> Doesn't always crash, but other similar runs have crashed in different
>>>>> spots, which makes it difficult to resolve. I am going to try out a few of
>>>>> the strategies suggested above and will let you know what comes of that.
>>>>>
>>>>> *Chris Hewson*
>>>>> Senior Reservoir Simulation Engineer
>>>>> ResFrac
>>>>> +1.587.575.9792
>>>>>
>>>>>
>>>>> On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>  Chris,
>>>>>>
>>>>>>    We realize how frustrating this type of problem is to deal with.
>>>>>>
>>>>>>    Here is the code:
>>>>>>
>>>>>>       ierr =
>>>>>> PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>>>>>>     for (i=0; i<aij->B->rmap->n; i++) {
>>>>>>       for (j=0; j<B->ilen[i]; j++) {
>>>>>>         PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>>>>>>         ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>>>>>>         if (!data) {
>>>>>>           /* one based table */
>>>>>>           ierr =
>>>>>> PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>>    It is simply looping over the rows of the sparse matrix, putting
>>>>>> the columns it finds into the hash table.
>>>>>>
>>>>>>    aj[B->i[i] + j] are the column entries; the number of columns in
>>>>>> the matrix is mat->cmap->N, so the column entries should always be
>>>>>> less than the number of columns. The code is crashing on column
>>>>>> entry 24443, which is larger than the number of columns, 23988.
>>>>>>
>>>>>> So either the aj[B->i[i] + j] + 1 are incorrect or the mat->cmap->N
>>>>>> is incorrect.
>>>>>>
>>>>>> 640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>  if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>>>>>>     ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>>>>>>   }
>>>>>>
>>>>>> This seems to indicate it is setting up a new multiply because it is
>>>>>> either the first time into the algorithm or the nonzero structure changed
>>>>>> on some rank, requiring a new assembly process.
>>>>>>
>>>>>>    Is the nonzero structure of your matrices changing or is it fixed
>>>>>> for the entire simulation?
>>>>>>
>>>>>> Since the code has been running for a very long time already I have
>>>>>> to conclude that this is not the first time through and so something has
>>>>>> changed in the matrix?
>>>>>>
>>>>>> I think we have to put more diagnostics into the library to provide
>>>>>> more information before or at the time of the error detection.
>>>>>>
>>>>>>    Does this particular run always crash at the same place? Similar
>>>>>> place? Doesn't always crash?
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 24, 2020, at 8:46 AM, Chris Hewson <chris at resfrac.com> wrote:
>>>>>>
>>>>>> After about a month of not having this issue pop up, it has come up
>>>>>> again.
>>>>>>
>>>>>> We have been struggling with a similar PETSc error for a while now;
>>>>>> the error message is as follows:
>>>>>>
>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in
>>>>>> /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than
>>>>>> largest key allowed 23988
>>>>>>
>>>>>> It is a particularly nasty bug as it doesn't reproduce itself when
>>>>>> debugging and doesn't happen all the time with the same inputs either. The
>>>>>> problem occurs after a long runtime of the code (12-40 hours), and we are
>>>>>> using a KSP solve with KSPBCGS.
>>>>>>
>>>>>> The PETSc compilation options that are used are:
>>>>>>
>>>>>> --download-metis
>>>>>>     --download-mpich
>>>>>>     --download-mumps
>>>>>>     --download-parmetis
>>>>>>     --download-ptscotch
>>>>>>     --download-scalapack
>>>>>>     --download-suitesparse
>>>>>>     --prefix=/opt/anl/petsc-3.13.3
>>>>>>     --with-debugging=0
>>>>>>     --with-mpi=1
>>>>>>     COPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>>>>     CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>>>>     FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>>>>>>
>>>>>> This is being run across 8 processes and is failing consistently on
>>>>>> the rank 7 process. We also use OpenMP outside of PETSc and the linear
>>>>>> solve portion of the code. The rank 0 process is always computing; during
>>>>>> this time the other processes use an MPI_Wait call to wait for the
>>>>>> collective parts of the code to be called again. All PETSc calls are made
>>>>>> across all of the processes.
>>>>>>
>>>>>> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3
>>>>>> package.
>>>>>>
>>>>>> At every PETSc call we are checking the error return from the
>>>>>> function collectively to ensure that no errors have been returned from
>>>>>> PETSc.
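>>>>>>
>>>>>> A minimal sketch of that kind of collective check (illustrative only;
>>>>>> the variable names are placeholders and our actual pattern differs in
>>>>>> detail) is:
>>>>>>
>>>>>>   ierr = KSPSolve(ksp,b,x);                 /* local return code */
>>>>>>   int localerr = (int)ierr, globalerr = 0;
>>>>>>   MPI_Allreduce(&localerr,&globalerr,1,MPI_INT,MPI_MAX,PETSC_COMM_WORLD);
>>>>>>   if (globalerr) { /* every rank sees the same verdict and can stop together */ }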
>>>>>>
>>>>>> Some possible causes that I can think of are as follows:
>>>>>> 1. Memory leak causing a corruption either in our program or in petsc
>>>>>> or with one of the petsc objects. This seems unlikely as we have checked
>>>>>> runs with the option -malloc_dump for PETSc and using valgrind.
>>>>>>
>>>>>> 2. Optimization flags set for petsc compilation are causing variables
>>>>>> that go out of scope to be optimized out.
>>>>>>
>>>>>> 3. We are giving the wrong number of elements for a process or the
>>>>>> value is changing during the simulation. This seems unlikely as there is
>>>>>> nothing overly unique about these simulations and it's not reproducing
>>>>>> itself.
>>>>>>
>>>>>> 4. An MPI channel or socket error causing an error in the collective
>>>>>> values for PETSc.
>>>>>>
>>>>>> Any input on this issue would be greatly appreciated.
>>>>>>
>>>>>> *Chris Hewson*
>>>>>> Senior Reservoir Simulation Engineer
>>>>>> ResFrac
>>>>>> +1.587.575.9792
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <
>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>
>>>>>>> That is a great idea. I'll figure it out.
>>>>>>> --Junchao Zhang
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsmith at petsc.dev>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>   Junchao,
>>>>>>>>
>>>>>>>>     Is there any way in the PETSc configure to warn that an MPICH
>>>>>>>> version is "bad" or "untrustworthy", or even the vaguer "we have
>>>>>>>> suspicions about this version and recommend avoiding it"? A lot of time
>>>>>>>> could be saved if others don't have to deal with the same problem.
>>>>>>>>
>>>>>>>>     Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc.,
>>>>>>>> always check against that list, and print a boxed warning at configure
>>>>>>>> time? Better yet, you could somehow generalize it and put it in package.py
>>>>>>>> for use by all packages; then any package could include lists of "suspect"
>>>>>>>> versions. (There are definitely HDF5 versions that should be avoided :-)).
>>>>>>>>
>>>>>>>>     Barry
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <
>>>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <chris at resfrac.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Just as an update to this, I can confirm that using the mpich
>>>>>>>>> version (3.3.2) downloaded with the petsc download solved this issue on my
>>>>>>>>> end.
>>>>>>>>>
>>>>>>>>> *Chris Hewson*
>>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>>> ResFrac
>>>>>>>>> +1.587.575.9792
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <
>>>>>>>>> junchao.zhang at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsmith at petsc.dev>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     Is there a comprehensive  MPI test suite (perhaps from
>>>>>>>>>>> MPICH)? Is there any way to run this full test suite under the problematic
>>>>>>>>>>> MPI and see if it detects any problems?
>>>>>>>>>>>
>>>>>>>>>>>      If so, could someone add it to the FAQ in the debugging
>>>>>>>>>>> section?
>>>>>>>>>>>
>>>>>>>>>> MPICH does have a test suite. It is in the test/mpi subdirectory
>>>>>>>>>> of the downloaded mpich
>>>>>>>>>> <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>.
>>>>>>>>>> It annoyed me since it is not user-friendly. It might be helpful in
>>>>>>>>>> catching bugs at very small scale, but if, say, I want to test allreduce
>>>>>>>>>> on 1024 ranks with 100 doubles, I have to hack the test suite.
>>>>>>>>>> Anyway, the instructions are below.
>>>>>>>>>>
>>>>>>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>>>>>>> $./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi
>>>>>>>>>> --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast
>>>>>>>>>> --disable-spawn --disable-cxx --disable-ft-tests  // It is weird I disabled
>>>>>>>>>> cxx but I had to set CXX!
>>>>>>>>>> $make -k -j8  // -k is to keep going and ignore compilation
>>>>>>>>>> errors, e.g., when building tests for MPICH extensions not in MPI standard,
>>>>>>>>>> but your MPI is OpenMPI.
>>>>>>>>>> $ // edit testlist, remove lines mpi_t, rma, f77, impls. Those
>>>>>>>>>> are sub-dirs containing tests for MPI routines Petsc does not rely on.
>>>>>>>>>> $ make testings or directly './runtests -tests=testlist'
>>>>>>>>>>
>>>>>>>>>> On a batch system,
>>>>>>>>>> $export MPITEST_BATCHDIR=`pwd`/btest       // specify a batch
>>>>>>>>>> dir, say btest,
>>>>>>>>>> $./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist   //
>>>>>>>>>> Use 1024 ranks if a test does not specify the number of processes.
>>>>>>>>>> $ // It copies test binaries to the batch dir and generates a
>>>>>>>>>> script runtests.batch there.  Edit the script to fit your batch system and
>>>>>>>>>> then submit a job and wait for its finish.
>>>>>>>>>> $ cd btest && ../checktests --ignorebogus
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> PS: Fande, the fact that changing the MPI fixed your problem does
>>>>>>>>>> not necessarily mean the old MPI has bugs. It is complicated; it could be
>>>>>>>>>> a PETSc bug. You need to provide us a code to reproduce your error. It
>>>>>>>>>> does not matter if the code is big.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>     Thanks
>>>>>>>>>>>
>>>>>>>>>>>       Barry
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong.jd at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Trace could look like this:
>>>>>>>>>>>
>>>>>>>>>>> [640]PETSC ERROR: --------------------- Error Message
>>>>>>>>>>> --------------------------------------------------------------
>>>>>>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key
>>>>>>>>>>> allowed 740521
>>>>>>>>>>> [640]PETSC ERROR: See
>>>>>>>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html for
>>>>>>>>>>> trouble shooting.
>>>>>>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named
>>>>>>>>>>> r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020
>>>>>>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1
>>>>>>>>>>> --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1
>>>>>>>>>>> --download-metis=1 --download-ptscotch=1 --download-parmetis=1
>>>>>>>>>>> --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1
>>>>>>>>>>> --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11
>>>>>>>>>>> --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices
>>>>>>>>>>> --download-mumps=0
>>>>>>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce()
>>>>>>>>>>> line 901 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce()
>>>>>>>>>>> line 3180 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in
>>>>>>>>>>> /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong.jd at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am not entirely sure what is happening, but we encountered
>>>>>>>>>>>> similar issues recently.  It was not reproducible. It might occur at
>>>>>>>>>>>> different stages, and the errors could be weird ones other than the
>>>>>>>>>>>> "ctable stuff." Our code was Valgrind clean, since every PR in moose needs
>>>>>>>>>>>> to go through rigorous Valgrind checks before it reaches the devel branch.
>>>>>>>>>>>> The errors happened when we used mvapich.
>>>>>>>>>>>>
>>>>>>>>>>>> We changed to HPE-MPT (a vendor-installed MPI), and then
>>>>>>>>>>>> everything was smooth.  Could you try a different MPI? It is better to
>>>>>>>>>>>> try a system-provided one.
>>>>>>>>>>>>
>>>>>>>>>>>> We have not gotten to the bottom of this problem yet, but we at
>>>>>>>>>>>> least know it is somehow MPI-related.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Fande,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <chris at resfrac.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am having a bug that is occurring in PETSc, with the following
>>>>>>>>>>>>> error message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in
>>>>>>>>>>>>> /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than
>>>>>>>>>>>>> largest key allowed 5693
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is using petsc-3.13.2, compiled and run with mpich,
>>>>>>>>>>>>> with -O3, debugging turned off, and tuning for the haswell architecture.
>>>>>>>>>>>>> It occurs either before or during a KSPBCGS solve/setup or during a MUMPS
>>>>>>>>>>>>> factorization solve (I haven't been able to replicate this issue with the
>>>>>>>>>>>>> same set of instructions etc.).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a terrible way to ask a question, I know, and not very
>>>>>>>>>>>>> helpful from your side, but this is what I have from a user's run, and I
>>>>>>>>>>>>> can't reproduce it on my end (either with the optimized compilation or with
>>>>>>>>>>>>> debugging turned on). This happens when the code has run for quite some
>>>>>>>>>>>>> time and is happening somewhat rarely.
>>>>>>>>>>>>>
>>>>>>>>>>>>> More than likely I am using a static variable (code is written
>>>>>>>>>>>>> in c++) that I'm not updating when the matrix size is changing or something
>>>>>>>>>>>>> silly like that, but any help or guidance on this would be appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Chris Hewson*
>>>>>>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>>>>>>> ResFrac
>>>>>>>>>>>>> +1.587.575.9792
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>
>>>>
>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
>>>
>>>
>>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>