[petsc-users] Tough to reproduce petsctablefind error

Chris Hewson chris at resfrac.com
Mon Jul 20 18:00:00 CDT 2020


Do not use mpich v3.3a2, which is an alpha version released in 2016.  Use
current stable release mpich-3.3.2
- Thanks Junchao, that also fits with Fande's observations. I will
give this a try and see how it goes.
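
As a sanity check, here is a minimal sketch (plain MPI, nothing PETSc-specific)
that prints which MPI library is actually linked at run time, in case more
than one build is installed; this is illustrative only:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  char version[MPI_MAX_LIBRARY_VERSION_STRING];
  int  len, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* Reports the MPI implementation and version that is actually linked
     at run time, which may differ from what was on the configure line. */
  MPI_Get_library_version(version, &len);
  if (rank == 0) printf("%s\n", version);
  MPI_Finalize();
  return 0;
}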

*Chris Hewson*
Senior Reservoir Simulation Engineer
ResFrac
+1.587.575.9792


On Mon, Jul 20, 2020 at 4:20 PM Fande Kong <fdkong.jd at gmail.com> wrote:

>
>
> On Mon, Jul 20, 2020 at 1:14 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> This is indeed a nasty bug, but having two separate cases to compare should be useful.
>>
>> Chris is using Haswell; what MPI are you using? I trust you are not using
>> Moose.
>>
>> Fande what machine/MPI are you using?
>>
>
> #define PETSC_MPICC_SHOW
> "/apps/local/spack/software/gcc-4.8.5/gcc-9.2.0-bxc7mvbmrfcrusa6ij7ux3exfqabmq5y/bin/gcc
> -I/apps/local/mvapich2/2.3.3-gcc-9.2.0/include
> -L/apps/local/mvapich2/2.3.3-gcc-9.2.0/lib -Wl,-rpath
> -Wl,/apps/local/mvapich2/2.3.3-gcc-9.2.0/lib -Wl,--enable-new-dtags -lmpi"
>
> I guess it is mvapich2-2.3.3.
>
> Here is the machine configuration https://www.top500.org/system/179708/
>
>
> BTW (if you missed my earlier posts), if I switch to MPT-MPI (a
> vendor-installed MPI), everything runs well so far.
>
> I will stick with MPT from now on.
>
> Thanks,
>
> Fande,
>
>
>
>>
>> On Mon, Jul 20, 2020 at 3:04 PM Chris Hewson <chris at resfrac.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Chris: It sounds like you just have one matrix that you give to MUMPS.
>>> You seem to be creating a matrix in the middle of your run. Are you doing
>>> dynamic adaptivity?
>>> - I have 2 separate matrices that I give to MUMPS, but as this is
>>> happening in the production build of my code, I can't determine with
>>> certainty which call to MUMPS, KSPBCGS, or UMFPACK it's happening in.
>>>
>>> I do destroy and recreate matrices in the middle of my runs, but this
>>> happens multiple times before the fault occurs, and (presumably) in the
>>> same way. I also check the matrix sizes and what I am sending to PETSc,
>>> and those checks all pass; it's just that at some point there is a size
>>> mismatch somewhere. Understandably, this is not a lot to go on. I am not
>>> doing dynamic adaptivity; the mesh is instead changing its size.
>>>
>>> And I agree with Fande that the most frustrating part is that it's not
>>> reproducible, but I'm not 100% sure that the problem lies within the
>>> PETSc code base either.
>>>
>>> Current working theories are:
>>> 1. Some sort of MPI problem with the sending of one of the matrix
>>> elements (using mpich version 3.3a2)
>>> 2. The memory behind some static pointers gets corrupted, although in
>>> that case I would expect a garbage number and not something that could
>>> plausibly make sense.
>>>
>>> *Chris Hewson*
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>>
>>>
>>> On Mon, Jul 20, 2020 at 12:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jul 20, 2020 at 2:36 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Just to be clear, I do not think it is related to GAMG or PtAP. It is
>>>>> a communication issue:
>>>>>
>>>>
>>>> Your stack trace was from PtAP, but Chris's problem is not.
>>>>
>>>>
>>>>>
>>>>> Reran the same code, and I just got :
>>>>>
>>>>> [252]PETSC ERROR: --------------------- Error Message
>>>>> --------------------------------------------------------------
>>>>> [252]PETSC ERROR: Petsc has generated inconsistent data
>>>>> [252]PETSC ERROR: Received vector entry 4469094877509280860 out of
>>>>> local range [255426072,256718616)
>>>>>
>>>>
>>>> OK, now this (4469094877509280860) is clearly garbage. That is the
>>>> important thing. I have to think your MPI is buggy.
>>>>
>>>>
>>>>
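For reference, here is a minimal sketch of what the "out of local range"
check above is comparing against: each rank owns a contiguous index range of
a standard MPI vector (queried with VecGetOwnershipRange), and any entry
destined for another rank is shipped during assembly, which is where the
receiving rank performs that range check. This is purely illustrative and
not taken from the failing code:

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x;
  PetscInt       rstart, rend, i, N = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, N);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);

  /* Each rank owns the contiguous block [rstart, rend). Entries set for
     indices outside this block are stashed and sent to the owning rank
     during VecAssemblyBegin/End; the owner then checks that each received
     index falls inside its own [rstart, rend). */
  ierr = VecGetOwnershipRange(x, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = VecSetValue(x, i, (PetscScalar)i, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
  ierr = VecAssemblyEnd(x);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}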