[petsc-users] error when solving a linear system with gmres + pilut/euclid

Barry Smith bsmith at petsc.dev
Tue Aug 25 08:23:27 CDT 2020


  Sounds like it might be a compiler problem generating bad code. 

  On the machine where it fails you can run with -fp_trap to have it error out as soon as a NaN or Inf appears. If you can use a debugger on that machine, you can tell the debugger to catch floating point exceptions and see the exact line and the values of the variables where a NaN or Inf appears.

   As Matt conjectured, it is likely there is a divide by zero before PETSc detects it, and it may be helpful to find out exactly where that happens.
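
   If it is more convenient than the command line option, here is a minimal sketch of enabling the trap programmatically around the solve (assumed usage; SolveWithTrap is just an illustrative name):

      #include <petscksp.h>

      /* Trap floating point exceptions only for the duration of the solve,
         so execution stops at the first NaN, Inf, or divide by zero. */
      PetscErrorCode SolveWithTrap(KSP ksp, Vec b, Vec x)
      {
        PetscErrorCode ierr;

        PetscFunctionBeginUser;
        ierr = PetscFPTrapPush(PETSC_FP_TRAP_ON);CHKERRQ(ierr);
        ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
        ierr = PetscFPTrapPop();CHKERRQ(ierr);
        PetscFunctionReturn(0);
      }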

  Barry


> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
> 
> Yes, Barry, that is correct.
> 
> 
> 
> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <bsmith at petsc.dev> wrote:
> 
>   On one system you get this error, on another system with the identical code and test case you do not get the error?
> 
>   You get it with three iterative methods but not with MUMPS?
> 
> Barry
> 
> 
>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
>> 
>> Hello Barry, Matthew, thanks for the replies !
>> 
>> Yes, it is our custom code, and it also happens when setting -pc_type bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>> 
>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any problem.
>> 
>> Trying to reproduce the situation on my computer, I was able to reproduce the error for a small case with -pc_type bjacobi. For that particular case, when running on the cluster the error appears at the very last iteration:
>> 
>> =====
>> 27 KSP Residual norm 8.230378644666e-06 
>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [0]PETSC ERROR: Invalid argument
>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>> ====
>> 
>> whereas running on my computer the error is not raised and convergence is reached instead:
>> 
>> ====
>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>> ====
>> 
>> I will run valgrind to look for possible memory corruption.
>> 
>> thank you
>> Alfredo
>> 
>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <bsmith at petsc.dev> wrote:
>> 
>>    Oh yes, it could happen with a NaN.
>> 
>>    KSPGMRESClassicalGramSchmidtOrthogonalization() calls KSPCheckDot(ksp,lhh[j]); so it should detect any NaN that appears and set ksp->reason, but the call to VecMAXPY() is still made before returning, hence producing the error message.
>> 
>>    We should short-circuit the orthogonalization as soon as it sees a NaN/Inf and return immediately so that GMRES can clean up and produce a very useful error message.
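>> 
>>    Until then, a minimal sketch of what can be checked from the application side (assumed usage, not an excerpt of PETSc itself): after KSPSolve() returns, query the converged reason and look for KSP_DIVERGED_NANORINF.
>> 
>>      #include <petscksp.h>
>> 
>>      /* Report whether the Krylov solve stopped because a NaN or Inf was detected. */
>>      PetscErrorCode CheckForNanInf(KSP ksp)
>>      {
>>        KSPConvergedReason reason;
>>        PetscErrorCode     ierr;
>> 
>>        PetscFunctionBeginUser;
>>        ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
>>        if (reason == KSP_DIVERGED_NANORINF) {
>>          ierr = PetscPrintf(PETSC_COMM_WORLD, "KSP stopped on a NaN or Inf\n");CHKERRQ(ierr);
>>        }
>>        PetscFunctionReturn(0);
>>      }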
>> 
>>   Alfredo,
>> 
>>     It is also possible that the hypre preconditioners are producing a NaN because your matrix is too difficult for them to handle, but it would be odd for that to happen after many iterations.
>> 
>>    As I suggested before, run with -pc_type bjacobi to see if you get the same problem.
>> 
>>   Barry
>> 
>> 
>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>> 
>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <bsmith at petsc.dev> wrote:
>>> 
>>>    Alfredo,
>>> 
>>>       This should never happen. The input to the VecMAXPY in GMRES is computed via VecMDot, which produces the same result on all processes.
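>>> 
>>>       For context, a minimal, hedged sketch of that pattern (illustrative names, not the actual PETSc GMRES source): the coefficients handed to VecMAXPY come from a collective VecMDot, so every rank sees identical values unless a NaN/Inf has crept in.
>>> 
>>>         #include <petscvec.h>
>>> 
>>>         /* Orthogonalization-style update: w <- w - sum_k coef[k] * basis[k].
>>>            Assumes n <= 64 for this sketch. */
>>>         PetscErrorCode OrthogonalizeSketch(Vec w, PetscInt n, Vec basis[])
>>>         {
>>>           PetscScalar    coef[64];
>>>           PetscInt       k;
>>>           PetscErrorCode ierr;
>>> 
>>>           PetscFunctionBeginUser;
>>>           ierr = VecMDot(w, n, basis, coef);CHKERRQ(ierr);   /* collective: identical coef[] on every rank */
>>>           for (k = 0; k < n; k++) coef[k] = -coef[k];
>>>           ierr = VecMAXPY(w, n, coef, basis);CHKERRQ(ierr);  /* argument #3 is the array PETSc checks across ranks */
>>>           PetscFunctionReturn(0);
>>>         }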
>>> 
>>>        If you run with -pc_type bjacobi does it also happen?
>>> 
>>>        Is this your custom code or does it happen in PETSc examples also? Like src/snes/tutorials/ex19 -da_refine 5 
>>> 
>>>       Could be memory corruption, can you run under valgrind?
>>> 
>>> Couldn't it happen if something generates a NaN? That also should not happen, but I was allowing that pilut might do it.
>>> 
>>>   Thanks,
>>> 
>>>     Matt
>>>  
>>>     Barry
>>> 
>>> 
>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <ajaramillopalma at gmail.com> wrote:
>>> > 
>>> > Dear PETSc developers,
>>> > 
>>> > I'm trying to solve a linear problem with GMRES preconditioned with pilut from HYPRE. For this I'm using the options:
>>> > 
>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
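>>> > 
>>> > (For reference, the equivalent programmatic setup, as a minimal sketch with assumed names rather than our actual application code:
>>> > 
>>> >     #include <petscksp.h>
>>> > 
>>> >     PetscErrorCode SetupSolver(KSP ksp)
>>> >     {
>>> >       PC             pc;
>>> >       PetscErrorCode ierr;
>>> > 
>>> >       PetscFunctionBeginUser;
>>> >       ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
>>> >       ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
>>> >       ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);
>>> >       ierr = PCHYPRESetType(pc, "pilut");CHKERRQ(ierr);  /* or "euclid" */
>>> >       PetscFunctionReturn(0);
>>> >     }
>>> > 
>>> > followed by KSPSetFromOptions() so that the command line options above still take effect.)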
>>> > 
>>> > If I use a single core, GMRES (+ pilut or euclid) converges. However, when using multiple cores the following error appears after some number of iterations:
>>> > 
>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>>> > 
>>> > coming from the function VecMAXPY. I attached a screenshot with more detailed output. The same happens when using euclid. Could you please give me some insight into this?
>>> > 
>>> > best regards
>>> > Alfredo
>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/
>> 
> 


