[petsc-users] KSP breakdown in specific cluster (update)

Wed Apr 23 20:44:31 CDT 2014

On 24/4/2014 9:41 AM, Barry Smith wrote:
>    The numbers like
>
>          sum = 1.9762625833649862e-323
>
>          rho = 1.9762625833649862e-323
>
>          beta = 1.600807474747106e-316
>          omega = 1.6910452843641213e-315
>          d2 = 1.5718032521948665e-316
>
>     are nonsense. They would generally indicate that something is wrong, but unfortunately don’t point to exactly what is wrong.
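
As a hedged aside on why such values look like nonsense: all five are subnormal doubles (nonzero but smaller than DBL_MIN, roughly 2.2e-308), and the first is exactly the bit pattern of the integer 4 read as a double. That is consistent with, though not proof of, uninitialized or mistyped memory. The sketch below is plain C, not PETSc code; the helper names `double_bits` and `classify_double` are made up for illustration.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 64-bit pattern of a double. */
static uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to reinterpret */
    return bits;
}

/* Hypothetical helper: name the IEEE 754 class of a double. */
static const char *classify_double(double x)
{
    switch (fpclassify(x)) {
    case FP_NAN:       return "nan";
    case FP_INFINITE:  return "infinite";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    default:           return "normal";
    }
}
```

For example, `double_bits(1.9762625833649862e-323)` yields 4, and `classify_double(1.600807474747106e-316)` yields "subnormal".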

Hi,

In that case, how do I troubleshoot? Any suggestions?

Thanks.
>
>
>     Barry
>
> On Apr 23, 2014, at 8:12 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>> On 23/4/2014 6:00 PM, Matthew Knepley wrote:
>>> On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>> Hi,
>>>
>>> Just to update: I managed to compare the values by reducing the problem size to a hundred-plus values. The matrix and vector are almost the same as in my Win7 output.
>>>
>>> Run in the debugger and get a stack trace.
>> Hi,
>>
>> I used the -start_in_debugger option and it stops at this point:
>>
>> Program received signal SIGFPE, Arithmetic exception.
>> VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
>> 71          ierr = PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);
>> (gdb) where
>> #0  VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
>> #1  0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0,
>>      z=0x7fff24cd7f40) at pbvec.c:15
>> #2  0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0,
>>      val=0x7fff24cd7f40) at rvector.c:128
>> #3  0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
>> #4  0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
>>      at itfunc.c:441
>> #5  0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38,
>>      __ierr=0x384d8b8) at itfuncf.c:219
>> #6  0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
>> #7  0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
>> #8  0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
>> #9  0x00000000004093dc in main ()
>> (gdb)
>>
>> Is this what you mean by a stack trace?
>>
>> I have also used "bt full" and I have attached a more detailed output.
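
The trace places the exception in the dot product called from the BiCGStab solve, which suggests (without proving) that one of the two vectors already holds bad data on entry. A minimal sketch of scanning a raw array for suspicious entries follows. It assumes the values have already been copied into a plain C array; `first_suspect_entry` is a hypothetical helper, not a PETSc routine.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first entry that is NaN,
 * infinite, or subnormal, or -1 if every entry is zero or a normal number. */
static ptrdiff_t first_suspect_entry(const double *v, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        switch (fpclassify(v[i])) {
        case FP_NAN:
        case FP_INFINITE:
        case FP_SUBNORMAL:
            return (ptrdiff_t)i;
        default:
            break;   /* FP_ZERO and FP_NORMAL are treated as fine */
        }
    }
    return -1;
}
```

Calling this on each input just before the solve would show whether the bad values come from the application or appear inside the solver.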
>>>     Matt
>>>   
>>> I also tried valgrind, but it aborts almost immediately:
>>>
>>> valgrind --leak-check=yes ./a.out
>>> ==17603== Memcheck, a memory error detector.
>>> ==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
>>> ==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
>>> ==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
>>> ==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
>>> ==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
>>> ==17603== For more details, rerun with: -v
>>> ==17603==
>>> --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
>>> --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
>>> vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
>>> ==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
>>> ==17603== Your program just tried to execute an instruction that Valgrind
>>> ==17603== did not recognise.  There are two possible reasons for this.
>>> ==17603== 1. Your program has a bug and erroneously jumped to a non-code
>>> ==17603==    location.  If you are running Memcheck and you just saw a
>>> ==17603==    warning about a bad jump, it's probably your program's fault.
>>> ==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
>>> ==17603==    i.e. it's Valgrind's fault.  If you think this is the case or
>>> ==17603==    you are not sure, please let us know and we'll try to fix it.
>>> ==17603== Either way, Valgrind will now raise a SIGILL signal which will
>>> ==17603== probably kill your program.
>>> forrtl: severe (168): Program Exception - illegal instruction
>>> Image              PC                Routine Line        Source
>>> libifcore.so.5     0000000005DD0F0E  Unknown Unknown  Unknown
>>> libifcore.so.5     0000000005DD0DC7  Unknown Unknown  Unknown
>>> a.out              0000000001CB4CBB  Unknown Unknown  Unknown
>>> a.out              00000000004093DC  Unknown Unknown  Unknown
>>> libc.so.6          000000369141D974  Unknown Unknown  Unknown
>>> a.out              00000000004092E9  Unknown Unknown  Unknown
>>> ==17603==
>>> ==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
>>> ==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
>>> ==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
>>> ==17603== For counts of detected errors, rerun with: -v
>>> ==17603== searching for pointers to 8 not-freed blocks.
>>> ==17603== checked 2,340,280 bytes.
>>> ==17603==
>>> ==17603== LEAK SUMMARY:
>>> ==17603==    definitely lost: 0 bytes in 0 blocks.
>>> ==17603==      possibly lost: 0 bytes in 0 blocks.
>>> ==17603==    still reachable: 239 bytes in 8 blocks.
>>> ==17603==         suppressed: 0 bytes in 0 blocks.
>>> ==17603== Reachable blocks (those to which a pointer was found) are not shown.
>>> ==17603== To see them, rerun with: --show-reachable=yes
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 23/4/2014 5:18 PM, TAY wee-beng wrote:
>>> Hi,
>>>
>>> My code gives wrong answers on one of the clusters, even on a single processor. No error message is given. It used to work fine.
>>>
>>> I ran the debug version and it gives this error message:
>>>
>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [0]PETSC ERROR: likely location of problem given in stack below
>>> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
>>> [0]PETSC ERROR:       is given.
>>> [0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
>>> [0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
>>> [0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
>>> [0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
>>> [0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
>>> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
>>> [0]PETSC ERROR: Signal received!
>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>
>>> It happens after KSPSolve. There was no problem on the other clusters. So how should I debug to find the error?
>>>
>>> I tried to compare the input matrix and vector between the different clusters, but there are too many values.
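
When there are too many values to compare by eye, the comparison can be automated. A minimal sketch, assuming both runs' values have been loaded into plain arrays of equal length (how they are exported is left open); `count_mismatches` and its tolerance parameters are hypothetical, not part of PETSc:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: count entries where a[i] and b[i] differ by more than
 * atol + rtol * max(|a[i]|, |b[i]|). Optionally reports the index of the
 * first mismatch and the largest absolute difference seen. */
static size_t count_mismatches(const double *a, const double *b, size_t n,
                               double rtol, double atol,
                               size_t *first_bad, double *max_abs_diff)
{
    size_t bad = 0;
    double worst = 0.0;

    for (size_t i = 0; i < n; ++i) {
        double diff = fabs(a[i] - b[i]);
        double fa = fabs(a[i]), fb = fabs(b[i]);
        double scale = (fa > fb) ? fa : fb;

        /* The negated test also flags NaN and infinite entries, because
         * comparisons involving NaN are always false. */
        if (!(diff <= atol + rtol * scale)) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            ++bad;
        }
        if (diff > worst)   /* a NaN diff never updates the maximum */
            worst = diff;
    }
    if (max_abs_diff != NULL)
        *max_abs_diff = worst;
    return bad;
}
```

The index of the first mismatch is usually more useful than the count, since it points at the row or entry where the two machines begin to disagree.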
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>> <stack.txt>