[petsc-users] KSP breakdown in specific cluster (update)

Wed Apr 23 04:55:43 CDT 2014

Hi,

Just an update: I managed to compare the values by reducing the
problem size to a hundred-plus values. The matrix and vector are almost
the same as the output from my Win7 machine.
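
For anyone wanting to repeat this kind of check, here is a minimal sketch of how the comparison could be automated rather than done by eye. It assumes the matrix or vector entries from each machine have been dumped as plain whitespace-separated numbers. The function names, the sample values, and the relative tolerance of 1e-10 are all made up for illustration and are not taken from my actual code or output.

```python
import math


def parse_values(text):
    """Parse whitespace-separated floating-point numbers from a text dump."""
    # Fortran may print exponents with 'D' (e.g. 1.0D-03); Python expects 'E'.
    return [float(tok.replace("D", "E").replace("d", "e")) for tok in text.split()]


def compare(a, b, rtol=1e-10):
    """Return (index, a_value, b_value) for every entry that differs.

    An entry is reported if either value is NaN/Inf, or if the two values
    differ by more than rtol relative to the larger magnitude.
    """
    if len(a) != len(b):
        raise ValueError("dumps have different lengths: %d vs %d" % (len(a), len(b)))
    mismatches = []
    for i, (x, y) in enumerate(zip(a, b)):
        if not (math.isfinite(x) and math.isfinite(y)):
            mismatches.append((i, x, y))  # non-finite values are always suspicious
            continue
        scale = max(abs(x), abs(y))
        if abs(x - y) > rtol * scale:
            mismatches.append((i, x, y))
    return mismatches


# Hypothetical sample data standing in for the two machines' dumps.
linux = parse_values("1.0D+00 2.5E-01 0.0 NaN")
win7 = parse_values("1.0D+00 2.5000001E-01 0.0 3.0")

for index, x, y in compare(linux, win7):
    print(index, x, y)
```

Reporting NaN/Inf entries separately seems useful here, since a non-finite value in the right-hand side or matrix is one plausible way to end up with a floating point exception inside a dot product.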

I also tried valgrind, but it aborts almost immediately:

valgrind --leak-check=yes ./a.out
==17603== Memcheck, a memory error detector.
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17603== For more details, rerun with: -v
==17603==
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
==17603== Your program just tried to execute an instruction that Valgrind
==17603== did not recognise.  There are two possible reasons for this.
==17603== 1. Your program has a bug and erroneously jumped to a non-code
==17603==    location.  If you are running Memcheck and you just saw a
==17603==    warning about a bad jump, it's probably your program's fault.
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17603==    i.e. it's Valgrind's fault.  If you think this is the case or
==17603==    you are not sure, please let us know and we'll try to fix it.
==17603== Either way, Valgrind will now raise a SIGILL signal which will
==17603== probably kill your program.
forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine Line        Source
libifcore.so.5     0000000005DD0F0E  Unknown Unknown  Unknown
libifcore.so.5     0000000005DD0DC7  Unknown Unknown  Unknown
a.out              0000000001CB4CBB  Unknown Unknown  Unknown
a.out              00000000004093DC  Unknown Unknown  Unknown
libc.so.6          000000369141D974  Unknown Unknown  Unknown
a.out              00000000004092E9  Unknown Unknown  Unknown
==17603==
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
==17603== For counts of detected errors, rerun with: -v
==17603== searching for pointers to 8 not-freed blocks.
==17603== checked 2,340,280 bytes.
==17603==
==17603== LEAK SUMMARY:
==17603==    definitely lost: 0 bytes in 0 blocks.
==17603==      possibly lost: 0 bytes in 0 blocks.
==17603==    still reachable: 239 bytes in 8 blocks.
==17603==         suppressed: 0 bytes in 0 blocks.
==17603== Reachable blocks (those to which a pointer was found) are not shown.
==17603== To see them, rerun with: --show-reachable=yes
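
The bytes in the "unhandled instruction bytes" line can be decoded by hand. The sketch below assumes the standard x86 opcode map, in which 0F AE is the "group 15" opcode and the reg field of the following ModRM byte selects the instruction when the operand is in memory. The function name and table are only for illustration. If the decoding is right, the instruction is a legitimate one inside libifcore, which would point to reason 2 in the valgrind message (this old valgrind-3.2.1 not handling it) rather than a bad jump in my program.

```python
# Memory-operand forms of the x86 "group 15" opcode 0F AE, indexed by ModRM.reg.
GROUP15_MEM = [
    "fxsave", "fxrstor", "ldmxcsr", "stmxcsr",
    "xsave", "xrstor", "xsaveopt", "clflush",
]


def decode_0f_ae(code):
    """Decode the mnemonic and ModRM fields of a 0F AE instruction."""
    if code[0] != 0x0F or code[1] != 0xAE:
        raise ValueError("not a 0F AE instruction")
    modrm = code[2]
    mod = modrm >> 6         # addressing mode (top two bits)
    reg = (modrm >> 3) & 7   # selects the instruction within the group
    rm = modrm & 7           # base register of the memory operand
    if mod == 3:
        raise ValueError("register form (fence instructions) not handled in this sketch")
    return GROUP15_MEM[reg], mod, rm


# Bytes reported by valgrind: 0xF 0xAE 0x85 0xF0
print(decode_0f_ae([0x0F, 0xAE, 0x85, 0xF0]))  # → ('fxsave', 2, 5)
```

Here mod = 2 and rm = 5 mean a memory operand addressed as a base register plus a 32-bit displacement, and the trailing 0xF0 is the first byte of that displacement.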

Thank you

Yours sincerely,

TAY wee-beng

On 23/4/2014 5:18 PM, TAY wee-beng wrote:
> Hi,
>
> My code was found to be giving wrong answers on one of the clusters,
> even on a single processor. No error message was given. It used to
> work fine.
>
> I ran the debug version and it gives this error message:
>
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
> [0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
> [0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
> [0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Signal received!
> [0]PETSC ERROR: ------------------------------------------------------------------------
>
> It happens after KSPSolve. There was no problem on another cluster.
> How should I debug this to find the error?
>
> I tried to compare the input matrix and vector between the different
> clusters, but there are too many values.
>