[petsc-users] KSP breakdown in specific cluster (update)
TAY wee-beng
zonexo at gmail.com
Wed Apr 23 04:55:43 CDT 2014
Hi,
Just an update: I managed to compare the values by reducing the
problem size to a hundred-plus values. The matrix and vector are almost
the same as in my Win7 output.
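
For reference, a minimal sketch of how such a comparison could be
scripted instead of checked by eye. It assumes each run dumps its
matrix or vector entries as plain whitespace-separated numbers in a
text file; the function names and tolerances are hypothetical, and
Fortran "D" exponents (e.g. 1.0D+00) would need converting first.

```python
import math


def read_values(path):
    """Read whitespace-separated floating point numbers from a text file."""
    with open(path) as fh:
        return [float(tok) for line in fh for tok in line.split()]


def compare(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Return (index, a[i], b[i]) for entries that differ or are not finite.

    The tolerances are illustrative defaults, not values from the thread.
    """
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    bad = []
    for i, (x, y) in enumerate(zip(a, b)):
        if not (math.isfinite(x) and math.isfinite(y)):
            # NaN or Inf on either side is always reported.
            bad.append((i, x, y))
        elif not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
            bad.append((i, x, y))
    return bad
```

Calling compare(read_values("cluster.txt"), read_values("win7.txt"))
(both file names hypothetical) lists only the entries that differ, so
"almost the same" can be narrowed down to specific indices.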
I also tried valgrind, but it aborts almost immediately:
valgrind --leak-check=yes ./a.out
==17603== Memcheck, a memory error detector.
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17603== For more details, rerun with: -v
==17603==
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
==17603== Your program just tried to execute an instruction that Valgrind
==17603== did not recognise. There are two possible reasons for this.
==17603== 1. Your program has a bug and erroneously jumped to a non-code
==17603== location. If you are running Memcheck and you just saw a
==17603== warning about a bad jump, it's probably your program's fault.
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17603== i.e. it's Valgrind's fault. If you think this is the case or
==17603== you are not sure, please let us know and we'll try to fix it.
==17603== Either way, Valgrind will now raise a SIGILL signal which will
==17603== probably kill your program.
forrtl: severe (168): Program Exception - illegal instruction
Image PC Routine Line Source
libifcore.so.5 0000000005DD0F0E Unknown Unknown Unknown
libifcore.so.5 0000000005DD0DC7 Unknown Unknown Unknown
a.out 0000000001CB4CBB Unknown Unknown Unknown
a.out 00000000004093DC Unknown Unknown Unknown
libc.so.6 000000369141D974 Unknown Unknown Unknown
a.out 00000000004092E9 Unknown Unknown Unknown
==17603==
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
==17603== For counts of detected errors, rerun with: -v
==17603== searching for pointers to 8 not-freed blocks.
==17603== checked 2,340,280 bytes.
==17603==
==17603== LEAK SUMMARY:
==17603== definitely lost: 0 bytes in 0 blocks.
==17603== possibly lost: 0 bytes in 0 blocks.
==17603== still reachable: 239 bytes in 8 blocks.
==17603== suppressed: 0 bytes in 0 blocks.
==17603== Reachable blocks (those to which a pointer was found) are not shown.
==17603== To see them, rerun with: --show-reachable=yes
Thank you
Yours sincerely,
TAY wee-beng
On 23/4/2014 5:18 PM, TAY wee-beng wrote:
> Hi,
>
> My code was found to be giving wrong answers on one of the clusters,
> even on a single processor. No error message was given. It used to
> work fine.
>
> I ran the debug version and it gives this error message:
>
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [0]PETSC ERROR: INSTEAD the line number of the start of the function
> [0]PETSC ERROR: is given.
> [0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
> [0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
> [0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
> [0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Signal received!
> [0]PETSC ERROR: ------------------------------------------------------------------------
>
> It happens after KSPSolve. There was no problem on the other cluster.
> How should I debug this to find the error?
>
> I tried to compare the input matrix and vector between the two
> clusters, but there are too many values.
>