[petsc-users] KSP breakdown in specific cluster (update)

TAY wee-beng zonexo at gmail.com
Wed Apr 23 20:12:25 CDT 2014


On 23/4/2014 6:00 PM, Matthew Knepley wrote:
> On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>     Hi,
>
>     Just an update: I managed to compare the values by reducing the
>     problem size to a hundred-plus values. The matrix and vector are
>     almost the same as in my Win7 output.
>
>
> Run in the debugger and get a stack trace.
Hi,

I used the -start_in_debugger option and the program stops at this point:

Program received signal SIGFPE, Arithmetic exception.
VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
71          ierr = PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);
(gdb) where
#0  VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
#1  0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0,
     z=0x7fff24cd7f40) at pbvec.c:15
#2  0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0,
     val=0x7fff24cd7f40) at rvector.c:128
#3  0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
#4  0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
     at itfunc.c:441
#5  0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38,
     __ierr=0x384d8b8) at itfuncf.c:219
#6  0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
#7  0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
#8  0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
#9  0x00000000004093dc in main ()
(gdb)

Is this what you mean by a stack trace?

I have also run "bt full" and attached the more detailed output.
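
One possible next check, sketched here under assumptions: since the trace
stops inside VecDot, it may be worth verifying that the values going into the
solver are finite. The helper below is hypothetical (scan_values and
ScanResult are my own names, not PETSc ones). It assumes a real-scalar build
and that the local entries have already been copied into a plain array of
doubles, for example through VecGetArray. It only counts NaN, Inf and
subnormal entries and does not by itself explain the signal.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper type: summary of one scan over an array of doubles. */
typedef struct {
    size_t nan_count;        /* entries that are NaN */
    size_t inf_count;        /* entries that are +Inf or -Inf */
    size_t subnormal_count;  /* nonzero entries smaller in magnitude than DBL_MIN */
    ptrdiff_t first_bad;     /* index of the first NaN/Inf entry, or -1 if none */
} ScanResult;

/* Classify every entry of a[0..n-1] and count the suspicious ones.
 * Subnormals are counted but not treated as "bad", because they are
 * valid IEEE values; they are reported only as a hint. */
ScanResult scan_values(const double *a, size_t n)
{
    ScanResult r = {0, 0, 0, -1};
    assert(a != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        switch (fpclassify(a[i])) {
        case FP_NAN:
            r.nan_count++;
            if (r.first_bad < 0) r.first_bad = (ptrdiff_t)i;
            break;
        case FP_INFINITE:
            r.inf_count++;
            if (r.first_bad < 0) r.first_bad = (ptrdiff_t)i;
            break;
        case FP_SUBNORMAL:
            r.subnormal_count++;
            break;
        default:
            /* FP_NORMAL and FP_ZERO need no action. */
            break;
        }
    }
    return r;
}
```

Calling scan_values on the right-hand side and on the initial guess just
before KSPSolve would show whether a bad value is already present in the
input; a first_bad of -1 means no NaN or Inf was found.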
>
>    Matt
>
>     I also tried valgrind, but it aborts almost immediately:
>
>     valgrind --leak-check=yes ./a.out
>     ==17603== Memcheck, a memory error detector.
>     ==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
>     ==17603== Using LibVEX rev 1658, a library for dynamic binary translation.
>     ==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
>     ==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
>     ==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
>     ==17603== For more details, rerun with: -v
>     ==17603==
>     --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
>     --17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10
>     vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0
>     ==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.
>     ==17603== Your program just tried to execute an instruction that Valgrind
>     ==17603== did not recognise.  There are two possible reasons for this.
>     ==17603== 1. Your program has a bug and erroneously jumped to a non-code
>     ==17603==    location.  If you are running Memcheck and you just saw a
>     ==17603==    warning about a bad jump, it's probably your program's fault.
>     ==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,
>     ==17603==    i.e. it's Valgrind's fault.  If you think this is the case or
>     ==17603==    you are not sure, please let us know and we'll try to fix it.
>     ==17603== Either way, Valgrind will now raise a SIGILL signal which will
>     ==17603== probably kill your program.
>     forrtl: severe (168): Program Exception - illegal instruction
>     Image              PC                Routine Line  Source
>     libifcore.so.5     0000000005DD0F0E  Unknown Unknown  Unknown
>     libifcore.so.5     0000000005DD0DC7  Unknown Unknown  Unknown
>     a.out              0000000001CB4CBB  Unknown Unknown  Unknown
>     a.out              00000000004093DC  Unknown Unknown  Unknown
>     libc.so.6          000000369141D974  Unknown Unknown  Unknown
>     a.out              00000000004092E9  Unknown Unknown  Unknown
>     ==17603==
>     ==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
>     ==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.
>     ==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.
>     ==17603== For counts of detected errors, rerun with: -v
>     ==17603== searching for pointers to 8 not-freed blocks.
>     ==17603== checked 2,340,280 bytes.
>     ==17603==
>     ==17603== LEAK SUMMARY:
>     ==17603==    definitely lost: 0 bytes in 0 blocks.
>     ==17603==      possibly lost: 0 bytes in 0 blocks.
>     ==17603==    still reachable: 239 bytes in 8 blocks.
>     ==17603==         suppressed: 0 bytes in 0 blocks.
>     ==17603== Reachable blocks (those to which a pointer was found) are not shown.
>     ==17603== To see them, rerun with: --show-reachable=yes
>
>     Thank you
>
>     Yours sincerely,
>
>     TAY wee-beng
>
>     On 23/4/2014 5:18 PM, TAY wee-beng wrote:
>
>         Hi,
>
>         My code gives wrong answers on one of the clusters, even on a
>         single processor. No error message is given. It used to work
>         fine.
>
>         I ran the debug version and it gives this error message:
>
>         [0]PETSC ERROR: ------------------------------------------------------------------------
>         [0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero
>         [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>         [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>         [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>         [0]PETSC ERROR: likely location of problem given in stack below
>         [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>         [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>         [0]PETSC ERROR:       INSTEAD the line number of the start of the function
>         [0]PETSC ERROR:       is given.
>         [0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c
>         [0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c
>         [0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c
>         [0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c
>         [0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c
>         [0]PETSC ERROR: --------------------- Error Message ------------------------------------
>         [0]PETSC ERROR: Signal received!
>         [0]PETSC ERROR: ------------------------------------------------------------------------
>
>         It happens after KSPSolve. There is no problem on other
>         clusters. How should I debug this to find the error?
>
>         I tried to compare the input matrix and vector between the
>         different clusters, but there are too many values.
>
>
>
>
>
> -- 
> What most experimenters take for granted before they begin their 
> experiments is infinitely more interesting than any results to which 
> their experiments lead.
> -- Norbert Wiener

-------------- next part --------------
(gdb) bt full
#0  VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at bvec1.c:71
        ya = (const PetscScalar *) 0x0
        xa = (const PetscScalar *) 0x0
        one = 1
        bn = 960
        ierr = 0
#1  0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940, yin=0x14ad8fb0, 
    z=0x7fff24cd7f40) at pbvec.c:15
        sum = 1.9762625833649862e-323
        work = 1.3431953209405154
        ierr = 0
#2  0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0, 
    val=0x7fff24cd7f40) at rvector.c:128
        ierr = 0
#3  0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at bcgs.c:85
        ierr = 0
        i = 0
        rho = 1.9762625833649862e-323
        rhoold = 1
        alpha = 1
        beta = 1.600807474747106e-316
        omega = 1.6910452843641213e-315
        omegaold = 1
        d1 = 0
        X = (Vec) 0x14771890
        B = (Vec) 0x1476b110
        V = (Vec) 0x14ade620
        P = (Vec) 0x14aee970
        R = (Vec) 0x14ad3940
        RP = (Vec) 0x14ad8fb0
        T = (Vec) 0x14ae3c90
        S = (Vec) 0x14ae9300
        dp = 1.1589630369172761
        d2 = 1.5718032521948665e-316
        bcgs = (KSP_BCGS *) 0x14abdc40
#4  0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110, x=0x14771890)
    at itfunc.c:441
        ierr = 0
        rank = 32767
        flag1 = PETSC_FALSE
        flag2 = PETSC_FALSE
        flag3 = PETSC_FALSE
        flg = PETSC_FALSE
        inXisinB = PETSC_FALSE
        guess_zero = PETSC_TRUE
        viewer = (PetscViewer) 0x7fff0000000b
        mat = (Mat) 0x0
        premat = (Mat) 0x7fff24ce0cf8
        format = PETSC_VIEWER_DEFAULT
#5  0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650, x=0x3959f38, 
    __ierr=0x384d8b8) at itfuncf.c:219
No locals.
#6  0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_ ()
No symbol table info available.
#7  0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()
No symbol table info available.
#8  0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675
        filename = <error reading variable>
        file_write_no_char = <error reading variable>
        del_t_tmp = 0
        chord = 0
        sum_w = 317.24634223818254
        sum_v = -0.029308797839978005
        sum_u = -0.26935975976548698
        max_w = 1.1569511841851332
        max_v = 0.15264060063132492
        max_u = 0.23053472715538209
        explode_check = 10
        error_all = 0
        ijk = 1
        escape_time = 99990000
        error = 0
        interval2 = -16000
        openstatus = {0, 0, 0, 0, 0, 0}
#9  0x00000000004093dc in main ()
No symbol table info available.

