<html>
  <head>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 23/4/2014 6:00 PM, Matthew Knepley
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAMYG4GkFHs8M-QKNyF70ixa513gHHVoRfqZ2P8+GHYg3LaoV6g@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">On Wed, Apr 23, 2014 at 5:55 AM, TAY
            wee-beng <span dir="ltr">&lt;<a moz-do-not-send="true"
                href="mailto:zonexo@gmail.com" target="_blank">zonexo@gmail.com</a>&gt;</span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
              <br>
              Just an update: I managed to compare the values by
              reducing the problem size to a hundred-plus values. The
              matrix and vector are almost the same as in my Win7
              output.<br>
            </blockquote>
            <div><br>
            </div>
            <div>Run in the debugger and get a stack trace.</div>
          </div>
        </div>
      </div>
    </blockquote>
    Hi,<br>
    <br>
    I used the -start_in_debugger option and it stops at this point:<br>
    <br>
    Program received signal SIGFPE, Arithmetic exception.<br>
    VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at
    bvec1.c:71<br>
    71          ierr =
    PetscLogFlops(2.0*xin->map->n-1);CHKERRQ(ierr);<br>
    (gdb) where<br>
    #0  VecDot_Seq (xin=0x14ad3940, yin=0x14ad8fb0, z=0x7fff24cd79b8) at
    bvec1.c:71<br>
    #1  0x0000000001f1d8b5 in VecDot_MPI (xin=0x14ad3940,
    yin=0x14ad8fb0, <br>
        z=0x7fff24cd7f40) at pbvec.c:15<br>
    #2  0x0000000001edfa14 in VecDot (x=0x14ad3940, y=0x14ad8fb0, <br>
        val=0x7fff24cd7f40) at rvector.c:128<br>
    #3  0x00000000025cf539 in KSPSolve_BCGS (ksp=0x1479d910) at
    bcgs.c:85<br>
    #4  0x0000000002576687 in KSPSolve (ksp=0x1479d910, b=0x1476b110,
    x=0x14771890)<br>
        at itfunc.c:441<br>
    #5  0x0000000001d859d9 in kspsolve_ (ksp=0x395a548, b=0x395a650,
    x=0x3959f38, <br>
        __ierr=0x384d8b8) at itfuncf.c:219<br>
    #6  0x0000000001c37def in petsc_solvers_mp_semi_momentum_simple_xyz_
    ()<br>
    #7  0x0000000001c97c02 in fractional_initial_mp_fractional_steps_ ()<br>
    #8  0x0000000001cbc336 in ibm3d_high_re () at ibm3d_high_Re.F90:675<br>
    #9  0x00000000004093dc in main ()<br>
    (gdb) <br>
    <br>
    Is this what you mean by a stack trace?<br>
    <br>
    I have also run "bt full" and attached the more detailed
    output.<br>
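    <br>
    Since the exception is raised inside VecDot, one side check that
    could be made is whether either input vector already contains a NaN
    or Inf entry before the solve. Below is a minimal sketch, assuming
    the vector values have been dumped and read into plain Python lists.
    The helper name first_nonfinite and the sample data are made up for
    illustration; this is not a PETSc routine and it does not establish
    the cause of the SIGFPE.<br>
    <pre>
```python
import math


def first_nonfinite(values):
    """Return the index of the first NaN or Inf entry, or -1 if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return -1


# Made-up sample data: x is clean, y has one bad entry at index 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, float("nan"), 1.5, 2.0]

print("x:", first_nonfinite(x))
print("y:", first_nonfinite(y))
```
    </pre>
    If both vectors turn out to be finite, the inputs to the dot product
    are probably not the issue and the build or runtime environment on
    that cluster would be the next thing to look at.<br>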
    <blockquote
cite="mid:CAMYG4GkFHs8M-QKNyF70ixa513gHHVoRfqZ2P8+GHYg3LaoV6g@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div><br>
            </div>
            <div>   Matt</div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              I also tried valgrind, but it aborts almost immediately:<br>
              <br>
              valgrind --leak-check=yes ./a.out<br>
              ==17603== Memcheck, a memory error detector.<br>
              ==17603== Copyright (C) 2002-2006, and GNU GPL'd, by
              Julian Seward et al.<br>
              ==17603== Using LibVEX rev 1658, a library for dynamic
              binary translation.<br>
              ==17603== Copyright (C) 2004-2006, and GNU GPL'd, by
              OpenWorks LLP.<br>
              ==17603== Using valgrind-3.2.1, a dynamic binary
              instrumentation framework.<br>
              ==17603== Copyright (C) 2000-2006, and GNU GPL'd, by
              Julian Seward et al.<br>
              ==17603== For more details, rerun with: -v<br>
              ==17603==<br>
              --17603-- DWARF2 CFI reader: unhandled CFI instruction
              0:10<br>
              --17603-- DWARF2 CFI reader: unhandled CFI instruction
              0:10<br>
              vex amd64->IR: unhandled instruction bytes: 0xF 0xAE
              0x85 0xF0<br>
              ==17603== valgrind: Unrecognised instruction at address
              0x5DD0F0E.<br>
              ==17603== Your program just tried to execute an
              instruction that Valgrind<br>
              ==17603== did not recognise.  There are two possible
              reasons for this.<br>
              ==17603== 1. Your program has a bug and erroneously jumped
              to a non-code<br>
              ==17603==    location.  If you are running Memcheck and
              you just saw a<br>
              ==17603==    warning about a bad jump, it's probably your
              program's fault.<br>
              ==17603== 2. The instruction is legitimate but Valgrind
              doesn't handle it,<br>
              ==17603==    i.e. it's Valgrind's fault.  If you think
              this is the case or<br>
              ==17603==    you are not sure, please let us know and
              we'll try to fix it.<br>
              ==17603== Either way, Valgrind will now raise a SIGILL
              signal which will<br>
              ==17603== probably kill your program.<br>
              forrtl: severe (168): Program Exception - illegal
              instruction<br>
              Image              PC                Routine Line      
               Source<br>
              libifcore.so.5     0000000005DD0F0E  Unknown Unknown
               Unknown<br>
              libifcore.so.5     0000000005DD0DC7  Unknown Unknown
               Unknown<br>
              a.out              0000000001CB4CBB  Unknown Unknown
               Unknown<br>
              a.out              00000000004093DC  Unknown Unknown
               Unknown<br>
              libc.so.6          000000369141D974  Unknown Unknown
               Unknown<br>
              a.out              00000000004092E9  Unknown Unknown
               Unknown<br>
              ==17603==<br>
              ==17603== ERROR SUMMARY: 0 errors from 0 contexts
              (suppressed: 5 from 1)<br>
              ==17603== malloc/free: in use at exit: 239 bytes in 8
              blocks.<br>
              ==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes
              allocated.<br>
              ==17603== For counts of detected errors, rerun with: -v<br>
              ==17603== searching for pointers to 8 not-freed blocks.<br>
              ==17603== checked 2,340,280 bytes.<br>
              ==17603==<br>
              ==17603== LEAK SUMMARY:<br>
              ==17603==    definitely lost: 0 bytes in 0 blocks.<br>
              ==17603==      possibly lost: 0 bytes in 0 blocks.<br>
              ==17603==    still reachable: 239 bytes in 8 blocks.<br>
              ==17603==         suppressed: 0 bytes in 0 blocks.<br>
              ==17603== Reachable blocks (those to which a pointer was
              found) are not shown.<br>
              ==17603== To see them, rerun with: --show-reachable=yes<br>
              <br>
              Thank you<br>
              <br>
              Yours sincerely,<br>
              <br>
              TAY wee-beng<br>
              <br>
              On 23/4/2014 5:18 PM, TAY wee-beng wrote:<br>
              <blockquote class="gmail_quote" style="margin:0 0 0
                .8ex;border-left:1px #ccc solid;padding-left:1ex">
                Hi,<br>
                <br>
                My code was found to give wrong answers on one of
                the clusters, even on a single processor. No error message
                was given. It used to work fine.<br>
                <br>
                I ran the debug version and it gives this error message:<br>
                <br>
                [0]PETSC ERROR: ------------------------------------------------------------------------<br>
                [0]PETSC ERROR: Caught signal number 8 FPE: Floating
                Point Exception,probably divide by zero<br>
                [0]PETSC ERROR: Try option -start_in_debugger or
                -on_error_attach_debugger<br>
                [0]PETSC ERROR: or see <a moz-do-not-send="true"
href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind"
                  target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
                [0]PETSC ERROR: or try <a moz-do-not-send="true"
                  href="http://valgrind.org" target="_blank">http://valgrind.org</a>
                on GNU/linux and Apple Mac OS X to find memory
                corruption errors<br>
                [0]PETSC ERROR: likely location of problem given in
                stack below<br>
                [0]PETSC ERROR: ---------------------  Stack Frames
                ------------------------------------<br>
                [0]PETSC ERROR: Note: The EXACT line numbers in the
                stack are not available,<br>
                [0]PETSC ERROR:       INSTEAD the line number of the
                start of the function<br>
                [0]PETSC ERROR:       is given.<br>
                [0]PETSC ERROR: [0] VecDot_Seq line 62
                src/vec/vec/impls/seq/bvec1.c<br>
                [0]PETSC ERROR: [0] VecDot_MPI line 14
                src/vec/vec/impls/mpi/pbvec.c<br>
                [0]PETSC ERROR: [0] VecDot line 118
                src/vec/vec/interface/rvector.c<br>
                [0]PETSC ERROR: [0] KSPSolve_BCGS line 39
                src/ksp/ksp/impls/bcgs/bcgs.c<br>
                [0]PETSC ERROR: [0] KSPSolve line 356
                src/ksp/ksp/interface/itfunc.c<br>
                [0]PETSC ERROR: --------------------- Error Message
                ------------------------------------<br>
                [0]PETSC ERROR: Signal received!<br>
                [0]PETSC ERROR: ------------------------------------------------------------------------<br>
                <br>
                It happens after KSPSolve is called. There was no problem on the other
                cluster. How should I debug this to find the error?<br>
                <br>
                I tried to compare the input matrix and vector between
                the different clusters, but there are too many values.<br>
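                <br>
                One way the comparison could be narrowed down is to
                report only the entries that differ by more than a
                tolerance, worst first. This is a minimal sketch,
                assuming the two dumps have already been read into
                equal-length lists of floats. The function name
                largest_differences, the tolerance and the sample
                numbers are assumptions for illustration only.<br>
                <pre>
```python
def largest_differences(a, b, rel_tol=1e-10, top=5):
    """Return up to `top` tuples (index, a_i, b_i, relative difference),
    worst first, for entries whose relative difference exceeds rel_tol."""
    if len(a) != len(b):
        raise ValueError("dumps have different lengths")
    diffs = []
    for i, (u, v) in enumerate(zip(a, b)):
        # Scale by the larger magnitude; the tiny floor avoids dividing by zero.
        scale = max(abs(u), abs(v), 1e-300)
        rel = abs(u - v) / scale
        if rel > rel_tol:
            diffs.append((i, u, v, rel))
    diffs.sort(key=lambda t: t[3], reverse=True)
    return diffs[:top]


# Made-up sample values standing in for the two clusters' output.
cluster_a = [1.0, 2.0, 3.0, 4.0]
cluster_b = [1.0, 2.0000001, 3.0, 4.5]

for i, u, v, rel in largest_differences(cluster_a, cluster_b):
    print(i, u, v, rel)
```
                </pre>
                Note that a NaN entry never compares as greater than the
                tolerance, so it would be skipped here and needs a
                separate check.<br>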
                <br>
              </blockquote>
              <br>
            </blockquote>
          </div>
          <br>
          <br clear="all">
          <div><br>
          </div>
          -- <br>
          What most experimenters take for granted before they begin
          their experiments is infinitely more interesting than any
          results to which their experiments lead.<br>
          -- Norbert Wiener
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>