<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Apr 23, 2014 at 5:55 AM, TAY wee-beng <span dir="ltr">&lt;<a href="mailto:zonexo@gmail.com" target="_blank">zonexo@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
Just to update: I managed to compare the values by reducing the problem size to a hundred-plus values. The matrix and vector are almost the same as my win7 output.<br></blockquote><div><br></div><div>Run in the debugger and get a stack trace.</div>
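<div><br></div><div>On the "almost the same" comparison quoted above: a minimal sketch of how it could be made quantitative, assuming the matrix and vector entries have been dumped as plain numbers on each machine. The function names, the one-number-per-line file format, and the tolerances are all hypothetical and not part of PETSc. Non-finite entries are flagged separately, since a NaN or Inf in the input is one possible (unconfirmed) way for a dot product to raise a floating point exception.</div>
<pre>
```python
import math


def compare_values(reference, candidate, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two equally long sequences of floats entry by entry.

    Returns a dict with:
      non_finite: indices where either value is NaN or Inf
      mismatches: (index, reference, candidate) for entries outside tolerance
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences differ in length")

    non_finite = []
    mismatches = []
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if not (math.isfinite(a) and math.isfinite(b)):
            non_finite.append(i)
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, a, b))
    return {"non_finite": non_finite, "mismatches": mismatches}


def load_values(path):
    """Read one number per non-empty line from a plain-text dump (assumed format)."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Made-up values for illustration only, not data from this thread.
    win7 = [1.0, 2.5, 0.0, 4.0]
    cluster = [1.0, 2.5000001, 0.0, float("nan")]
    report = compare_values(win7, cluster)
    print("non-finite entries at:", report["non_finite"])
    for index, a, b in report["mismatches"]:
        print("entry", index, "differs:", a, "vs", b)
```
</pre>
<div>With real dumps, something like compare_values(load_values("win7.txt"), load_values("cluster.txt")) would do the same job (both file names are placeholders). This only narrows down the inputs; it does not replace the stack trace from the debugger.</div>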
<div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I also tried valgrind, but it aborts almost immediately:<br>
<br>
valgrind --leak-check=yes ./a.out<br>
==17603== Memcheck, a memory error detector.<br>
==17603== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.<br>
==17603== Using LibVEX rev 1658, a library for dynamic binary translation.<br>
==17603== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.<br>
==17603== Using valgrind-3.2.1, a dynamic binary instrumentation framework.<br>
==17603== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.<br>
==17603== For more details, rerun with: -v<br>
==17603==<br>
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10<br>
--17603-- DWARF2 CFI reader: unhandled CFI instruction 0:10<br>
vex amd64->IR: unhandled instruction bytes: 0xF 0xAE 0x85 0xF0<br>
==17603== valgrind: Unrecognised instruction at address 0x5DD0F0E.<br>
==17603== Your program just tried to execute an instruction that Valgrind<br>
==17603== did not recognise. There are two possible reasons for this.<br>
==17603== 1. Your program has a bug and erroneously jumped to a non-code<br>
==17603== location. If you are running Memcheck and you just saw a<br>
==17603== warning about a bad jump, it's probably your program's fault.<br>
==17603== 2. The instruction is legitimate but Valgrind doesn't handle it,<br>
==17603== i.e. it's Valgrind's fault. If you think this is the case or<br>
==17603== you are not sure, please let us know and we'll try to fix it.<br>
==17603== Either way, Valgrind will now raise a SIGILL signal which will<br>
==17603== probably kill your program.<br>
forrtl: severe (168): Program Exception - illegal instruction<br>
Image PC Routine Line Source<br>
libifcore.so.5 0000000005DD0F0E Unknown Unknown Unknown<br>
libifcore.so.5 0000000005DD0DC7 Unknown Unknown Unknown<br>
a.out 0000000001CB4CBB Unknown Unknown Unknown<br>
a.out 00000000004093DC Unknown Unknown Unknown<br>
libc.so.6 000000369141D974 Unknown Unknown Unknown<br>
a.out 00000000004092E9 Unknown Unknown Unknown<br>
==17603==<br>
==17603== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)<br>
==17603== malloc/free: in use at exit: 239 bytes in 8 blocks.<br>
==17603== malloc/free: 31 allocs, 23 frees, 31,388 bytes allocated.<br>
==17603== For counts of detected errors, rerun with: -v<br>
==17603== searching for pointers to 8 not-freed blocks.<br>
==17603== checked 2,340,280 bytes.<br>
==17603==<br>
==17603== LEAK SUMMARY:<br>
==17603== definitely lost: 0 bytes in 0 blocks.<br>
==17603== possibly lost: 0 bytes in 0 blocks.<br>
==17603== still reachable: 239 bytes in 8 blocks.<br>
==17603== suppressed: 0 bytes in 0 blocks.<br>
==17603== Reachable blocks (those to which a pointer was found) are not shown.<br>
==17603== To see them, rerun with: --show-reachable=yes<br>
<br>
Thank you<br>
<br>
Yours sincerely,<br>
<br>
TAY wee-beng<br>
<br>
On 23/4/2014 5:18 PM, TAY wee-beng wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<br>
<br>
My code was found to be giving wrong answers on one of the clusters, even on a single processor. No error message was given. It used to work fine.<br>
<br>
I ran the debug version and it gives this error message:<br>
<br>
[0]PETSC ERROR: ------------------------------------------------------------------------<br>
[0]PETSC ERROR: Caught signal number 8 FPE: Floating Point Exception,probably divide by zero<br>
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>
[0]PETSC ERROR: or see <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a><br>
[0]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br>
[0]PETSC ERROR: likely location of problem given in stack below<br>
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br>
[0]PETSC ERROR: INSTEAD the line number of the start of the function<br>
[0]PETSC ERROR: is given.<br>
[0]PETSC ERROR: [0] VecDot_Seq line 62 src/vec/vec/impls/seq/bvec1.c<br>
[0]PETSC ERROR: [0] VecDot_MPI line 14 src/vec/vec/impls/mpi/pbvec.c<br>
[0]PETSC ERROR: [0] VecDot line 118 src/vec/vec/interface/rvector.c<br>
[0]PETSC ERROR: [0] KSPSolve_BCGS line 39 src/ksp/ksp/impls/bcgs/bcgs.c<br>
[0]PETSC ERROR: [0] KSPSolve line 356 src/ksp/ksp/interface/itfunc.c<br>
[0]PETSC ERROR: --------------------- Error Message ------------------------------------<br>
[0]PETSC ERROR: Signal received!<br>
[0]PETSC ERROR: ------------------------------------------------------------------------<br>
<br>
It happens after KSPSolve. There was no problem on the other cluster. How should I debug this to find the error?<br>
<br>
I tried to compare the input matrix and vector between the two clusters, but there are too many values.<br>
<br>
</blockquote>
<br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener
</div></div>