On Mon, Jul 30, 2012 at 5:04 PM, Ronald M. Caplan <span dir="ltr"><<a href="mailto:caplanr@predsci.com" target="_blank">caplanr@predsci.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi everyone,<br><br>I seem to have solved the problem.<br><br>I was storing my entire matrix on node 0 and then calling MatAssembly (begin and end) on all nodes (which should have worked...). <br><br>Apparently I was using too much space for the buffering or the like, because when I change the code so that each node sets its own matrix values, MatAssemblyEnd no longer segfaults. <br>
</blockquote><div><br></div><div>Hmm, it should give a nice error, not SEGV so I am still interested in the stack trace.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Why should this be the case? How many elements of a vector or matrix can a single node "set" before Assembly to distribute over all nodes?<br></blockquote><div><br></div><div>If you are going to set a ton of elements, consider using MAT_FLUSH_ASSEMBLY and calling MatAssemblyBegin/MatAssemblyEnd a few times during the loop.</div>
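<div><br></div><div>To give a rough feel for why a single rank setting every value is expensive, here is a minimal plain-Python sketch (it makes no PETSc calls). It assumes the PETSC_DECIDE row split gives each rank N/size rows, plus one extra row for the first N mod size ranks; that assumption is consistent with the ranges printed in the quoted 2-process log (0 to 23288 and 23288 to 46575 for N = 46575). The helper names ownership_ranges, off_process_entries and flush_batches are hypothetical, chosen only for this illustration.</div>

```python
def ownership_ranges(n_rows, n_ranks):
    """Return [start, end) row ranges per rank (assumed PETSC_DECIDE-style split)."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks get one additional row each.
    return [
        (r * base + min(r, extra), (r + 1) * base + min(r + 1, extra))
        for r in range(n_ranks)
    ]


def off_process_entries(row_ptr, ranges, setter_rank=0):
    """Count nonzeros that setter_rank sets but does not own.

    row_ptr is a 0-based CSR row-pointer array of length n_rows + 1.
    These entries have to be buffered and sent to their owners at assembly.
    """
    start, end = ranges[setter_rank]
    total = row_ptr[-1] - row_ptr[0]
    owned = row_ptr[end] - row_ptr[start]
    return total - owned


def flush_batches(n_rows, rows_per_batch):
    """Split rows into batches; one intermediate assembly would follow each batch."""
    return [
        (lo, min(lo + rows_per_batch, n_rows))
        for lo in range(0, n_rows, rows_per_batch)
    ]


if __name__ == "__main__":
    # Row split for the matrix size and rank count from the quoted log.
    print(ownership_ranges(46575, 2))
    # Toy CSR row pointer: 4 rows, 2 nonzeros per row, 2 ranks.
    print(off_process_entries([0, 2, 4, 6, 8], ownership_ranges(4, 2)))
    print(flush_batches(10, 4))
```

<div>In the real code, each rank would either insert only the rows inside its own range, or rank 0 would insert one batch at a time with every rank calling MatAssemblyBegin/MatAssemblyEnd with MAT_FLUSH_ASSEMBLY after each batch. Those calls are collective, so all ranks have to make the same number of them, followed by one MAT_FINAL_ASSEMBLY at the end.</div>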
<div><br></div><div> Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> - Ron C<div class="HOEnZb"><div class="h5"><br><br><br><br><div class="gmail_quote">
On Fri, Jul 27, 2012 at 2:14 PM, Ronald M. Caplan <span dir="ltr"><<a href="mailto:caplanr@predsci.com" target="_blank">caplanr@predsci.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br><br>I do not know how to get the stack trace.<br><br>Attached is the code and makefile. <br><br>The value of npts is set to 25, which is where the code crashes with more than one core running. If I set npts to around 10, the code works with up to 12 processes (fast too!), but with more than 12 it crashes as well.<br>
<br>Thanks for your help!<br><br> - Ron C<div><div><br><br><div class="gmail_quote">On Fri, Jul 27, 2012 at 1:52 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>On Fri, Jul 27, 2012 at 3:35 PM, Ronald M. Caplan <span dir="ltr"><<a href="mailto:caplanr@predsci.com" target="_blank">caplanr@predsci.com</a>></span> wrote:<br>
</div><div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
1) Checked it, had no leaks or any other problems that I could see.<br><br>2) Ran it with debugging and without. The debugging is how I know it was in MatAssemblyEnd().<br></blockquote><div><br></div></div><div>It's rare when valgrind does not catch something, but it happens. From here I would really like:</div>
<div><br></div><div> 1) The stack trace from the fault</div><div><br></div><div> 2) The code to run here</div><div><br></div><div>This is one of the oldest and most used pieces of PETSc. It's difficult to believe that the bug is there</div>
<div>rather than a result of earlier memory corruption.</div><div><br></div><div> Thanks,</div><div><br></div><div> Matt</div><div><div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
3) Here is the matrix part of the code:<br><br>
!Create matrix: <br> call MatCreate(PETSC_COMM_WORLD,A,ierr) <br> call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N,ierr)<br> call MatSetType(A,MATMPIAIJ,ierr) <br> call MatSetFromOptions(A,ierr)<br>
!print*,'3nrt: ',3*nr*nt <br> i = 16<br> IF(size .eq. 1) THEN<br> j = 0<br> ELSE<br> j = 8<br> END IF <br> call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,<br>
& j,PETSC_NULL_INTEGER,ierr)<br> <br> !Do not call this if using preallocation!<br> !call MatSetUp(A,ierr) <br> <br> call MatGetOwnershipRange(A,i,j,ierr)<br>
print*,'Rank ',rank,' has range ',i,' and ',j<br> <br> !Get MAS matrix in CSR format (random numbers for now): <br> IF (rank .eq. 0) THEN <br> call GET_RAND_MAS_MATRIX(CSR_A,CSR_AI,CSR_AJ,nr,nt,np,M) <br>
print*,'Number of non-zero entries in matrix:',M <br> !Store matrix values one-by-one (inefficient: better way<br> ! more complicated - implement later)<br> <br> DO i=1,N<br>
!print*,'numofnonzerosinrowi:',CSR_AJ(i+1)-CSR_AJ(i)+1<br> DO j=CSR_AJ(i)+1,CSR_AJ(i+1)<br> call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),<br> & INSERT_VALUES,ierr) <br>
<br> END DO<br> END DO <br> print*,'Done setting matrix values...' <br> END IF<br> <br> !Assemble matrix A across all cores:<br> call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)<br>
 print*,'between assembly'<br> call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)<br><br><br><br>A couple of things to note:<br>a) my CSR_AJ is what most people would call ai, etc.<br>b) my CSR array values are 0-indexed but the arrays are 1-indexed.<br>
<br><br><br>Here is the run with one processor (-n 1):<br><br>sumseq:PETSc sumseq$ valgrind mpiexec -n 1 ./petsctest -mat_view_info<br>==26297== Memcheck, a memory error detector<br>==26297== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.<br>
==26297== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info<br>==26297== Command: mpiexec -n 1 ./petsctest -mat_view_info<br>==26297== <br>UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]<br>
N: 46575<br> cores: 1<br> MPI TEST: My rank is: 0<br> Rank 0 has range 0 and 46575<br> Number of non-zero entries in matrix: 690339<br> Done setting matrix values...<br>
between assembly<br>Matrix Object: 1 MPI processes<br> type: mpiaij<br> rows=46575, cols=46575<br> total: nonzeros=690339, allocated nonzeros=745200<br> total number of mallocs used during MatSetValues calls =0<br> not using I-node (on process 0) routines<br>
PETSc y=Ax time: 367.9164 nsec/mp.<br> PETSc y=Ax flops: 0.2251188 GFLOPS.<br>==26297== <br>==26297== HEAP SUMMARY:<br>==26297== in use at exit: 139,984 bytes in 65 blocks<br>==26297== total heap usage: 938 allocs, 873 frees, 229,722 bytes allocated<br>
==26297== <br>==26297== LEAK SUMMARY:<br>==26297== definitely lost: 0 bytes in 0 blocks<br>==26297== indirectly lost: 0 bytes in 0 blocks<br>==26297== possibly lost: 0 bytes in 0 blocks<br>==26297== still reachable: 139,984 bytes in 65 blocks<br>
==26297== suppressed: 0 bytes in 0 blocks<br>==26297== Rerun with --leak-check=full to see details of leaked memory<br>==26297== <br>==26297== For counts of detected and suppressed errors, rerun with: -v<br>==26297== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)<br>
sumseq:PETSc sumseq$ <br><br><br><br>Here is the run with 2 processors (-n 2)<br><br>sumseq:PETSc sumseq$ valgrind mpiexec -n 2 ./petsctest -mat_view_info<br>==26301== Memcheck, a memory error detector<br>==26301== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.<br>
==26301== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info<br>==26301== Command: mpiexec -n 2 ./petsctest -mat_view_info<br>==26301== <br>UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]<br>
N: 46575<br> cores: 2<br> MPI TEST: My rank is: 0<br> MPI TEST: My rank is: 1<br> Rank 0 has range 0 and 23288<br> Rank 1 has range 23288 and 46575<br>
Number of non-zero entries in matrix: 690339<br> Done setting matrix values...<br> between assembly<br> between assembly<br>[1]PETSC ERROR: ------------------------------------------------------------------------<br>
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range<br>[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger<br>[1]PETSC ERROR: or see <a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a>[1]PETSC ERROR: or try <a href="http://valgrind.org" target="_blank">http://valgrind.org</a> on GNU/linux and Apple Mac OS X to find memory corruption errors<br>
[1]PETSC ERROR: likely location of problem given in stack below<br>[1]PETSC ERROR: --------------------- Stack Frames ------------------------------------<br>[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,<br>
[1]PETSC ERROR: INSTEAD the line number of the start of the function<br>[1]PETSC ERROR: is given.<br>[1]PETSC ERROR: [1] MatStashScatterGetMesg_Private line 609 /usr/local/petsc-3.3-p2/src/mat/utils/matstash.c<br>
[1]PETSC ERROR: [1] MatAssemblyEnd_MPIAIJ line 646 /usr/local/petsc-3.3-p2/src/mat/impls/aij/mpi/mpiaij.c<br>[1]PETSC ERROR: [1] MatAssemblyEnd line 4857 /usr/local/petsc-3.3-p2/src/mat/interface/matrix.c<br>[1]PETSC ERROR: --------------------- Error Message ------------------------------------<br>
[1]PETSC ERROR: Signal received!<br>[1]PETSC ERROR: ------------------------------------------------------------------------<br>[1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00 CDT 2012 <br>[1]PETSC ERROR: See docs/changes/index.html for recent updates.<br>
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.<br>[1]PETSC ERROR: See docs/index.html for manual pages.<br>[1]PETSC ERROR: ------------------------------------------------------------------------<br>
[1]PETSC ERROR: ./petsctest on a arch-darw named <a href="http://sumseq.predsci.com" target="_blank">sumseq.predsci.com</a> by sumseq Fri Jul 27 13:34:36 2012<br>
[1]PETSC ERROR: Libraries linked from /usr/local/petsc-3.3-p2/arch-darwin-c-debug/lib<br>[1]PETSC ERROR: Configure run at Fri Jul 27 13:28:26 2012<br>[1]PETSC ERROR: Configure options --with-debugging=1<br>[1]PETSC ERROR: ------------------------------------------------------------------------<br>
[1]PETSC ERROR: User provided function() line 0 in unknown directory unknown file<br>application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1<br>[cli_1]: aborting job:<br>application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1<br>
==26301== <br>==26301== HEAP SUMMARY:<br>==26301== in use at exit: 139,984 bytes in 65 blocks<br>==26301== total heap usage: 1,001 allocs, 936 frees, 234,886 bytes allocated<br>==26301== <br>==26301== LEAK SUMMARY:<br>
==26301== definitely lost: 0 bytes in 0 blocks<br>==26301== indirectly lost: 0 bytes in 0 blocks<br>==26301== possibly lost: 0 bytes in 0 blocks<br>==26301== still reachable: 139,984 bytes in 65 blocks<br>==26301== suppressed: 0 bytes in 0 blocks<br>
==26301== Rerun with --leak-check=full to see details of leaked memory<br>==26301== <br>==26301== For counts of detected and suppressed errors, rerun with: -v<br>==26301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)<br>
sumseq:PETSc sumseq$ <br><br><br><br> - Ron<br><br><br><br><br><br><br><div class="gmail_quote">On Fri, Jul 27, 2012 at 1:19 PM, Jed Brown <span dir="ltr"><<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">1. Check for memory leaks using Valgrind.<div><br></div><div>2. Be sure to run a build configured --with-debugging=1 (the default) when trying to find the error.</div>
<div><br></div><div>3. Send the full error message and the relevant bit of code.<div><div><br>
<div><br><div class="gmail_quote">On Fri, Jul 27, 2012 at 3:17 PM, Ronald M. Caplan <span dir="ltr"><<a href="mailto:caplanr@predsci.com" target="_blank">caplanr@predsci.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hello,<br><br>I am running a simple test code which takes a sparse AIJ matrix in PETSc and multiplies it by a vector.<br><br>The matrix is defined as an AIJ MPI matrix.<br><br>When I run the program on a single core, it runs fine.<br>
<br>When I run it using MPI with multiple processes (I am on a 4-core, 8-thread Mac), the code runs correctly for matrices under a certain size (2880 x 2880), but when the matrix is set to be larger, the code crashes with a segfault and the error says it was in MatAssemblyEnd(). Sometimes it works with -n 2, but it usually crashes when using more than one core. <br>
<br>Any ideas on what it could be? <br><br>Thanks,<br><br>Ron Caplan<br>
</blockquote></div><br></div></div></div></div>
</blockquote></div><br>
</blockquote></div></div></div><span><font color="#888888"><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener<br>
</font></span></blockquote></div><br>
</div></div></blockquote></div><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>
-- Norbert Wiener<br>