[petsc-users] segfault in MatAssemblyEnd() when using large matrices on multi-core Mac OS X

Matthew Knepley knepley at gmail.com
Fri Jul 27 15:52:06 CDT 2012


On Fri, Jul 27, 2012 at 3:35 PM, Ronald M. Caplan <caplanr at predsci.com> wrote:

> 1) Checked it, had no leaks or any other problems that I could see.
>
> 2) Ran it with debugging and without.  The debugging is how I know it was
> in MatAssemblyEnd().
>

It's rare for valgrind not to catch something, but it happens. From here
I would really like:

  1) The stack trace from the fault

  2) The code, so that we can run it here

This is one of the oldest and most heavily used pieces of PETSc. It's difficult
to believe that the bug is there rather than being the result of earlier memory
corruption.
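
If memory is being corrupted earlier, a cheap way to narrow it down is to
check the error code after every PETSc call, so the run stops at the first
call that fails instead of at assembly. A minimal sketch in the style of
your code below (plain ierr tests that just print and stop; in a
preprocessed .F source the CHKERRQ(ierr) macro should do the same thing
more compactly):

      call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,
     &                               j,PETSC_NULL_INTEGER,ierr)
      if (ierr .ne. 0) then
         print*,'MatMPIAIJSetPreallocation failed, ierr = ',ierr
         stop
      end if

      call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
      if (ierr .ne. 0) then
         print*,'MatAssemblyBegin failed, ierr = ',ierr
         stop
      end if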

   Thanks,

      Matt


> 3)  Here is the matrix part of the code:
>
> !Create matrix:
>       call MatCreate(PETSC_COMM_WORLD,A,ierr)
>       call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N,ierr)
>       call MatSetType(A,MATMPIAIJ,ierr)
>       call MatSetFromOptions(A,ierr)
>       !print*,'3nrt: ',3*nr*nt
>       i = 16
>       IF(size .eq. 1) THEN
>           j = 0
>       ELSE
>           j = 8
>       END IF
>       call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,
>      &                               j,PETSC_NULL_INTEGER,ierr)
>
>       !Do not call this if using preallocation!
>       !call MatSetUp(A,ierr)
>
>       call MatGetOwnershipRange(A,i,j,ierr)
>       print*,'Rank ',rank,' has range ',i,' and ',j
>
>       !Get MAS matrix in CSR format (random numbers for now):
>       IF (rank .eq. 0) THEN
>          call GET_RAND_MAS_MATRIX(CSR_A,CSR_AI,CSR_AJ,nr,nt,np,M)
>          print*,'Number of non-zero entries in matrix:',M
>           !Store matrix values one-by-one (inefficient:  better way
>              !   more complicated - implement later)
>
>          DO i=1,N
>            !print*,'numofnonzerosinrowi:',CSR_AJ(i+1)-CSR_AJ(i)+1
>             DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
>                call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
>      &                               INSERT_VALUES,ierr)
>
>             END DO
>          END DO
>          print*,'Done setting matrix values...'
>       END IF
>
>       !Assemble matrix A across all cores:
>       call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
>       print*,'between assembly'
>       call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)
>
>
>
> A couple things to note:
> a) my CSR_AJ is what most people would call ai, etc.
> b) my CSR array values are 0-indexed, but the Fortran arrays themselves are
> 1-indexed.
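> As a concrete check of that indexing (the numbers here are made up): if
> CSR_AJ(1) = 0 and CSR_AJ(2) = 3, then row i=1 of the CSR arrays is global
> row 0 of A, and the loop above inserts it as
>
>           DO j=CSR_AJ(1)+1,CSR_AJ(2)    ! j = 1, 2, 3
>              call MatSetValue(A,0,CSR_AI(j),CSR_A(j),
>      &                        INSERT_VALUES,ierr)
>           END DO
>
> i.e. values CSR_A(1..3) go into the 0-based columns CSR_AI(1..3) of global
> row 0, which is the indexing MatSetValue expects.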
>
>
>
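> One more remark on the insertion loop: since rank 0 also sets rows that the
> other ranks own, those entries are held in PETSc's matrix stash until
> MatAssemblyBegin/End communicates them (the stack trace below ends up in the
> stash code). A minimal, hypothetical variant that flushes the stash in
> batches instead of all at once could look like this (MAT_FLUSH_ASSEMBLY is
> collective, so every rank must call it):
>
>          DO i=1,N
>             IF (rank .eq. 0) THEN
>                DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
>                   call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
>      &                             INSERT_VALUES,ierr)
>                END DO
>             END IF
>             !Flush stashed off-process values every 5000 rows:
>             IF (mod(i,5000) .eq. 0) THEN
>                call MatAssemblyBegin(A,MAT_FLUSH_ASSEMBLY,ierr)
>                call MatAssemblyEnd(A,MAT_FLUSH_ASSEMBLY,ierr)
>             END IF
>          END DO
>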
> Here is the run with one processor (-n 1):
>
> sumseq:PETSc sumseq$ valgrind mpiexec -n 1 ./petsctest -mat_view_info
> ==26297== Memcheck, a memory error detector
> ==26297== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
> ==26297== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
> ==26297== Command: mpiexec -n 1 ./petsctest -mat_view_info
> ==26297==
> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>  N:        46575
>  cores:            1
>  MPI TEST:  My rank is:           0
>  Rank            0  has range            0  and        46575
>  Number of non-zero entries in matrix:      690339
>  Done setting matrix values...
>  between assembly
> Matrix Object: 1 MPI processes
>   type: mpiaij
>   rows=46575, cols=46575
>   total: nonzeros=690339, allocated nonzeros=745200
>   total number of mallocs used during MatSetValues calls =0
>     not using I-node (on process 0) routines
>  PETSc y=Ax time:      367.9164     nsec/mp.
>  PETSc y=Ax flops:    0.2251188     GFLOPS.
> ==26297==
> ==26297== HEAP SUMMARY:
> ==26297==     in use at exit: 139,984 bytes in 65 blocks
> ==26297==   total heap usage: 938 allocs, 873 frees, 229,722 bytes
> allocated
> ==26297==
> ==26297== LEAK SUMMARY:
> ==26297==    definitely lost: 0 bytes in 0 blocks
> ==26297==    indirectly lost: 0 bytes in 0 blocks
> ==26297==      possibly lost: 0 bytes in 0 blocks
> ==26297==    still reachable: 139,984 bytes in 65 blocks
> ==26297==         suppressed: 0 bytes in 0 blocks
> ==26297== Rerun with --leak-check=full to see details of leaked memory
> ==26297==
> ==26297== For counts of detected and suppressed errors, rerun with: -v
> ==26297== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
> sumseq:PETSc sumseq$
>
>
>
> Here is the run with 2 processors (-n 2)
>
> sumseq:PETSc sumseq$ valgrind mpiexec -n 2 ./petsctest -mat_view_info
> ==26301== Memcheck, a memory error detector
> ==26301== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
> ==26301== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
> ==26301== Command: mpiexec -n 2 ./petsctest -mat_view_info
> ==26301==
> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>  N:        46575
>  cores:            2
>  MPI TEST:  My rank is:           0
>  MPI TEST:  My rank is:           1
>  Rank            0  has range            0  and        23288
>  Rank            1  has range        23288  and        46575
>  Number of non-zero entries in matrix:      690339
>  Done setting matrix values...
>  between assembly
>  between assembly
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try
> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
> corruption errors
> [1]PETSC ERROR: likely location of problem given in stack below
> [1]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
> [1]PETSC ERROR:       is given.
> [1]PETSC ERROR: [1] MatStashScatterGetMesg_Private line 609
> /usr/local/petsc-3.3-p2/src/mat/utils/matstash.c
> [1]PETSC ERROR: [1] MatAssemblyEnd_MPIAIJ line 646
> /usr/local/petsc-3.3-p2/src/mat/impls/aij/mpi/mpiaij.c
> [1]PETSC ERROR: [1] MatAssemblyEnd line 4857
> /usr/local/petsc-3.3-p2/src/mat/interface/matrix.c
> [1]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [1]PETSC ERROR: Signal received!
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00
> CDT 2012
> [1]PETSC ERROR: See docs/changes/index.html for recent updates.
> [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [1]PETSC ERROR: See docs/index.html for manual pages.
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: ./petsctest on a arch-darw named sumseq.predsci.com by
> sumseq Fri Jul 27 13:34:36 2012
> [1]PETSC ERROR: Libraries linked from
> /usr/local/petsc-3.3-p2/arch-darwin-c-debug/lib
> [1]PETSC ERROR: Configure run at Fri Jul 27 13:28:26 2012
> [1]PETSC ERROR: Configure options --with-debugging=1
> [1]PETSC ERROR:
> ------------------------------------------------------------------------
> [1]PETSC ERROR: User provided function() line 0 in unknown directory
> unknown file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
> [cli_1]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
> ==26301==
> ==26301== HEAP SUMMARY:
> ==26301==     in use at exit: 139,984 bytes in 65 blocks
> ==26301==   total heap usage: 1,001 allocs, 936 frees, 234,886 bytes
> allocated
> ==26301==
> ==26301== LEAK SUMMARY:
> ==26301==    definitely lost: 0 bytes in 0 blocks
> ==26301==    indirectly lost: 0 bytes in 0 blocks
> ==26301==      possibly lost: 0 bytes in 0 blocks
> ==26301==    still reachable: 139,984 bytes in 65 blocks
> ==26301==         suppressed: 0 bytes in 0 blocks
> ==26301== Rerun with --leak-check=full to see details of leaked memory
> ==26301==
> ==26301== For counts of detected and suppressed errors, rerun with: -v
> ==26301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
> sumseq:PETSc sumseq$
>
>
>
>  - Ron
>
>
>
>
>
>
> On Fri, Jul 27, 2012 at 1:19 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>
>> 1. Check for memory leaks using Valgrind (an example invocation is sketched
>> after this list).
>>
>> 2. Be sure to run --with-debugging=1 (the default) when trying to find
>> the error.
>>
>> 3. Send the full error message and the relevant bit of code.
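>>
>> For step 1 with an MPI run, the usual pattern is to put valgrind inside
>> mpiexec so that each rank is instrumented (the exact flags here are just an
>> illustration):
>>
>>   mpiexec -n 2 valgrind --tool=memcheck -q ./petsctest -mat_view_info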
>>
>>
>> On Fri, Jul 27, 2012 at 3:17 PM, Ronald M. Caplan <caplanr at predsci.com> wrote:
>>
>>> Hello,
>>>
>>> I am running a simple test code which takes a sparse AIJ matrix in PETSc
>>> and multiplies it by a vector.
>>>
>>> The matrix is defined as an AIJ MPI matrix.
>>>
>>> When I run the program on a single core, it runs fine.
>>>
>>> When I run it using MPI with multiple processes (I am on a 4-core,
>>> 8-thread Mac), the code runs correctly for matrices up to a certain size
>>> (2880 x 2880), but when the matrix is set to be larger, the code crashes
>>> with a segfault and the error says it was in MatAssemblyEnd().  It
>>> sometimes works with -n 2, but it typically crashes whenever more than
>>> one core is used.
>>>
>>> Any ideas on what it could be?
>>>
>>> Thanks,
>>>
>>> Ron Caplan
>>>
>>
>>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener