[petsc-users] segfault in MatAssemblyEnd() when using large matrices on multi-core MAC OS-X

Ronald M. Caplan caplanr at predsci.com
Fri Jul 27 15:35:33 CDT 2012


1) Checked it with Valgrind; there were no leaks or any other problems that
I could see.

2) Ran it both with and without debugging.  The debugging build is how I
know the crash was in MatAssemblyEnd().

3)  Here is the matrix part of the code:

!Create matrix:
      call MatCreate(PETSC_COMM_WORLD,A,ierr)
      call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N,ierr)
      call MatSetType(A,MATMPIAIJ,ierr)
      call MatSetFromOptions(A,ierr)
      !print*,'3nrt: ',3*nr*nt
      i = 16
      IF(size .eq. 1) THEN
          j = 0
      ELSE
          j = 8
      END IF
      call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,
     &                               j,PETSC_NULL_INTEGER,ierr)

      !Do not call this if using preallocation!
      !call MatSetUp(A,ierr)

      call MatGetOwnershipRange(A,i,j,ierr)
      print*,'Rank ',rank,' has range ',i,' and ',j

      !Get MAS matrix in CSR format (random numbers for now):
      IF (rank .eq. 0) THEN
         call GET_RAND_MAS_MATRIX(CSR_A,CSR_AI,CSR_AJ,nr,nt,np,M)
         print*,'Number of non-zero entries in matrix:',M
          !Store matrix values one-by-one (inefficient; a better way
          !is more complicated - implement later)

         DO i=1,N
           !print*,'numofnonzerosinrowi:',CSR_AJ(i+1)-CSR_AJ(i)+1
            DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
               call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
     &                               INSERT_VALUES,ierr)

            END DO
         END DO
         print*,'Done setting matrix values...'
      END IF

      !Assemble matrix A across all cores:
      call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
      print*,'between assembly'
      call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)



A couple of things to note:
a) my CSR_AJ is what most people would call ai (the row-pointer array), etc.
b) the CSR arrays themselves are 1-indexed (Fortran), but the values they
   hold (row pointers and column indices) are 0-based; see the small sketch
   below.
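For clarity, here is a tiny standalone sketch of that convention with
made-up values (no PETSc calls; it just prints the 0-based
(row, col, value) triples that the loop above would hand to MatSetValue):

      program csr_demo
      implicit none
      integer, parameter :: n = 2
      integer :: csr_aj(n+1), csr_ai(3), i, j
      double precision :: csr_a(3)
      !Hypothetical 2x2 matrix:  A = [ 4  1 ]
      !                              [ 0  3 ]
      data csr_aj / 0, 2, 3 /
      data csr_ai / 0, 1, 1 /
      data csr_a  / 4d0, 1d0, 3d0 /
      !csr_aj: row pointers, csr_ai: column indices; both hold 0-based
      !values but live in 1-based Fortran arrays.
      DO i = 1, n
         !Row i occupies positions csr_aj(i)+1 .. csr_aj(i+1).
         DO j = csr_aj(i)+1, csr_aj(i+1)
            !These are the 0-based (row, col, value) triples that
            !MatSetValue would receive.
            print*, i-1, csr_ai(j), csr_a(j)
         END DO
      END DO
      end program csr_demo

In the real code, N and the CSR arrays come from GET_RAND_MAS_MATRIX and
the print is replaced by the MatSetValue call shown above.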



Here is the run with one processor (-n 1):

sumseq:PETSc sumseq$ valgrind mpiexec -n 1 ./petsctest -mat_view_info
==26297== Memcheck, a memory error detector
==26297== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==26297== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==26297== Command: mpiexec -n 1 ./petsctest -mat_view_info
==26297==
UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
 N:        46575
 cores:            1
 MPI TEST:  My rank is:           0
 Rank            0  has range            0  and        46575
 Number of non-zero entries in matrix:      690339
 Done setting matrix values...
 between assembly
Matrix Object: 1 MPI processes
  type: mpiaij
  rows=46575, cols=46575
  total: nonzeros=690339, allocated nonzeros=745200
  total number of mallocs used during MatSetValues calls =0
    not using I-node (on process 0) routines
 PETSc y=Ax time:      367.9164     nsec/mp.
 PETSc y=Ax flops:    0.2251188     GFLOPS.
==26297==
==26297== HEAP SUMMARY:
==26297==     in use at exit: 139,984 bytes in 65 blocks
==26297==   total heap usage: 938 allocs, 873 frees, 229,722 bytes allocated
==26297==
==26297== LEAK SUMMARY:
==26297==    definitely lost: 0 bytes in 0 blocks
==26297==    indirectly lost: 0 bytes in 0 blocks
==26297==      possibly lost: 0 bytes in 0 blocks
==26297==    still reachable: 139,984 bytes in 65 blocks
==26297==         suppressed: 0 bytes in 0 blocks
==26297== Rerun with --leak-check=full to see details of leaked memory
==26297==
==26297== For counts of detected and suppressed errors, rerun with: -v
==26297== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
sumseq:PETSc sumseq$



Here is the run with two processors (-n 2):

sumseq:PETSc sumseq$ valgrind mpiexec -n 2 ./petsctest -mat_view_info
==26301== Memcheck, a memory error detector
==26301== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==26301== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==26301== Command: mpiexec -n 2 ./petsctest -mat_view_info
==26301==
UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
 N:        46575
 cores:            2
 MPI TEST:  My rank is:           0
 MPI TEST:  My rank is:           1
 Rank            0  has range            0  and        23288
 Rank            1  has range        23288  and        46575
 Number of non-zero entries in matrix:      690339
 Done setting matrix values...
 between assembly
 between assembly
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: ---------------------  Stack Frames
------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR:       INSTEAD the line number of the start of the function
[1]PETSC ERROR:       is given.
[1]PETSC ERROR: [1] MatStashScatterGetMesg_Private line 609
/usr/local/petsc-3.3-p2/src/mat/utils/matstash.c
[1]PETSC ERROR: [1] MatAssemblyEnd_MPIAIJ line 646
/usr/local/petsc-3.3-p2/src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: [1] MatAssemblyEnd line 4857
/usr/local/petsc-3.3-p2/src/mat/interface/matrix.c
[1]PETSC ERROR: --------------------- Error Message
------------------------------------
[1]PETSC ERROR: Signal received!
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00
CDT 2012
[1]PETSC ERROR: See docs/changes/index.html for recent updates.
[1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[1]PETSC ERROR: See docs/index.html for manual pages.
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: ./petsctest on a arch-darw named sumseq.predsci.com by
sumseq Fri Jul 27 13:34:36 2012
[1]PETSC ERROR: Libraries linked from
/usr/local/petsc-3.3-p2/arch-darwin-c-debug/lib
[1]PETSC ERROR: Configure run at Fri Jul 27 13:28:26 2012
[1]PETSC ERROR: Configure options --with-debugging=1
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: User provided function() line 0 in unknown directory
unknown file
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
==26301==
==26301== HEAP SUMMARY:
==26301==     in use at exit: 139,984 bytes in 65 blocks
==26301==   total heap usage: 1,001 allocs, 936 frees, 234,886 bytes
allocated
==26301==
==26301== LEAK SUMMARY:
==26301==    definitely lost: 0 bytes in 0 blocks
==26301==    indirectly lost: 0 bytes in 0 blocks
==26301==      possibly lost: 0 bytes in 0 blocks
==26301==    still reachable: 139,984 bytes in 65 blocks
==26301==         suppressed: 0 bytes in 0 blocks
==26301== Rerun with --leak-check=full to see details of leaked memory
==26301==
==26301== For counts of detected and suppressed errors, rerun with: -v
==26301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
sumseq:PETSc sumseq$



 - Ron






On Fri, Jul 27, 2012 at 1:19 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> 1. Check for memory leaks using Valgrind.
>
> 2. Be sure to run --with-debugging=1 (the default) when trying to find the
> error.
>
> 3. Send the full error message and the relevant bit of code.
>
>
> On Fri, Jul 27, 2012 at 3:17 PM, Ronald M. Caplan <caplanr at predsci.com>wrote:
>
>> Hello,
>>
>> I am running a simple test code which takes a sparse AIJ matrix in PETSc
>> and multiplies it by a vector.
>>
>> The matrix is defined as an AIJ MPI matrix.
>>
>> When I run the program on a single core, it runs fine.
>>
>> When I run it using MPI with multiple processes (I am on a 4-core,
>> 8-thread Mac), the code runs correctly for matrices under a certain size
>> (2880 x 2880), but when the matrix is made larger, the code crashes
>> with a segfault and the error says it was in MatAssemblyEnd().
>> It sometimes works with -n 2, but it usually crashes when running
>> multi-core.
>>
>> Any ideas on what it could be?
>>
>> Thanks,
>>
>> Ron Caplan
>>
>
>