[petsc-users] segfault in MatAssemblyEnd() when using large matrices on multi-core Mac OS X

Ronald M. Caplan caplanr at predsci.com
Fri Jul 27 16:14:13 CDT 2012


Hi,

I do not know how to get the stack trace.

Attached is the code and makefile.

The value of npts is set to 25, which is where the code crashes when more
than one core is running.  If I set npts to around 10, then the code works
with up to 12 processes (fast, too!), but with any more than that it crashes
as well.

Thanks for your help!

 - Ron C

On Fri, Jul 27, 2012 at 1:52 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Fri, Jul 27, 2012 at 3:35 PM, Ronald M. Caplan <caplanr at predsci.com>wrote:
>
>> 1) Checked it, had no leaks or any other problems that I could see.
>>
>> 2) Ran it with debugging and without.  The debugging is how I know it was
>> in MatAssemblyEnd().
>>
>
> It's rare for valgrind not to catch something, but it happens. From here
> I would really like:
>
>   1) The stack trace from the fault (see the note below on getting one)
>
>   2) The code to run here
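>
> (To get the stack trace: rerun the failing case with -start_in_debugger or
> -on_error_attach_debugger, e.g.
>
>   mpiexec -n 2 ./petsctest -on_error_attach_debugger
>
> which should drop the faulting rank into a debugger, where "bt" prints the
> backtrace, assuming gdb or lldb is installed.)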
>
> This is one of the oldest and most used pieces of PETSc. It's difficult to
> believe that the bug is there rather than a result of earlier memory
> corruption.
>
>    Thanks,
>
>       Matt
>
>
>> 3)  Here is the matrix part of the code:
>>
>> !Create matrix:
>>       call MatCreate(PETSC_COMM_WORLD,A,ierr)
>>       call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N,ierr)
>>       call MatSetType(A,MATMPIAIJ,ierr)
>>       call MatSetFromOptions(A,ierr)
>>       !print*,'3nrt: ',3*nr*nt
>>       i = 16
>>       IF(size .eq. 1) THEN
>>           j = 0
>>       ELSE
>>           j = 8
>>       END IF
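>>       !Preallocation: i (=16) nonzeros per row are reserved for the
>>       !diagonal block and j (=8 when running on more than one
>>       !process) per row for the off-diagonal block.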
>>       call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,
>>      &                               j,PETSC_NULL_INTEGER,ierr)
>>
>>       !Do not call this if using preallocation!
>>       !call MatSetUp(A,ierr)
>>
>>       call MatGetOwnershipRange(A,i,j,ierr)
>>       print*,'Rank ',rank,' has range ',i,' and ',j
>>
>>       !Get MAS matrix in CSR format (random numbers for now):
>>       IF (rank .eq. 0) THEN
>>          call GET_RAND_MAS_MATRIX(CSR_A,CSR_AI,CSR_AJ,nr,nt,np,M)
>>          print*,'Number of non-zero entries in matrix:',M
>>          !Store matrix values one-by-one (inefficient: a better
>>          !   way is more complicated - implement later; see the
>>          !   sketch after the notes below)
>>
>>          DO i=1,N
>>            !print*,'numofnonzerosinrowi:',CSR_AJ(i+1)-CSR_AJ(i)+1
>>             DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
>>                call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
>>      &                               INSERT_VALUES,ierr)
>>
>>             END DO
>>          END DO
>>          print*,'Done setting matrix values...'
>>       END IF
>>
>>       !Assemble matrix A across all cores:
>>       call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
>>       print*,'between assembly'
>>       call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)
>>
>>
>>
>> A couple things to note:
>> a) my CSR_AJ is what most people would call ai, etc.
>> b) my CSR array values are 0-indexed, but the arrays themselves are 1-indexed.
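>>
>> For reference, a rough (untested) sketch of the row-at-a-time insertion
>> mentioned in the comment above -- one MatSetValues call per row instead of
>> one MatSetValue call per entry, assuming the CSR arrays are laid out as in
>> a) and b), with irow an INTEGER array of length 1 and nnzrow an INTEGER:
>>
>>          DO i=1,N
>>             nnzrow = CSR_AJ(i+1) - CSR_AJ(i)
>>             irow(1) = i-1
>>             call MatSetValues(A,1,irow,nnzrow,
>>      &            CSR_AI(CSR_AJ(i)+1),CSR_A(CSR_AJ(i)+1),
>>      &            INSERT_VALUES,ierr)
>>          END DO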
>>
>>
>>
>> Here is the run with one processor (-n 1):
>>
>> sumseq:PETSc sumseq$ valgrind mpiexec -n 1 ./petsctest -mat_view_info
>> ==26297== Memcheck, a memory error detector
>> ==26297== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
>> ==26297== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright
>> info
>> ==26297== Command: mpiexec -n 1 ./petsctest -mat_view_info
>> ==26297==
>> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>>  N:        46575
>>  cores:            1
>>  MPI TEST:  My rank is:           0
>>  Rank            0  has range            0  and        46575
>>  Number of non-zero entries in matrix:      690339
>>  Done setting matrix values...
>>  between assembly
>> Matrix Object: 1 MPI processes
>>   type: mpiaij
>>   rows=46575, cols=46575
>>   total: nonzeros=690339, allocated nonzeros=745200
>>   total number of mallocs used during MatSetValues calls =0
>>     not using I-node (on process 0) routines
>>  PETSc y=Ax time:      367.9164     nsec/mp.
>>  PETSc y=Ax flops:    0.2251188     GFLOPS.
>> ==26297==
>> ==26297== HEAP SUMMARY:
>> ==26297==     in use at exit: 139,984 bytes in 65 blocks
>> ==26297==   total heap usage: 938 allocs, 873 frees, 229,722 bytes
>> allocated
>> ==26297==
>> ==26297== LEAK SUMMARY:
>> ==26297==    definitely lost: 0 bytes in 0 blocks
>> ==26297==    indirectly lost: 0 bytes in 0 blocks
>> ==26297==      possibly lost: 0 bytes in 0 blocks
>> ==26297==    still reachable: 139,984 bytes in 65 blocks
>> ==26297==         suppressed: 0 bytes in 0 blocks
>> ==26297== Rerun with --leak-check=full to see details of leaked memory
>> ==26297==
>> ==26297== For counts of detected and suppressed errors, rerun with: -v
>> ==26297== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
>> sumseq:PETSc sumseq$
>>
>>
>>
>> Here is the run with 2 processors (-n 2)
>>
>> sumseq:PETSc sumseq$ valgrind mpiexec -n 2 ./petsctest -mat_view_info
>> ==26301== Memcheck, a memory error detector
>> ==26301== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
>> ==26301== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright
>> info
>> ==26301== Command: mpiexec -n 2 ./petsctest -mat_view_info
>> ==26301==
>> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>>  N:        46575
>>  cores:            2
>>  MPI TEST:  My rank is:           0
>>  MPI TEST:  My rank is:           1
>>  Rank            0  has range            0  and        23288
>>  Rank            1  has range        23288  and        46575
>>  Number of non-zero entries in matrix:      690339
>>  Done setting matrix values...
>>  between assembly
>>  between assembly
>> [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [1]PETSC ERROR: or see
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [1]PETSC ERROR: or try
>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>> corruption errors
>> [1]PETSC ERROR: likely location of problem given in stack below
>> [1]PETSC ERROR: ---------------------  Stack Frames
>> ------------------------------------
>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>> available,
>> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [1]PETSC ERROR:       is given.
>> [1]PETSC ERROR: [1] MatStashScatterGetMesg_Private line 609
>> /usr/local/petsc-3.3-p2/src/mat/utils/matstash.c
>> [1]PETSC ERROR: [1] MatAssemblyEnd_MPIAIJ line 646
>> /usr/local/petsc-3.3-p2/src/mat/impls/aij/mpi/mpiaij.c
>> [1]PETSC ERROR: [1] MatAssemblyEnd line 4857
>> /usr/local/petsc-3.3-p2/src/mat/interface/matrix.c
>> [1]PETSC ERROR: --------------------- Error Message
>> ------------------------------------
>> [1]PETSC ERROR: Signal received!
>> [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13 15:42:00
>> CDT 2012
>> [1]PETSC ERROR: See docs/changes/index.html for recent updates.
>> [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> [1]PETSC ERROR: See docs/index.html for manual pages.
>> [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [1]PETSC ERROR: ./petsctest on a arch-darw named sumseq.predsci.com by
>> sumseq Fri Jul 27 13:34:36 2012
>> [1]PETSC ERROR: Libraries linked from
>> /usr/local/petsc-3.3-p2/arch-darwin-c-debug/lib
>> [1]PETSC ERROR: Configure run at Fri Jul 27 13:28:26 2012
>> [1]PETSC ERROR: Configure options --with-debugging=1
>> [1]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [1]PETSC ERROR: User provided function() line 0 in unknown directory
>> unknown file
>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
>> [cli_1]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
>> ==26301==
>> ==26301== HEAP SUMMARY:
>> ==26301==     in use at exit: 139,984 bytes in 65 blocks
>> ==26301==   total heap usage: 1,001 allocs, 936 frees, 234,886 bytes
>> allocated
>> ==26301==
>> ==26301== LEAK SUMMARY:
>> ==26301==    definitely lost: 0 bytes in 0 blocks
>> ==26301==    indirectly lost: 0 bytes in 0 blocks
>> ==26301==      possibly lost: 0 bytes in 0 blocks
>> ==26301==    still reachable: 139,984 bytes in 65 blocks
>> ==26301==         suppressed: 0 bytes in 0 blocks
>> ==26301== Rerun with --leak-check=full to see details of leaked memory
>> ==26301==
>> ==26301== For counts of detected and suppressed errors, rerun with: -v
>> ==26301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
>> sumseq:PETSc sumseq$
>>
>>
>>
>>  - Ron
>>
>>
>>
>>
>>
>>
>> On Fri, Jul 27, 2012 at 1:19 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>
>>> 1. Check for memory leaks using Valgrind (see the note after this list).
>>>
>>> 2. Be sure to run --with-debugging=1 (the default) when trying to find
>>> the error.
>>>
>>> 3. Send the full error message and the relevant bit of code.
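>>>
>>> (Note on 1: with MPI, valgrind should wrap the application itself rather
>>> than the launcher, e.g. mpiexec -n 2 valgrind ./petsctest, so that each
>>> rank actually runs under valgrind.)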
>>>
>>>
>>> On Fri, Jul 27, 2012 at 3:17 PM, Ronald M. Caplan <caplanr at predsci.com>wrote:
>>>
>>>> Hello,
>>>>
>>>> I am running a simple test code which takes a sparse AIJ matrix in
>>>> PETSc and multiplies it by a vector.
>>>>
>>>> The matrix is defined as an AIJ MPI matrix.
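>>>>
>>>> (The multiply itself is just the usual calls -- roughly, assuming a
>>>> vector x of global size N and a PetscScalar one set to 1.0:
>>>>
>>>>       call VecCreateMPI(PETSC_COMM_WORLD,PETSC_DECIDE,N,x,ierr)
>>>>       call VecDuplicate(x,y,ierr)
>>>>       call VecSet(x,one,ierr)
>>>>       call MatMult(A,x,y,ierr)
>>>>
>>>> presumably with the y=Ax timing taken around the MatMult call.)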
>>>>
>>>> When I run the program on a single core, it runs fine.
>>>>
>>>> When I run it using MPI with multiple processes (I am on a 4-core,
>>>> 8-thread Mac), the code runs correctly for matrices under a certain size
>>>> (2880 x 2880), but when the matrix is larger it crashes with a segfault,
>>>> and the error says it was in MatAssemblyEnd().  Sometimes it works with
>>>> -n 2, but typically it crashes whenever more than one core is used.
>>>>
>>>> Any ideas on what it could be?
>>>>
>>>> Thanks,
>>>>
>>>> Ron Caplan
>>>>
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsctest.F
Type: application/octet-stream
Size: 15159 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20120727/ec4a1b48/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makefile
Type: application/octet-stream
Size: 265 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20120727/ec4a1b48/attachment-0003.obj>

