[petsc-users] segfault in MatAssemblyEnd() when using large matrices on multi-core MAC OS-X

Matthew Knepley knepley at gmail.com
Mon Jul 30 17:11:49 CDT 2012


On Mon, Jul 30, 2012 at 5:04 PM, Ronald M. Caplan <caplanr at predsci.com> wrote:

> Hi everyone,
>
> I seem to have solved the problem.
>
> I was storing my entire matrix on node 0 and then calling MatAssembly
> (begin and end) on all nodes (which should have worked...).
>
> Apparently I was using too much space for the buffering or the like,
> because when I changed the code so that each node sets its own matrix
> values, the MatAssemblyEnd no longer seg faults.
>

Hmm, it should give a nice error, not a SEGV, so I am still interested in
the stack trace.
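
(The error output below already points at one way to get it: rerun with
-on_error_attach_debugger or -start_in_debugger, e.g.
mpiexec -n 2 ./petsctest -on_error_attach_debugger, so that a debugger
attaches on the faulting rank and can print the stack.)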


> Why should this be the case?  How many elements of a vector or matrix can
> a single node "set" before calling Assembly to distribute them over all
> nodes?
>

If you are going to set a ton of elements, consider using
MAT_FLUSH_ASSEMBLY and calling MatAssemblyBegin/End a few times during the
loop.
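
For example, a rough sketch of that pattern (keeping the rank-0-only
insertion from your code, and assuming PetscInt variables istart, iend,
and chunk are declared; the chunk size of 10000 is arbitrary) would be:

      chunk = 10000
      !All ranks loop over the same row chunks; rank 0 inserts the
      !values, and every rank joins the collective flush afterwards:
      DO istart=1,N,chunk
         iend = min(istart+chunk-1,N)
         IF (rank .eq. 0) THEN
            DO i=istart,iend
               DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
                  call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
     &                             INSERT_VALUES,ierr)
               END DO
            END DO
         END IF
         !Flush sends any buffered off-process entries now instead
         !of letting them pile up until the final assembly:
         call MatAssemblyBegin(A,MAT_FLUSH_ASSEMBLY,ierr)
         call MatAssemblyEnd(A,MAT_FLUSH_ASSEMBLY,ierr)
      END DO
      call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
      call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)

The flushes are collective, so every rank has to hit them at the same
points, which is why the chunk loop runs on all ranks. That said, having
each rank set only the rows it owns, as you are doing now, avoids pushing
everything through the stash in the first place.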

    Matt


>  - Ron C
>
>
>
>
> On Fri, Jul 27, 2012 at 2:14 PM, Ronald M. Caplan <caplanr at predsci.com> wrote:
>
>> Hi,
>>
>> I do not know how to get the stack trace.
>>
>> Attached is the code and makefile.
>>
>> The value of npts is set to 25, which is where the code crashes with more
>> than one core running.  If I set npts to around 10, then the code works
>> with up to 12 processes (fast, too!), but with more than that it crashes
>> as well.
>>
>> Thanks for your help!
>>
>>  - Ron C
>>
>>
>> On Fri, Jul 27, 2012 at 1:52 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>
>>> On Fri, Jul 27, 2012 at 3:35 PM, Ronald M. Caplan <caplanr at predsci.com> wrote:
>>>
>>>> 1) Checked it, had no leaks or any other problems that I could see.
>>>>
>>>> 2) Ran it with debugging and without.  The debugging is how I know it
>>>> was in MatAssemblyEnd().
>>>>
>>>
>>> It's rare for valgrind not to catch something, but it happens. From
>>> here I would really like:
>>>
>>>   1) The stack trace from the fault
>>>
>>>   2) The code to run here
>>>
>>> This is one of the oldest and most used pieces of PETSc. It's difficult
>>> to believe that the bug is there rather than being the result of earlier
>>> memory corruption.
>>>
>>>    Thanks,
>>>
>>>       Matt
>>>
>>>
>>>> 3)  Here is the matrix part of the code:
>>>>
>>>> !Create matrix:
>>>>       call MatCreate(PETSC_COMM_WORLD,A,ierr)
>>>>       call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N,ierr)
>>>>       call MatSetType(A,MATMPIAIJ,ierr)
>>>>       call MatSetFromOptions(A,ierr)
>>>>       !print*,'3nrt: ',3*nr*nt
>>>>       i = 16
>>>>       IF(size .eq. 1) THEN
>>>>           j = 0
>>>>       ELSE
>>>>           j = 8
>>>>       END IF
>>>>       call MatMPIAIJSetPreallocation(A,i,PETSC_NULL_INTEGER,
>>>>      &                               j,PETSC_NULL_INTEGER,ierr)
>>>>
>>>>       !Do not call this if using preallocation!
>>>>       !call MatSetUp(A,ierr)
>>>>
>>>>       call MatGetOwnershipRange(A,i,j,ierr)
>>>>       print*,'Rank ',rank,' has range ',i,' and ',j
>>>>
>>>>       !Get MAS matrix in CSR format (random numbers for now):
>>>>       IF (rank .eq. 0) THEN
>>>>          call GET_RAND_MAS_MATRIX(CSR_A,CSR_AI,CSR_AJ,nr,nt,np,M)
>>>>          print*,'Number of non-zero entries in matrix:',M
>>>>           !Store matrix values one-by-one (inefficient:  better way
>>>>              !   more complicated - implement later)
>>>>
>>>>          DO i=1,N
>>>>            !print*,'numofnonzerosinrowi:',CSR_AJ(i+1)-CSR_AJ(i)+1
>>>>             DO j=CSR_AJ(i)+1,CSR_AJ(i+1)
>>>>                call MatSetValue(A,i-1,CSR_AI(j),CSR_A(j),
>>>>      &                               INSERT_VALUES,ierr)
>>>>
>>>>             END DO
>>>>          END DO
>>>>          print*,'Done setting matrix values...'
>>>>       END IF
>>>>
>>>>       !Assemble matrix A across all cores:
>>>>       call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
>>>>       print*,'between assembly'
>>>>       call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)
>>>>
>>>>
>>>>
>>>> A couple things to note:
>>>> a) my CSR_AJ is what most people would call ai, etc.
>>>> b) my CSR index values are 0-based, but the Fortran arrays themselves
>>>> are 1-indexed.
>>>>
>>>>
>>>>
>>>> Here is the run with one processor (-n 1):
>>>>
>>>> sumseq:PETSc sumseq$ valgrind mpiexec -n 1 ./petsctest -mat_view_info
>>>> ==26297== Memcheck, a memory error detector
>>>> ==26297== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et
>>>> al.
>>>> ==26297== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright
>>>> info
>>>> ==26297== Command: mpiexec -n 1 ./petsctest -mat_view_info
>>>> ==26297==
>>>> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>>>>  N:        46575
>>>>  cores:            1
>>>>  MPI TEST:  My rank is:           0
>>>>  Rank            0  has range            0  and        46575
>>>>  Number of non-zero entries in matrix:      690339
>>>>  Done setting matrix values...
>>>>  between assembly
>>>> Matrix Object: 1 MPI processes
>>>>   type: mpiaij
>>>>   rows=46575, cols=46575
>>>>   total: nonzeros=690339, allocated nonzeros=745200
>>>>   total number of mallocs used during MatSetValues calls =0
>>>>     not using I-node (on process 0) routines
>>>>  PETSc y=Ax time:      367.9164     nsec/mp.
>>>>  PETSc y=Ax flops:    0.2251188     GFLOPS.
>>>> ==26297==
>>>> ==26297== HEAP SUMMARY:
>>>> ==26297==     in use at exit: 139,984 bytes in 65 blocks
>>>> ==26297==   total heap usage: 938 allocs, 873 frees, 229,722 bytes
>>>> allocated
>>>> ==26297==
>>>> ==26297== LEAK SUMMARY:
>>>> ==26297==    definitely lost: 0 bytes in 0 blocks
>>>> ==26297==    indirectly lost: 0 bytes in 0 blocks
>>>> ==26297==      possibly lost: 0 bytes in 0 blocks
>>>> ==26297==    still reachable: 139,984 bytes in 65 blocks
>>>> ==26297==         suppressed: 0 bytes in 0 blocks
>>>> ==26297== Rerun with --leak-check=full to see details of leaked memory
>>>> ==26297==
>>>> ==26297== For counts of detected and suppressed errors, rerun with: -v
>>>> ==26297== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
>>>> sumseq:PETSc sumseq$
>>>>
>>>>
>>>>
>>>> Here is the run with 2 processors (-n 2)
>>>>
>>>> sumseq:PETSc sumseq$ valgrind mpiexec -n 2 ./petsctest -mat_view_info
>>>> ==26301== Memcheck, a memory error detector
>>>> ==26301== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et
>>>> al.
>>>> ==26301== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright
>>>> info
>>>> ==26301== Command: mpiexec -n 2 ./petsctest -mat_view_info
>>>> ==26301==
>>>> UNKNOWN task message [id 3403, to mach_task_self(), reply 0x2803]
>>>>  N:        46575
>>>>  cores:            2
>>>>  MPI TEST:  My rank is:           0
>>>>  MPI TEST:  My rank is:           1
>>>>  Rank            0  has range            0  and        23288
>>>>  Rank            1  has range        23288  and        46575
>>>>  Number of non-zero entries in matrix:      690339
>>>>  Done setting matrix values...
>>>>  between assembly
>>>>  between assembly
>>>> [1]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>>> probably memory access out of range
>>>> [1]PETSC ERROR: Try option -start_in_debugger or
>>>> -on_error_attach_debugger
>>>> [1]PETSC ERROR: or see
>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>> [1]PETSC ERROR: or try
>>>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>>>> corruption errors
>>>> [1]PETSC ERROR: likely location of problem given in stack below
>>>> [1]PETSC ERROR: ---------------------  Stack Frames
>>>> ------------------------------------
>>>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not
>>>> available,
>>>> [1]PETSC ERROR:       INSTEAD the line number of the start of the
>>>> function
>>>> [1]PETSC ERROR:       is given.
>>>> [1]PETSC ERROR: [1] MatStashScatterGetMesg_Private line 609
>>>> /usr/local/petsc-3.3-p2/src/mat/utils/matstash.c
>>>> [1]PETSC ERROR: [1] MatAssemblyEnd_MPIAIJ line 646
>>>> /usr/local/petsc-3.3-p2/src/mat/impls/aij/mpi/mpiaij.c
>>>> [1]PETSC ERROR: [1] MatAssemblyEnd line 4857
>>>> /usr/local/petsc-3.3-p2/src/mat/interface/matrix.c
>>>> [1]PETSC ERROR: --------------------- Error Message
>>>> ------------------------------------
>>>> [1]PETSC ERROR: Signal received!
>>>> [1]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [1]PETSC ERROR: Petsc Release Version 3.3.0, Patch 2, Fri Jul 13
>>>> 15:42:00 CDT 2012
>>>> [1]PETSC ERROR: See docs/changes/index.html for recent updates.
>>>> [1]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>>>> [1]PETSC ERROR: See docs/index.html for manual pages.
>>>> [1]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [1]PETSC ERROR: ./petsctest on a arch-darw named sumseq.predsci.com by
>>>> sumseq Fri Jul 27 13:34:36 2012
>>>> [1]PETSC ERROR: Libraries linked from
>>>> /usr/local/petsc-3.3-p2/arch-darwin-c-debug/lib
>>>> [1]PETSC ERROR: Configure run at Fri Jul 27 13:28:26 2012
>>>> [1]PETSC ERROR: Configure options --with-debugging=1
>>>> [1]PETSC ERROR:
>>>> ------------------------------------------------------------------------
>>>> [1]PETSC ERROR: User provided function() line 0 in unknown directory
>>>> unknown file
>>>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
>>>> [cli_1]: aborting job:
>>>> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
>>>> ==26301==
>>>> ==26301== HEAP SUMMARY:
>>>> ==26301==     in use at exit: 139,984 bytes in 65 blocks
>>>> ==26301==   total heap usage: 1,001 allocs, 936 frees, 234,886 bytes
>>>> allocated
>>>> ==26301==
>>>> ==26301== LEAK SUMMARY:
>>>> ==26301==    definitely lost: 0 bytes in 0 blocks
>>>> ==26301==    indirectly lost: 0 bytes in 0 blocks
>>>> ==26301==      possibly lost: 0 bytes in 0 blocks
>>>> ==26301==    still reachable: 139,984 bytes in 65 blocks
>>>> ==26301==         suppressed: 0 bytes in 0 blocks
>>>> ==26301== Rerun with --leak-check=full to see details of leaked memory
>>>> ==26301==
>>>> ==26301== For counts of detected and suppressed errors, rerun with: -v
>>>> ==26301== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)
>>>> sumseq:PETSc sumseq$
>>>>
>>>>
>>>>
>>>>  - Ron
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 27, 2012 at 1:19 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>>>>
>>>>> 1. Check for memory leaks using Valgrind.
>>>>>
>>>>> 2. Be sure to run --with-debugging=1 (the default) when trying to find
>>>>> the error.
>>>>>
>>>>> 3. Send the full error message and the relevant bit of code.
>>>>>
>>>>>
>>>>> On Fri, Jul 27, 2012 at 3:17 PM, Ronald M. Caplan <caplanr at predsci.com
>>>>> > wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am running a simple test code which takes a sparse AIJ matrix in
>>>>>> PETSc and multiplies it by a vector.
>>>>>>
>>>>>> The matrix is defined as an AIJ MPI matrix.
>>>>>>
>>>>>> When I run the program on a single core, it runs fine.
>>>>>>
>>>>>> When I run it using MPI with multiple processes (I am on a 4-core,
>>>>>> 8-thread Mac), I can get the code to run correctly for matrices under
>>>>>> a certain size (2880 x 2880), but when the matrix is larger, the code
>>>>>> crashes with a segfault and the error says it was in MatAssemblyEnd().
>>>>>> Sometimes it works with -n 2, but typically it crashes when using more
>>>>>> than one core.
>>>>>>
>>>>>> Any ideas on what it could be?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ron Caplan
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

