[petsc-users] 'Inserting a new nonzero' issue on a reassembled matrix in parallel

Thibaut Appel t.appel17 at imperial.ac.uk
Thu Oct 24 05:09:50 CDT 2019


Hi Matthew,

Thanks for having a look, your example runs just like mine in Fortran.

In serial, the value (0.0,0.0) was inserted even though it shouldn't have been.

In parallel, you'll see that an error "Inserting a new nonzero at global 
row/column" is triggered.

In both cases, MatSetValues tries to insert a zero value even though
MAT_IGNORE_ZERO_ENTRIES was set. That's what Barry is looking into, if I'm
not mistaken.
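
Since attachments get scrubbed on the list archive, here is a rough sketch in C of what the tiny test does (along the lines of Matthew's conversion; the matrix size and the (0,9) location match the output quoted below, while the diagonal-only preallocation and the variable names are just my guess at the simplest setup, not the actual attached code):

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A;
      PetscInt       i, rstart, rend, n = 10, row = 0, col = 9;
      PetscScalar    one, zero = 0.0;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      one  = 1.0 + PETSC_i;  /* assumes a --with-scalar-type=complex build, as in this thread */

      ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
      ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
      ierr = MatSetFromOptions(A);CHKERRQ(ierr);
      ierr = MatSeqAIJSetPreallocation(A, 1, NULL);CHKERRQ(ierr);          /* diagonal only */
      ierr = MatMPIAIJSetPreallocation(A, 1, NULL, 0, NULL);CHKERRQ(ierr);
      ierr = MatSetOption(A, MAT_IGNORE_ZERO_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);

      /* First assembly: (1 + 1i) on the diagonal, entirely within the preallocation */
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        ierr = MatSetValues(A, 1, &i, 1, &i, &one, ADD_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatSetOption(A, MAT_NEW_NONZERO_LOCATION_ERR, PETSC_TRUE);CHKERRQ(ierr);

      /* Second assembly: ADD an exact (0.0,0.0) at the non-preallocated location (0,9).
         With MAT_IGNORE_ZERO_ENTRIES the value should simply be dropped; instead the
         entry ends up in the matrix in serial and triggers the "Inserting a new
         nonzero" error in parallel. */
      if (rstart == 0) {
        ierr = MatSetValues(A, 1, &row, 1, &col, &zero, ADD_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatView(A, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);

      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }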


Thibaut


On 24/10/2019 02:31, Matthew Knepley wrote:
> On Tue, Oct 22, 2019 at 1:37 PM Thibaut Appel 
> <t.appel17 at imperial.ac.uk <mailto:t.appel17 at imperial.ac.uk>> wrote:
>
>     Hi both,
>
>     Please find attached a tiny example (in Fortran, sorry Matthew)
>     that - I think - reproduces the problem we mentioned.
>
>     Let me know.
>
> Okay, I converted to C so I could understand, and it runs fine for me:
>
> master *:~/Downloads/tmp$ PETSC_ARCH=arch-master-complex-debug make main
>
> /PETSc3/petsc/bin/mpicc -o main.o -c -Wall -Wwrite-strings 
> -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector 
> -Qunused-arguments -fvisibility=hidden -g3 
> -I/PETSc3/petsc/petsc-dev/include 
> -I/PETSc3/petsc/petsc-dev/arch-master-complex-debug/include 
> -I/PETSc3/petsc/include -I/opt/X11/include `pwd`/main.c
>
> /PETSc3/petsc/bin/mpicc -Wl,-multiply_defined,suppress 
> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs 
> -Wl,-search_paths_first -Wl,-no_compact_unwind -Wall -Wwrite-strings 
> -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector 
> -Qunused-arguments -fvisibility=hidden -g3 -o main main.o 
> -Wl,-rpath,/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -L/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -Wl,-rpath,/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -L/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -Wl,-rpath,/PETSc3/petsc/lib -L/PETSc3/petsc/lib 
> -Wl,-rpath,/opt/X11/lib -L/opt/X11/lib 
> -Wl,-rpath,/usr/local/lib/gcc/x86_64-apple-darwin10.4.0/4.6.0 
> -L/usr/local/lib/gcc/x86_64-apple-darwin10.4.0/4.6.0 
> -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lpetsc -lfftw3_mpi -lfftw3 
> -llapack -lblas -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 
> -lchaco -lparmetis -lmetis -ltriangle -lz -lX11 -lctetgen -lstdc++ 
> -ldl -lmpichf90 -lpmpich -lmpich -lopa -lmpl -lpthread -lgfortran 
> -lgcc_s.10.5 -lstdc++ -ldl
>
> master *:~/Downloads/tmp$ ./main
>
> After first assembly:
>
> Mat Object: 1 MPI processes
> type: seqaij
> row 0: (0, 1. + 1. i)
> row 1: (1, 1. + 1. i)
> row 2: (2, 1. + 1. i)
> row 3: (3, 1. + 1. i)
> row 4: (4, 1. + 1. i)
> row 5: (5, 1. + 1. i)
> row 6: (6, 1. + 1. i)
> row 7: (7, 1. + 1. i)
> row 8: (8, 1. + 1. i)
> row 9: (9, 1. + 1. i)
>
> After second assembly:
>
> Mat Object: 1 MPI processes
> type: seqaij
> row 0: (0, 1. + 1. i)
> row 1: (1, 1. + 1. i)
> row 2: (2, 1. + 1. i)
> row 3: (3, 1. + 1. i)
> row 4: (4, 1. + 1. i)
> row 5: (5, 1. + 1. i)
> row 6: (6, 1. + 1. i)
> row 7: (7, 1. + 1. i)
> row 8: (8, 1. + 1. i)
> row 9: (9, 1. + 1. i)
>
> row 0 col: 9 val: 0. + 0. i
>
> I attach my code.  So then I ran your Fortran:
>
> /PETSc3/petsc/bin/mpif90 -c -Wall -ffree-line-length-0 
> -Wno-unused-dummy-argument -Wno-unused-variable 
> -g -I/PETSc3/petsc/petsc-dev/include 
> -I/PETSc3/petsc/petsc-dev/arch-master-complex-debug/include 
> -I/PETSc3/petsc/include -I/opt/X11/include -o main2.o main2.F90
>
> /PETSc3/petsc/bin/mpif90 -Wl,-multiply_defined,suppress 
> -Wl,-multiply_defined -Wl,suppress -Wl,-commons,use_dylibs 
> -Wl,-search_paths_first -Wl,-no_compact_unwind -Wall 
> -ffree-line-length-0 -Wno-unused-dummy-argument -Wno-unused-variable 
> -g -o main2 main2.o 
> -Wl,-rpath,/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -L/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -Wl,-rpath,/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -L/PETSc3/petsc/petsc-dev/arch-master-complex-debug/lib 
> -Wl,-rpath,/PETSc3/petsc/lib -L/PETSc3/petsc/lib 
> -Wl,-rpath,/opt/X11/lib -L/opt/X11/lib 
> -Wl,-rpath,/usr/local/lib/gcc/x86_64-apple-darwin10.4.0/4.6.0 
> -L/usr/local/lib/gcc/x86_64-apple-darwin10.4.0/4.6.0 
> -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lpetsc -lfftw3_mpi -lfftw3 
> -llapack -lblas -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 
> -lchaco -lparmetis -lmetis -ltriangle -lz -lX11 -lctetgen -lstdc++ 
> -ldl -lmpichf90 -lpmpich -lmpich -lopa -lmpl -lpthread -lgfortran 
> -lgcc_s.10.5 -lstdc++ -ldl
>
> master *:~/Downloads/tmp$ ./main2
>
> After first assembly:
>
> Mat Object: 1 MPI processes
> type: seqaij
> row 0: (0, 1.)
> row 1: (1, 1.)
> row 2: (2, 1.)
> row 3: (3, 1.)
> row 4: (4, 1.)
> row 5: (5, 1.)
> row 6: (6, 1.)
> row 7: (7, 1.)
> row 8: (8, 1.)
> row 9: (9, 1.)
>
> After second assembly:
>
> Mat Object: 1 MPI processes
> type: seqaij
> row 0: (0, 1.)
> row 1: (1, 1.)
> row 2: (2, 1.)
> row 3: (3, 1.)
> row 4: (4, 1.)
> row 5: (5, 1.)
> row 6: (6, 1.)
> row 7: (7, 1.)
> row 8: (8, 1.)
> row 9: (9, 1.)
>
> row:0 col:9 val:0.000000000000000000E+000.000000000000000000E+00
>
>
> I am not seeing an error. Am I not running it correctly?
>
>   Thanks,
>
>      Matt
>
>     Thibaut
>
>
>     On 22/10/2019 17:48, Matthew Knepley wrote:
>>     On Tue, Oct 22, 2019 at 12:43 PM Thibaut Appel via petsc-users
>>     <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>> wrote:
>>
>>         Hi Hong,
>>
>>         Thank you for having a look. I copied/pasted your code
>>         snippet into ex28.c, and the error indeed appears if you
>>         change that col[0]. That's because new non-zero locations
>>         in the matrix are forbidden once the option
>>         MAT_NEW_NONZERO_LOCATION_ERR is set.
>>
>>         I spent the day debugging the code and already checked my
>>         calls to MatSetValues:
>>
>>         For all MatSetValues calls corresponding to the row/col
>>         location reported in the error messages during the subsequent
>>         assembly, the numerical value associated with that row/col was
>>         exactly (0.0,0.0) (complex arithmetic), so it should not have
>>         been inserted given the option MAT_IGNORE_ZERO_ENTRIES. It
>>         seems MatSetValues inserted it anyway.
>>
>>     Okay, let's solve this problem first. You say that
>>
>>       - You called MatSetOption(A,  MAT_IGNORE_ZERO_ENTRIES, PETSC_TRUE)
>>       - You called MatSetValues(A, ,,,, ADD_VALUES, ..., val) and val
>>     had a complex 0 in it
>>       - PETSc tried to insert the complex 0
>>
>>     This should be really easy to test in a tiny example. Do you mind
>>     making it? If it's broken, I will fix it.
>>
>>       Thanks,
>>
>>         Matt
>>
>>         I was able to solve the problem by adding
>>         MatSetOption(L,MAT_NEW_NONZERO_LOCATIONS,PETSC_FALSE) after
>>         my first assembly. However, I don't know why that fixed it, as
>>         the manual seems to say this option is just for efficiency
>>         purposes:
>>
>>         It "should be specified after the first matrix has been fully
>>         assembled. This option ensures that certain data structures
>>         and communication information will be reused (instead of
>>         regenerated) during successive steps, thereby increasing
>>         efficiency."
>>
>>         So I'm still puzzled by why I got that error in the first
>>         place. Unless "regenerated" implies resetting some attributes
>>         of the preallocated non-zero structure / first assembly?
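>>
>>         To be concrete, the workaround amounts to one extra call right
>>         after the first assembly, roughly as follows (sketched in C
>>         notation here; my actual code does the equivalent in Fortran):
>>
>>             /* ... first, full assembly of L over the densest (M,N) case ... */
>>             ierr = MatAssemblyBegin(L, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>             ierr = MatAssemblyEnd(L, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>
>>             /* Workaround: freeze the nonzero structure. Insertions at new
>>                locations are now silently ignored, so the spurious zero
>>                "insertions" no longer abort with the new-nonzero error. */
>>             ierr = MatSetOption(L, MAT_NEW_NONZERO_LOCATIONS, PETSC_FALSE);CHKERRQ(ierr);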
>>
>>
>>         Thibaut
>>
>>
>>         On 22/10/2019 17:06, Zhang, Hong wrote:
>>>         Thibaut:
>>>         Check your code on MatSetValues(), likely you set a value
>>>         "to a new nonzero at global row/column (200, 160) into
>>>         matrix" L.
>>>         I tested petsc/src/mat/examples/tests/ex28.c by adding
>>>         @@ -95,6 +95,26 @@ int main(int argc,char **args)
>>>            /* Compute numeric factors using same F, then solve */
>>>            for (k = 0; k < num_numfac; k++) {
>>>              /* Get numeric factor of A[k] */
>>>         +    if (k==0) {
>>>         +      ierr = MatZeroEntries(A[0]);CHKERRQ(ierr);
>>>         +      for (i=rstart; i<rend; i++) {
>>>         +        col[0] = i-1; col[1] = i; col[2] = i+1;
>>>         +        if (i == 0) {
>>>         +          ierr = MatSetValues(A[k],1,&i,2,col+1,value+1,INSERT_VALUES);CHKERRQ(ierr);
>>>         +        } else if (i == N-1) {
>>>         +          ierr = MatSetValues(A[k],1,&i,2,col,value,INSERT_VALUES);CHKERRQ(ierr);
>>>         +        } else {
>>>         +          ierr = MatSetValues(A[k],1,&i,3,col,value,INSERT_VALUES);CHKERRQ(ierr);
>>>         +        }
>>>         +      }
>>>         +      if (!rank) {
>>>         +        i = N - 1; col[0] = N - 1;
>>>         +        ierr = MatSetValues(A[k],1,&i,1,col,value,INSERT_VALUES);CHKERRQ(ierr);
>>>         +      }
>>>         +      ierr = MatAssemblyBegin(A[k],MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>         +      ierr = MatAssemblyEnd(A[k],MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>         +    }
>>>         +
>>>
>>>         It works in both sequential and parallel. If I set col[0] =
>>>         0, then I get the same error as yours.
>>>         Hong
>>>
>>>             Dear PETSc developers,
>>>
>>>             I'm extending a validated matrix preallocation/assembly
>>>             part of my code to solve multiple linear systems with
>>>             MUMPS at each iteration of a main loop, following the
>>>             example src/mat/examples/tests/ex28.c that Hong Zhang
>>>             added a few weeks ago. The difference is that I'm using
>>>             just 1 matrix to solve different systems.
>>>
>>>             I'm trying to investigate a nasty bug arising when I try
>>>             to assemble that MPIAIJ matrix "for a second time". The
>>>             issue arises only in parallel; the serial case works fine.
>>>
>>>             I create one MPIAIJ matrix and preallocate it "perfectly",
>>>             using the case where I have the fewest zero entries in
>>>             the non-zero structure, before getting its symbolic
>>>             factorization.
>>>
>>>             Further on, in the main loop, I only change its entries,
>>>             *retaining the non-zero structure*.
>>>
>>>             Here is the simplified Fortran code I'm using:
>>>
>>>             ! Fill (M,N) case to ensure all non-zero entries are preallocated
>>>             CALL set_equations(M,N)
>>>
>>>             CALL alloc_matrix(L)
>>>               ! --> Calls MatSeqAIJSetPreallocation/MatMPIAIJSetPreallocation
>>>               ! --> Sets MAT_IGNORE_ZERO_ENTRIES, MAT_NEW_NONZERO_ALLOCATION_ERR, MAT_NO_OFF_PROC_ENTRIES to true
>>>
>>>             CALL assemble_matrix(L)
>>>               ! --> Calls MatSetValues with ADD_VALUES
>>>               ! --> Calls MatAssemblyBegin/MatAssemblyEnd
>>>
>>>             ! Tell PETSc that new non-zero insertions in the matrix are forbidden
>>>             CALL MatSetOption(L,MAT_NEW_NONZERO_LOCATION_ERR,PETSC_TRUE,ierr)
>>>
>>>             CALL set_mumps_parameters()
>>>
>>>             ! Get symbolic LU factorization using MUMPS
>>>             CALL MatGetFactor(L,MATSOLVERMUMPS,MAT_FACTOR_LU,F,ierr)
>>>             CALL MatGetOrdering(L,MATORDERINGNATURAL,rperm,cperm,ierr)
>>>             CALL MatLUFactorSymbolic(F,L,rperm,cperm,info,ierr)
>>>
>>>             CALL initialize_right_hand_sides()
>>>
>>>             ! Zero matrix entries
>>>             CALL MatZeroEntries(L,ierr)
>>>
>>>             ! Main loop
>>>             DO itr = 1, maxitr
>>>               DO m = 1, M
>>>                 DO n = 1, N
>>>
>>>                   CALL set_equations(m,n)
>>>                   CALL assemble_matrix(L) ! ERROR HERE when m=1, n=1: crash in the MatSetValues call
>>>
>>>                   ! Solve the linear system associated with (m,n)
>>>                   CALL MatLUFactorNumeric(F,L,info,ierr)
>>>                   CALL MatSolve(F,v_rhs(m,n),v_sol(m,n),ierr)
>>>
>>>                   ! Process v_rhs's from v_sol's for next iteration
>>>
>>>                   CALL MatZeroEntries(L,ierr)
>>>
>>>                 END DO
>>>               END DO
>>>             END DO
>>>
>>>
>>>             Testing on a small case, the error I get is
>>>
>>>             [1]PETSC ERROR: --------------------- Error Message
>>>             --------------------------------------------------------------
>>>             [1]PETSC ERROR: Argument out of range
>>>             [1]PETSC ERROR: Inserting a new nonzero at global
>>>             row/column (200, 160) into matrix
>>>             [1]PETSC ERROR: See
>>>             https://www.mcs.anl.gov/petsc/documentation/faq.html
>>>             <https://www.mcs.anl.gov/petsc/documentation/faq.html>
>>>             for trouble shooting.
>>>             [1]PETSC ERROR: Petsc Release Version 3.12.0, unknown
>>>             [1]PETSC ERROR: Configure options
>>>             --PETSC_ARCH=cplx_gcc_debug --with-scalar-type=complex
>>>             --with-precision=double --with-debugging=1
>>>             --with-valgrind=1 --with-debugger=gdb
>>>             --with-fortran-kernels=1 --download-mpich
>>>             --download-fblaslapack --download-scalapack
>>>             --download-metis --download-parmetis --download-ptscotch
>>>             --download-mumps --download-slepc --COPTFLAGS="-O0 -g"
>>>             --CXXOPTFLAGS="-O0 -g" --FOPTFLAGS="-O0 -g -fbacktrace"
>>>             [1]PETSC ERROR: #1 MatSetValues_MPIAIJ() line 634 in
>>>             /home/thibaut/Packages/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>             [1]PETSC ERROR: #2 MatSetValues() line 1375 in
>>>             /home/thibaut/Packages/petsc/src/mat/interface/matrix.c
>>>             [1]PETSC ERROR: #3 User provided function() line 0 in
>>>             User file
>>>             application called MPI_Abort(MPI_COMM_SELF, 63) - process 0
>>>
>>>
>>>             which I don't understand. That element was not in the
>>>             non-zero structure and wasn't preallocated. I printed the
>>>             value to be inserted at this location (200,160) and it is
>>>             exactly (0.0000000000000000,0.0000000000000000), so this
>>>             entry should not be inserted because of
>>>             MAT_IGNORE_ZERO_ENTRIES; however, it seems it is. I'm
>>>             using ADD_VALUES in MatSetValues, but that is the only
>>>             call where (200,160) is inserted.
>>>
>>>
>>>                 - I zero the matrix entries with MatZeroEntries,
>>>             which retains the non-zero structure (checked when I
>>>             print the matrix), but I also tried commenting out the
>>>             corresponding calls.
>>>
>>>                 - I tried setting MAT_NEW_NONZERO_LOCATION_ERR and
>>>             MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE, without
>>>             effect.
>>>
>>>
>>>             Perhaps there's something fundamentally wrong in my
>>>             approach; in any case, would you have any suggestions to
>>>             identify the exact problem?
>>>
>>>             Using PETSc 3.12.0. Thanks for your support,
>>>
>>>
>>>             Thibaut
>>>
>>
>>
>>     -- 
>>     What most experimenters take for granted before they begin their
>>     experiments is infinitely more interesting than any results to
>>     which their experiments lead.
>>     -- Norbert Wiener
>>
>>     https://www.cse.buffalo.edu/~knepley/
>>     <http://www.cse.buffalo.edu/~knepley/>
>
>
>
> -- 
> What most experimenters take for granted before they begin their 
> experiments is infinitely more interesting than any results to which 
> their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ 
> <http://www.cse.buffalo.edu/~knepley/>

