[petsc-users] MatTransposeMatMult ends up with an MPI error
Thomas Witkowski
thomas.witkowski at tu-dresden.de
Wed Oct 17 14:57:05 CDT 2012
On 17.10.2012 17:50, Hong Zhang wrote:
> Thomas:
>
> Does this occur only for large matrices?
> Can you dump your matrices into petsc binary files
> (e.g., A.dat, B.dat) and send to us for debugging?
>
> Recently, we added a new implementation of MatTransposeMatMult() in
> petsc-dev,
> which has been shown to be much faster than the released MatTransposeMatMult().
> You might give it a try by
> 1. install petsc-dev (see
> http://www.mcs.anl.gov/petsc/developers/index.html)
> 2. run your code with option '-mattransposematmult_viamatmatmult 1'
> Let us know what you get.
>
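Regarding dumping the matrix: I write it out roughly like this (a minimal
sketch, not my exact code; the file name "A.dat" and the abbreviated error
handling are placeholders):

  PetscViewer viewer;
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = MatView(A, viewer);CHKERRQ(ierr);   /* A is the parallel AIJ matrix */
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
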
I checked the problem with petsc-dev. Here, the code just hangs
somewhere inside MatTransposeMatMult. I checked what MatTranspose does
on the corresponding matrix, and the behavior is the same. I extracted
the matrix from my simulations; it is of size 123,432 x 1,533,726 and
very sparse (2 to 8 nonzeros per row). I'm sorry, but this is the
smallest matrix for which I found the problem (I will send the matrix
file to petsc-maint). I wrote a small piece of code that just reads the
matrix and runs MatTranspose (a sketch of it is included below). With
1 MPI task, it works fine. With a small number of MPI tasks (around 8),
I get the following error message:
[1]PETSC ERROR:
------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the
batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to
find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: --------------------- Stack Frames
------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR: INSTEAD the line number of the start of the function
[1]PETSC ERROR: is given.
[1]PETSC ERROR: [1] PetscSFReduceEnd line 1259 src/sys/sf/sf.c
[1]PETSC ERROR: [1] MatTranspose_MPIAIJ line 2045
src/mat/impls/aij/mpi/mpiaij.c
[1]PETSC ERROR: [1] MatTranspose line 4341 src/mat/interface/matrix.c
With 32 MPI tasks, which I also use in my simulation, the code hangs in
MatTranspose.
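
For reference, the small test driver is essentially the following (a minimal
sketch from memory; the file name "A.dat" and the error handling are
placeholders, not the exact code):

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat            A, At;
    PetscViewer    viewer;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);

    /* read the extracted 123,432 x 1,533,726 matrix from a PETSc binary file */
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
    ierr = MatLoad(A, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

    /* the explicit transpose is where the error (or hang) shows up */
    ierr = MatTranspose(A, MAT_INITIAL_MATRIX, &At);CHKERRQ(ierr);

    ierr = MatDestroy(&At);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }
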
If there is something more I can do to help you find the problem,
please let me know!
Thomas
> Hong
>
> My code makes use of the function MatTransposeMatMult, and usually
> it works fine! For some larger input data, it now stops with a lot
> of MPI errors:
>
> fatal error in PMPI_Barrier: Other MPI error, error stack:
> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
> MPIR_Barrier(82)...:
> MPI_Waitall(261): MPI_Waitall(count=9, req_array=0xa787ba0,
> status_array=0xa789240) failed
> MPI_Waitall(113): The supplied request in array element 8 was
> invalid (kind=0)
> Fatal error in PMPI_Barrier: Other MPI error, error stack:
> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
> MPIR_Barrier(82)...:
> mpid_irecv_done(98): read from socket failed - request
> state:recv(pde)done
>
>
> Here is the stack print from the debugger:
>
> 6, MatTransposeMatMult (matrix.c:8907)
> 6, MatTransposeMatMult_MPIAIJ_MPIAIJ
> (mpimatmatmult.c:809)
> 6, MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ
> (mpimatmatmult.c:1136)
> 6, PetscGatherMessageLengths2 (mpimesg.c:213)
> 6, PMPI_Waitall
> 6, MPIR_Err_return_comm
> 6, MPID_Abort
>
>
> I use PETSc 3.3-p3. Any idea whether this is or could be related
> to some bug in PETSc, or whether I am using the function incorrectly
> in some way?
>
> Thomas
>
>
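
For context, the call in my code is essentially the following (a sketch only;
the matrix names A and B and the fill estimate of 1.0 are placeholders):

  Mat C;
  /* C = A^T * B */
  ierr = MatTransposeMatMult(A, B, MAT_INITIAL_MATRIX, 1.0, &C);CHKERRQ(ierr);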