[petsc-dev] [petsc-users] MatTransposeMatMult ends up with an MPI error

Barry Smith bsmith at mcs.anl.gov
Wed Oct 17 15:30:42 CDT 2012


On Oct 17, 2012, at 3:23 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> No. The problem is that Open MPI does not fix their critical bugs; they just downgrade them from "blocker" so they can make a release. It's not easy for them to fix because they need to refactor some lower-level protocols.

   Ahh yes. It is nice to have a sophisticated bug tracking system; it makes it easy to relabel bugs with a single click to avoid doing work. We should add this to PETSc :-)

   BTW: Shouldn't you have configure detect this issue and turn off the building of SF, or print an appropriate error message, so we don't get these confusing petsc-maint reports that only you can understand?


   Barry

> 
> On Wed, Oct 17, 2012 at 3:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    Could this problem be related to the fact that ALL MPI implementations fail to properly handle large counts that fit into an int when measured in 64-bit chunks (MPI_DOUBLE), but, once multiplied by 8 to convert to bytes, no longer fit into an int and thus roll over and screw up the MPI implementation? That is the reason I needed to implement MPIULong_Send() and use it in a few places.
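
A minimal standalone C sketch of the rollover described above (the count value and variable names are purely illustrative; this is not PETSc or MPI implementation code):

#include <limits.h>
#include <stdio.h>

int main(void)
{
  /* 3e8 MPI_DOUBLEs fits comfortably in a 32-bit int ... */
  int       count = 300000000;
  /* ... but the corresponding byte count does not. */
  long long bytes = (long long)count * 8;
  /* What a 32-bit product inside an implementation would yield:
     the length wraps around and typically becomes negative. */
  int       wrapped = (int)((unsigned int)count * 8u);

  printf("count   = %d doubles\n", count);
  printf("bytes   = %lld (INT_MAX = %d)\n", bytes, INT_MAX);
  printf("wrapped = %d\n", wrapped);
  return 0;
}

On a typical two's-complement system the wrapped value prints as a large negative number, which is the kind of corrupted length referred to above.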
> 
>    I know it is a different circumstance, but it has the same symptom of failing for big matrices but not small ones.
> 
> 
>     Barry
> 
> 
> 
> Begin forwarded message:
> 
>> From: Thomas Witkowski <thomas.witkowski at tu-dresden.de>
>> Subject: Re: [petsc-users] MatTransposeMatMult ends up with an MPI error
>> Date: October 17, 2012 2:57:05 PM CDT
>> To: petsc-users at mcs.anl.gov
>> Reply-To: PETSc users list <petsc-users at mcs.anl.gov>
>> 
>> On 17.10.2012 17:50, Hong Zhang wrote:
>>> Thomas:
>>> 
>>> Does this occur only for large matrices?
>>> Can you dump your matrices into PETSc binary files
>>> (e.g., A.dat, B.dat) and send them to us for debugging?
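
A minimal sketch of such a dump, assuming a matrix A already assembled on PETSC_COMM_WORLD (the helper name DumpMatrix is made up for illustration, not from the thread):

#include <petscmat.h>

/* Write a matrix to a PETSc binary file so it can later be read back with MatLoad(). */
PetscErrorCode DumpMatrix(Mat A, const char *filename)
{
  PetscViewer    viewer;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, filename, FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = MatView(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Calling DumpMatrix(A, "A.dat") and DumpMatrix(B, "B.dat") would produce files named as in the message above.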
>>> 
>>> Recently, we added a new implementation of MatTransposeMatMult() in petsc-dev
>>> which has been shown to be much faster than the released MatTransposeMatMult().
>>> You might give it a try by
>>> 1. install petsc-dev (see http://www.mcs.anl.gov/petsc/developers/index.html)
>>> 2. run your code with option '-mattransposematmult_viamatmatmult 1'
>>> Let us know what you get.
>>> 
>> I checked the problem with petsc-dev. Here, the code just hangs somewhere inside MatTransposeMatMult. I checked what MatTranspose does on the corresponding matrix, and the behavior is the same. I extracted the matrix from my simulations; it is of size 123,432 x 1,533,726 and very sparse (2 to 8 nonzeros per row). I'm sorry, but this is the smallest matrix for which I found the problem (I will send the matrix file to petsc-maint). I wrote a small piece of code that just reads the matrix and runs MatTranspose. With 1 MPI task, it works fine. With a small number of MPI tasks (around 8), I get the following error message:
>> 
>> [1]PETSC ERROR: ------------------------------------------------------------------------
>> [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [1]PETSC ERROR: likely location of problem given in stack below
>> [1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [1]PETSC ERROR:       INSTEAD the line number of the start of the function
>> [1]PETSC ERROR:       is given.
>> [1]PETSC ERROR: [1] PetscSFReduceEnd line 1259 src/sys/sf/sf.c
>> [1]PETSC ERROR: [1] MatTranspose_MPIAIJ line 2045 src/mat/impls/aij/mpi/mpiaij.c
>> [1]PETSC ERROR: [1] MatTranspose line 4341 src/mat/interface/matrix.c
>> 
>> 
>> With 32 MPI tasks, which is what I also use in my simulations, the code hangs in MatTranspose.
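
For reference, a minimal sketch of the kind of read-and-transpose test described above (the file name is an assumption, and the matrix type is left to the default/options):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, At;
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  /* Load the matrix from a PETSc binary file. */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* The reported failure occurs here when run on more than one MPI task. */
  ierr = MatTranspose(A, MAT_INITIAL_MATRIX, &At);CHKERRQ(ierr);

  ierr = MatDestroy(&At);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Running such a program on 1 task and then on 8 or 32 tasks (e.g., via mpiexec) would reproduce the difference in behavior described in the message.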
>> 
>> If there is anything more I can do to help you find the problem, please let me know!
>> 
>> Thomas
>> 
>>> Hong
>>> 
>>> My code makes use of the function MatTransposeMatMult, and usually it works fine! For some larger input data, it now stops with a lot of MPI errors:
>>> 
>>> fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
>>> MPIR_Barrier(82)...:
>>> MPI_Waitall(261): MPI_Waitall(count=9, req_array=0xa787ba0, status_array=0xa789240) failed
>>> MPI_Waitall(113): The supplied request in array element 8 was invalid (kind=0)
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
>>> MPIR_Barrier(82)...:
>>> mpid_irecv_done(98): read from socket failed - request state:recv(pde)done
>>> 
>>> 
>>> Here is the stack print from the debugger:
>>> 
>>> 6,                MatTransposeMatMult (matrix.c:8907)
>>> 6,                  MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:809)
>>> 6,                    MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ (mpimatmatmult.c:1136)
>>> 6,                      PetscGatherMessageLengths2 (mpimesg.c:213)
>>> 6,                        PMPI_Waitall
>>> 6,                          MPIR_Err_return_comm
>>> 6,                            MPID_Abort
>>> 
>>> 
>>> I use PETSc 3.3-p3. Any idea whether this could be related to some bug in PETSc, or whether I am using the function incorrectly in some way?
>>> 
>>> Thomas
>>> 
>>> 
>> 
>> 
> 
> 



