[petsc-dev] [petsc-maint #75297] Issue when saving an MPI dense matrix

Barry Smith bsmith at mcs.anl.gov
Fri Jun 3 11:30:58 CDT 2011


  Simone,

     This is because we are trying to send messages that are too long for MPI to handle. This is a problem for two reasons:

1) MPI "count" arguments are always int, when we use 64 bit PetscInt (because of the --with-64-bit-indices PetscInt becomes long long int) this means we "may" be passing values too large as count values to MPI and because C/C++ automatically castes long long int arguments to int it ends up passing garbage values to the MPI libraries.  Now I say "may" because this is only a problem if a count happens to be so large it won't fit in an int.

2) Even when the "count" values passed to MPI are valid int values, we've found that none of the MPI implementations handle counts correctly once they get within a factor of 4 or 8 of the largest value an int can hold. This is because the implementations improperly do things like convert from a count to a byte size by multiplying by sizeof(the type being passed) and store the result in an int, where it won't fit. We've harassed the MPICH folks about this, but they consider it a low priority to fix.
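
  To make the two failure modes concrete, here is a small standalone sketch; it is not PETSc or MPI library code, the names are made up, and it only mimics the arithmetic that goes wrong:

    #include <stdio.h>

    /* Stands in for an MPI routine whose "count" parameter is a plain int (reason 1). */
    static void fake_mpi_send(int count)
    {
      printf("count as seen by the library: %d\n", count);
    }

    int main(void)
    {
      /* Reason 1: a 64-bit count is implicitly converted to int at the call site,
         so the library sees a wrapped, meaningless value. */
      long long big_count = 5000000000LL;        /* larger than any 32-bit int */
      fake_mpi_send(big_count);                  /* typically prints 705032704 */

      /* Reason 2: a count that does fit in an int still breaks when the
         implementation computes count * sizeof(double) in 32-bit arithmetic. */
      int       legal_count  = 600000000;
      long long true_bytes   = (long long)legal_count * (long long)sizeof(double);
      int       stored_bytes = (int)true_bytes;  /* what a 32-bit field keeps */
      printf("true bytes %lld, stored in an int: %d\n", true_bytes, stored_bytes);
      return 0;
    }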

  In a few places where PETSc makes MPI calls we have started to be very careful: we use only PetscMPIInt for count arguments, we explicitly check that the cast from PetscInt to PetscMPIInt fits, and we generate an error if it does not. We also replace a single call to MPI_Send() and MPI_Recv() with our own routines MPILong_Send() and MPILong_Recv(), which make several calls to MPI_Send() and MPI_Recv(), each small enough for MPI to handle. For example, in MatView_MPIAIJ_Binary() we've updated the code to handle absurdly large matrices that cannot use the MPI calls directly.
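
  As a rough illustration of the chunking idea only (this is not PETSc's MPILong_Send(); the helper name and chunk size below are made up), the sender side could look like this:

    #include <mpi.h>

    #define CHUNK_DOUBLES (64 * 1024 * 1024)  /* 64M doubles per piece, well below INT_MAX */

    /* Send n doubles in several MPI_Send() calls whose counts always fit in an int. */
    static void send_in_chunks(const double *buf, long long n, int dest, int tag, MPI_Comm comm)
    {
      long long off = 0;
      while (off < n) {
        long long left  = n - off;
        int       count = (int)(left > CHUNK_DOUBLES ? CHUNK_DOUBLES : left);
        MPI_Send(buf + off, count, MPI_DOUBLE, dest, tag, comm);
        off += count;
      }
    }

  The receive side issues the same loop with MPI_Recv(), so both processes agree on the number and size of the pieces.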

  I will update the viewer and loader for MPIDense matrices to work correctly, but you will have to test it in petsc-dev (not petsc-3.1). Also, I have no machines with enough memory to do proper testing, so you will need to test the code for me.


   Barry




On Jun 3, 2011, at 9:31 AM, Simone Re wrote:

> Dear Experts,
>                I'm facing an issue when saving an MPI dense matrix.
> 
> My matrix has:
> 
> - 5085 rows
> - 737352 columns
> and the crash occurs when I run the program using 12 CPUs (for instance with 16 CPUs everything is fine).
> 
> I built my program using both mvapich2 and Intel MPI 4 and it crashes in both cases.
> 
> When I run my original program built against Intel MPI 4 I get the following.
> 
> [4]PETSC ERROR: MatView_MPIDense_Binary() line 658 in src/mat/impls/dense/mpi/mpidense.c
> [4]PETSC ERROR: MatView_MPIDense() line 780 in src/mat/impls/dense/mpi/mpidense.c
> [4]PETSC ERROR: MatView() line 717 in src/mat/interface/matrix.c
> [4]PETSC ERROR: ------------------------------------------------------------------------
> [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [4]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [4]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> [4]PETSC ERROR: to get more information on the crash.
> [4]PETSC ERROR: --------------------- Error Message ------------------------------------
> [4]PETSC ERROR: Signal received!
> [4]PETSC ERROR: ------------------------------------------------------------------------
> [4]PETSC ERROR: Petsc Release Version 3.1.0, Patch 7, Mon Dec 20 14:26:37 CST 2010
> [4]PETSC ERROR: See docs/changes/index.html for recent updates.
> [4]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [4]PETSC ERROR: See docs/index.html for manual pages.
> [4]PETSC ERROR: ------------------------------------------------------------------------
> ...
> 
> Unfortunately, when I run the sample program attached, I get the crash but I don't get the same error message.
> I've also attached:
> 
> - the error I get from the sample program (built using mvapich2)
> - configure.log
> - the command line I used to invoke the sample program
> 
> Thanks and regards,
>                Simone Re
> 
> Simone Re
> Team Leader
> Integrated EM Center of Excellence
> WesternGeco GeoSolutions
> via Celeste Clericetti 42/A
> 20133 Milano - Italy
> +39 02 . 266 . 279 . 246   (direct)
> +39 02 . 266 . 279 . 279   (fax)
> sre at slb.com
> 
> <for_petsc_team.tar.bz2>



