[petsc-users] Code hangs when calling PetscIntView (MPI, fortran)

Jose E. Roman jroman at dsic.upv.es
Thu May 20 04:06:31 CDT 2021


If you look at the manpage https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscIntView.html you will see that PetscIntView() is collective. This means that all MPI processes must call this function, so it is forbidden to call it within an IF rank==...

Jose

> El 20 may 2021, a las 10:25, dazza simplythebest <sayosale at hotmail.com> escribió:
> 
> Dear All,
>              As part of preparing a code to call the SLEPC eigenvalue solving library,
> I am constructing a matrix in sparse CSR format row-by-row. Just for debugging 
> purposes I write out the column values for a given row, which are stored in a 
> PetscInt allocatable vector, using PetscIntView.
> 
> Everything works fine when the number of MPI processes exactly divide the
> number of rows of the matrix, and so each process owns the same number of rows.
> However, when the number of MPI processes does not exactly divide the
> number of rows of the matrix, and so each process owns a different number of rows,
> the code hangs when it reaches the line that calls PetscIntView.
> To be precise the code hangs on the final row that a process, other than root, owns.
> If I however comment out the call to PetscIntView the code completes without error,
>  and produces the correct eigenvalues (hence we are not missing a row / miswriting a row).
>    Note also that a simple direct writeout of this same array using a plain fortran command
> will write out the array without problem.
> 
> I have attached below a small code that reproduces the problem.
> For this code we have nominally assigned 200 rows to our matrix. The code runs without
> problem using 1,2,4,5,8 or 10 MPI processes, all of which precisely divide 200,
>  but will hang for 3 MPI processes for example.
> For the case of 3 MPI processes the subroutine WHOSE_ROW_IS_IT allocates the rows 
> to each process as :
>   process no       first row           last row       no. of rows
>    0                             1                     66               66
>    1                            67                   133             67
>    2                          134                   200             67
>    
> The code will hang when process 1 calls PetscIntView for its last row, row 133 for example.
> 
> One piece of additional information that may be relevant is that the code does run to completion 
>  without hanging if I comment out the final slepc/MPI finalisation command
>  CALL SlepcFinalize(ierr_pets) 
> (I of course I get ' bad termination' errors, but the otherwise the run is successful.)
>  
>  I would appreciate it if anyone has any ideas on what is going wrong!
>   Many thanks,
>                        Dan.
> 
> 
> code:
> 
>       MODULE ALL_STAB_ROUTINES
>       IMPLICIT NONE
>       CONTAINS
> 
>       SUBROUTINE WHOSE_ROW_IS_IT(ROW_NO, TOTAL_NO_ROWS, NO_PROCESSES,     &
>      &      OWNER)
> !     THIS ROUTINE ALLOCATES ROWS EVENLY BETWEEN mpi PROCESSES
> #include <slepc/finclude/slepceps.h>
>       use slepceps
>       IMPLICIT NONE
>       PetscInt, INTENT(IN) :: ROW_NO, TOTAL_NO_ROWS, NO_PROCESSES
>       PetscInt, INTENT(OUT) :: OWNER
>       PetscInt :: P, REM
> 
>       P = TOTAL_NO_ROWS / NO_PROCESSES ! NOTE INTEGER DIVISION
>       REM = TOTAL_NO_ROWS - P*NO_PROCESSES
>       IF (ROW_NO < (NO_PROCESSES - REM)*P + 1 ) THEN
>         OWNER = (ROW_NO - 1)/P ! NOTE INTEGER DIVISION
>       ELSE
>         OWNER = (  ROW_NO  +   NO_PROCESSES - REM -1 )/(P+1) ! NOTE INTEGER DIVISION
>       ENDIF    
>       END SUBROUTINE WHOSE_ROW_IS_IT
>       END MODULE ALL_STAB_ROUTINES
> 
> 
>       PROGRAM trialer
>       USE MPI
> #include <slepc/finclude/slepceps.h>
>       use slepceps
>       USE ALL_STAB_ROUTINES
>       IMPLICIT NONE
>       PetscMPIInt    rank3, total_mpi_size
>       PetscInt nl3, code,  PROC_ROW, ISTATUS, jm, N_rows,NO_A_ENTRIES
>       PetscInt, ALLOCATABLE, DIMENSION(:) :: JALOC
>       PetscInt, PARAMETER  ::  ZERO = 0 , ONE = 1, TWO = 2, THREE = 3  
>       PetscErrorCode ierr_pets
>      
> ! Initialise sleps/mpi
>       call SlepcInitialize(PETSC_NULL_CHARACTER,ierr_pets) ! note that this initialises MPI
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, total_mpi_size, ierr_pets) !! find total no of MPI processes  
>       nL3= total_mpi_size
>       call MPI_COMM_RANK(MPI_COMM_WORLD,rank3,ierr_pets) !! find my overall rank -> rank3
>       write(*,*)'Welcome: PROCESS NO , TOTAL NO. OF PROCESSES =  ',rank3, nl3
>  
>       N_rows = 200 ! NUMBER OF ROWS OF A NOTIONAL MATRIX
>       NO_A_ENTRIES = 12 ! NUMBER OF ENTRIES FOR JALOC
>      
> !     LOOP OVER ROWS      
>       do jm = 1, N_rows
>  
>       CALL whose_row_is_it(JM,  N_rows , NL3, PROC_ROW) ! FIND OUT WHICH PROCESS OWNS ROW
>       if (rank3 == PROC_ROW) then ! IF mpi PROCESS OWNS THIS ROW THEN ..
> !       ALLOCATE jaloc ARRAY AND INITIALISE
>        
>         allocate(jaloc(NO_A_ENTRIES), STAT=ISTATUS )
>         jaloc = three
>        
>        
>         WRITE(*,*)'JALOC',JALOC ! THIS SIMPLE PLOT ALWAYS WORKS
>         write(*,*)'calling PetscIntView: PROCESS NO. ROW NO.',rank3, jm
>         ! THIS CALL TO PetscIntView CAUSES CODE TO HANG WHEN E.G. total_mpi_size=3, JM=133
>         call PetscIntView(NO_A_ENTRIES,JALOC(1:NO_A_ENTRIES),              &
>      &       PETSC_VIEWER_STDOUT_WORLD, ierr_pets)
>         CHKERRA(ierr_pets)
>         deallocate(jaloc)
>       endif
>       enddo
> 
>       CALL SlepcFinalize(ierr_pets)
>       end program trialer



More information about the petsc-users mailing list