[petsc-users] Question about memory usage in Multigrid preconditioner

Thu Sep 15 01:03:39 CDT 2016

Hi Dave,

Sorry, I should have put more comment to explain the code.
The number of process in each dimension is the same: Px = Py=Pz=P. So is 
the domain size.
So if the you want to run the code for a  512^3 grid points on 16^3 
cores, you need to set "-N 512 -P 16" in the command line.
I add more comments and also fix an error in the attached code. ( The 
error only effects the accuracy of solution but not the memory usage. )

Thank you.
Frank

On 9/14/2016 9:05 PM, Dave May wrote:
>
>
> On Thursday, 15 September 2016, Dave May <dave.mayhem23 at gmail.com 
> <mailto:dave.mayhem23 at gmail.com>> wrote:
>
>
>
>     On Thursday, 15 September 2016, frank <hengjiew at uci.edu
>     <javascript:_e(%7B%7D,'cvml','hengjiew at uci.edu');>> wrote:
>
>         Hi,
>
>         I write a simple code to re-produce the error. I hope this can
>         help to diagnose the problem.
>         The code just solves a 3d poisson equation.
>
>
>     Why is the stencil width a runtime parameter?? And why is the
>     default value 2? For 7-pnt FD Laplace, you only need a stencil
>     width of 1.
>
>     Was this choice made to mimic something in the real application code?
>
>
> Please ignore - I misunderstood your usage of the param set by -P
>
>
>         I run the code on a 1024^3 mesh. The process partition is 32 *
>         32 * 32. That's when I re-produce the OOM error. Each core has
>         about 2G memory.
>         I also run the code on a 512^3 mesh with 16 * 16 * 16
>         processes. The ksp solver works fine.
>         I attached the code, ksp_view_pre's output and my petsc option
>         file.
>
>         Thank you.
>         Frank
>
>         On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>         Hi Barry,
>>
>>         I checked. On the supercomputer, I had the option
>>         "-ksp_view_pre" but it is not in file I sent you. I am sorry
>>         for the confusion.
>>
>>         Regards,
>>         Frank
>>
>>         On Friday, September 9, 2016, Barry Smith
>>         <bsmith at mcs.anl.gov> wrote:
>>
>>
>>             > On Sep 9, 2016, at 3:11 PM, frank <hengjiew at uci.edu> wrote:
>>             >
>>             > Hi Barry,
>>             >
>>             > I think the first KSP view output is from
>>             -ksp_view_pre. Before I submitted the test, I was not
>>             sure whether there would be OOM error or not. So I added
>>             both -ksp_view_pre and -ksp_view.
>>
>>               But the options file you sent specifically does NOT
>>             list the -ksp_view_pre so how could it be from that?
>>
>>                Sorry to be pedantic but I've spent too much time in
>>             the past trying to debug from incorrect information and
>>             want to make sure that the information I have is correct
>>             before thinking. Please recheck exactly what happened.
>>             Rerun with the exact input file you emailed if that is
>>             needed.
>>
>>                Barry
>>
>>             >
>>             > Frank
>>             >
>>             >
>>             > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>             >>   Why does ksp_view2.txt have two KSP views in it
>>             while ksp_view1.txt has only one KSPView in it? Did you
>>             run two different solves in the 2 case but not the one?
>>             >>
>>             >>   Barry
>>             >>
>>             >>
>>             >>
>>             >>> On Sep 9, 2016, at 10:56 AM, frank <hengjiew at uci.edu>
>>             wrote:
>>             >>>
>>             >>> Hi,
>>             >>>
>>             >>> I want to continue digging into the memory problem here.
>>             >>> I did find a work around in the past, which is to use
>>             less cores per node so that each core has 8G memory.
>>             However this is deficient and expensive. I hope to locate
>>             the place that uses the most memory.
>>             >>>
>>             >>> Here is a brief summary of the tests I did in past:
>>             >>>> Test1:   Mesh 1536*128*384  | Process Mesh 48*4*12
>>             >>> Maximum (over computational time) process memory:   
>>                    total 7.0727e+08
>>             >>> Current process memory:                              
>>                        total 7.0727e+08
>>             >>> Maximum (over computational time) space
>>             PetscMalloc()ed:  total 6.3908e+11
>>             >>> Current space PetscMalloc()ed:                      
>>                             total 1.8275e+09
>>             >>>
>>             >>>> Test2:    Mesh 1536*128*384  | Process Mesh 96*8*24
>>             >>> Maximum (over computational time) process memory:   
>>                    total 5.9431e+09
>>             >>> Current process memory:                              
>>                        total 5.9431e+09
>>             >>> Maximum (over computational time) space
>>             PetscMalloc()ed:  total 5.3202e+12
>>             >>> Current space PetscMalloc()ed:                      
>>                              total 5.4844e+09
>>             >>>
>>             >>>> Test3:    Mesh 3072*256*768  | Process Mesh 96*8*24
>>             >>>     OOM( Out Of Memory ) killer of the supercomputer
>>             terminated the job during "KSPSolve".
>>             >>>
>>             >>> I attached the output of ksp_view( the third test's
>>             output is from ksp_view_pre ), memory_view and also the
>>             petsc options.
>>             >>>
>>             >>> In all the tests, each core can access about 2G
>>             memory. In test3, there are 4223139840 non-zeros in the
>>             matrix. This will consume about 1.74M, using double
>>             precision. Considering some extra memory used to store
>>             integer index, 2G memory should still be way enough.
>>             >>>
>>             >>> Is there a way to find out which part of KSPSolve
>>             uses the most memory?
>>             >>> Thank you so much.
>>             >>>
>>             >>> BTW, there are 4 options remains unused and I don't
>>             understand why they are omitted:
>>             >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>             >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>             >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>             >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>             >>>
>>             >>>
>>             >>> Regards,
>>             >>> Frank
>>             >>>
>>             >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>             >>>>
>>             >>>> On 14 July 2016 at 01:07, frank <hengjiew at uci.edu>
>>             wrote:
>>             >>>> Hi Dave,
>>             >>>>
>>             >>>> Sorry for the late reply.
>>             >>>> Thank you so much for your detailed reply.
>>             >>>>
>>             >>>> I have a question about the estimation of the memory
>>             usage. There are 4223139840 allocated non-zeros and 18432
>>             MPI processes. Double precision is used. So the memory
>>             per process is:
>>             >>>>   4223139840 * 8bytes / 18432 / 1024 / 1024 = 1.74M ?
>>             >>>> Did I do sth wrong here? Because this seems too small.
>>             >>>>
>>             >>>> No - I totally f***ed it up. You are correct.
>>             That'll teach me for fumbling around with my iphone
>>             calculator and not using my brain. (Note that to convert
>>             to MB just divide by 1e6, not 1024^2 - although I
>>             apparently cannot convert between units correctly....)
>>             >>>>
>>             >>>> From the PETSc objects associated with the solver,
>>             It looks like it _should_ run with 2GB per MPI rank.
>>             Sorry for my mistake. Possibilities are: somewhere in
>>             your usage of PETSc you've introduced a memory leak;
>>             PETSc is doing a huge over allocation (e.g. as per our
>>             discussion of MatPtAP); or in your application code there
>>             are other objects you have forgotten to log the memory for.
>>             >>>>
>>             >>>>
>>             >>>>
>>             >>>> I am running this job on Bluewater
>>             >>>> I am using the 7 points FD stencil in 3D.
>>             >>>>
>>             >>>> I thought so on both counts.
>>             >>>>
>>             >>>> I apologize that I made a stupid mistake in
>>             computing the memory per core. My settings render each
>>             core can access only 2G memory on average instead of 8G
>>             which I mentioned in previous email. I re-run the job
>>             with 8G memory per core on average and there is no "Out
>>             Of Memory" error. I would do more test to see if there is
>>             still some memory issue.
>>             >>>>
>>             >>>> Ok. I'd still like to know where the memory was
>>             being used since my estimates were off.
>>             >>>>
>>             >>>>
>>             >>>> Thanks,
>>             >>>>   Dave
>>             >>>>
>>             >>>> Regards,
>>             >>>> Frank
>>             >>>>
>>             >>>>
>>             >>>>
>>             >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>             >>>>> Hi Frank,
>>             >>>>>
>>             >>>>>
>>             >>>>> On 11 July 2016 at 19:14, frank <hengjiew at uci.edu>
>>             wrote:
>>             >>>>> Hi Dave,
>>             >>>>>
>>             >>>>> I re-run the test using bjacobi as the
>>             preconditioner on the coarse mesh of telescope. The Grid
>>             is 3072*256*768 and process mesh is 96*8*24. The petsc
>>             option file is attached.
>>             >>>>> I still got the "Out Of Memory" error. The error
>>             occurred before the linear solver finished one step. So I
>>             don't have the full info from ksp_view. The info from
>>             ksp_view_pre is attached.
>>             >>>>>
>>             >>>>> Okay - that is essentially useless (sorry)
>>             >>>>>
>>             >>>>> It seems to me that the error occurred when the
>>             decomposition was going to be changed.
>>             >>>>>
>>             >>>>> Based on what information?
>>             >>>>> Running with -info would give us more clues, but
>>             will create a ton of output.
>>             >>>>> Please try running the case which failed with -info
>>             >>>>>  I had another test with a grid of 1536*128*384 and
>>             the same process mesh as above. There was no error. The
>>             ksp_view info is attached for comparison.
>>             >>>>> Thank you.
>>             >>>>>
>>             >>>>>
>>             >>>>> [3] Here is my crude estimate of your memory usage.
>>             >>>>> I'll target the biggest memory hogs only to get an
>>             order of magnitude estimate
>>             >>>>>
>>             >>>>> * The Fine grid operator contains 4223139840
>>             non-zeros --> 1.8 GB per MPI rank assuming double precision.
>>             >>>>> The indices for the AIJ could amount to another 0.3
>>             GB (assuming 32 bit integers)
>>             >>>>>
>>             >>>>> * You use 5 levels of coarsening, so the other
>>             operators should represent (collectively)
>>             >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per
>>             MPI rank on the communicator with 18432 ranks.
>>             >>>>> The coarse grid should consume ~ 0.5 MB per MPI
>>             rank on the communicator with 18432 ranks.
>>             >>>>>
>>             >>>>> * You use a reduction factor of 64, making the new
>>             communicator with 288 MPI ranks.
>>             >>>>> PCTelescope will first gather a temporary matrix
>>             associated with your coarse level operator assuming a
>>             comm size of 288 living on the comm with size 18432.
>>             >>>>> This matrix will require approximately 0.5 * 64 =
>>             32 MB per core on the 288 ranks.
>>             >>>>> This matrix is then used to form a new MPIAIJ
>>             matrix on the subcomm, thus require another 32 MB per rank.
>>             >>>>> The temporary matrix is now destroyed.
>>             >>>>>
>>             >>>>> * Because a DMDA is detected, a permutation matrix
>>             is assembled.
>>             >>>>> This requires 2 doubles per point in the DMDA.
>>             >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>             >>>>> Thus the permutation matrix will require < 1 MB per
>>             MPI rank on the sub-comm.
>>             >>>>>
>>             >>>>> * Lastly, the matrix is permuted. This uses
>>             MatPtAP(), but the resulting operator will have the same
>>             memory footprint as the unpermuted matrix (32 MB). At any
>>             stage in PCTelescope, only 2 operators of size 32 MB are
>>             held in memory when the DMDA is provided.
>>             >>>>>
>>             >>>>> From my rough estimates, the worst case memory foot
>>             print for any given core, given your options is approximately
>>             >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
>>             >>>>> This is way below 8 GB.
>>             >>>>>
>>             >>>>> Note this estimate completely ignores:
>>             >>>>> (1) the memory required for the restriction operator,
>>             >>>>> (2) the potential growth in the number of non-zeros
>>             per row due to Galerkin coarsening (I wished
>>             -ksp_view_pre reported the output from MatView so we
>>             could see the number of non-zeros required by the coarse
>>             level operators)
>>             >>>>> (3) all temporary vectors required by the CG
>>             solver, and those required by the smoothers.
>>             >>>>> (4) internal memory allocated by MatPtAP
>>             >>>>> (5) memory associated with IS's used within PCTelescope
>>             >>>>>
>>             >>>>> So either I am completely off in my estimates, or
>>             you have not carefully estimated the memory usage of your
>>             application code. Hopefully others might examine/correct
>>             my rough estimates
>>             >>>>>
>>             >>>>> Since I don't have your code I cannot access the
>>             latter.
>>             >>>>> Since I don't have access to the same machine you
>>             are running on, I think we need to take a step back.
>>             >>>>>
>>             >>>>> [1] What machine are you running on? Send me a URL
>>             if its available
>>             >>>>>
>>             >>>>> [2] What discretization are you using? (I am
>>             guessing a scalar 7 point FD stencil)
>>             >>>>> If it's a 7 point FD stencil, we should be able to
>>             examine the memory usage of your solver configuration
>>             using a standard, light weight existing PETSc example,
>>             run on your machine at the same scale.
>>             >>>>> This would hopefully enable us to correctly
>>             evaluate the actual memory usage required by the solver
>>             configuration you are using.
>>             >>>>>
>>             >>>>> Thanks,
>>             >>>>>   Dave
>>             >>>>>
>>             >>>>>
>>             >>>>> Frank
>>             >>>>>
>>             >>>>>
>>             >>>>>
>>             >>>>>
>>             >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>             >>>>>>
>>             >>>>>> On Saturday, 9 July 2016, frank <hengjiew at uci.edu>
>>             wrote:
>>             >>>>>> Hi Barry and Dave,
>>             >>>>>>
>>             >>>>>> Thank both of you for the advice.
>>             >>>>>>
>>             >>>>>> @Barry
>>             >>>>>> I made a mistake in the file names in last email.
>>             I attached the correct files this time.
>>             >>>>>> For all the three tests, 'Telescope' is used as
>>             the coarse preconditioner.
>>             >>>>>>
>>             >>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh:
>>             48*4*12
>>             >>>>>> Part of the memory usage: Vector   125           
>>             124 3971904     0.
>>             >>>>>>                   Matrix   101 101      9462372  0
>>             >>>>>>
>>             >>>>>> == Test2: Grid: 1536*128*384,   Process Mesh: 96*8*24
>>             >>>>>> Part of the memory usage: Vector   125           
>>             124 681672     0.
>>             >>>>>>                   Matrix   101 101      1462180  0.
>>             >>>>>>
>>             >>>>>> In theory, the memory usage in Test1 should be 8
>>             times of Test2. In my case, it is about 6 times.
>>             >>>>>>
>>             >>>>>> == Test3: Grid: 3072*256*768,   Process Mesh:
>>             96*8*24. Sub-domain per process: 32*32*32
>>             >>>>>> Here I get the out of memory error.
>>             >>>>>>
>>             >>>>>> I tried to use -mg_coarse jacobi. In this way, I
>>             don't need to set -mg_coarse_ksp_type and
>>             -mg_coarse_pc_type explicitly, right?
>>             >>>>>> The linear solver didn't work in this case. Petsc
>>             output some errors.
>>             >>>>>>
>>             >>>>>> @Dave
>>             >>>>>> In test3, I use only one instance of 'Telescope'.
>>             On the coarse mesh of 'Telescope', I used LU as the
>>             preconditioner instead of SVD.
>>             >>>>>> If my set the levels correctly, then on the last
>>             coarse mesh of MG where it calls 'Telescope', the
>>             sub-domain per process is 2*2*2.
>>             >>>>>> On the last coarse mesh of 'Telescope', there is
>>             only one grid point per process.
>>             >>>>>> I still got the OOM error. The detailed petsc
>>             option file is attached.
>>             >>>>>>
>>             >>>>>> Do you understand the expected memory usage for
>>             the particular parallel LU implementation you are using?
>>             I don't (seriously). Replace LU with bjacobi and re-run
>>             this test. My point about solver debugging is still valid.
>>             >>>>>>
>>             >>>>>> And please send the result of KSPView so we can
>>             see what is actually used in the computations
>>             >>>>>>
>>             >>>>>> Thanks
>>             >>>>>>   Dave
>>             >>>>>>
>>             >>>>>>
>>             >>>>>> Thank you so much.
>>             >>>>>>
>>             >>>>>> Frank
>>             >>>>>>
>>             >>>>>>
>>             >>>>>>
>>             >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>             >>>>>> On Jul 6, 2016, at 4:19 PM, frank
>>             <hengjiew at uci.edu> wrote:
>>             >>>>>>
>>             >>>>>> Hi Barry,
>>             >>>>>>
>>             >>>>>> Thank you for you advice.
>>             >>>>>> I tried three test. In the 1st test, the grid is
>>             3072*256*768 and the process mesh is 96*8*24.
>>             >>>>>> The linear solver is 'cg' the preconditioner is
>>             'mg' and 'telescope' is used as the preconditioner at the
>>             coarse mesh.
>>             >>>>>> The system gives me the "Out of Memory" error
>>             before the linear system is completely solved.
>>             >>>>>> The info from '-ksp_view_pre' is attached. I seems
>>             to me that the error occurs when it reaches the coarse mesh.
>>             >>>>>>
>>             >>>>>> The 2nd test uses a grid of 1536*128*384 and
>>             process mesh is 96*8*24. The 3rd                        
>>                              test uses the same grid but a different
>>             process mesh 48*4*12.
>>             >>>>>>     Are you sure this is right? The total matrix
>>             and vector memory usage goes from 2nd test
>>             >>>>>>                Vector  384            383     
>>             8,193,712     0.
>>             >>>>>>                Matrix  103            103   
>>              11,508,688     0.
>>             >>>>>> to 3rd test
>>             >>>>>>               Vector   384           383     
>>             1,590,520     0.
>>             >>>>>>                Matrix  103            103     
>>             3,508,664     0.
>>             >>>>>> that is the memory usage got smaller but if you
>>             have only 1/8th the processes and the same grid it should
>>             have gotten about 8 times bigger. Did you maybe cut the
>>             grid by a factor of 8 also? If so that still doesn't
>>             explain it because the memory usage changed by a factor
>>             of 5 something for the vectors and 3 something for the
>>             matrices.
>>             >>>>>>
>>             >>>>>>
>>             >>>>>> The linear solver and petsc options in 2nd and 3rd
>>             tests are the same in 1st test. The linear solver works
>>             fine in both test.
>>             >>>>>> I attached the memory usage of the 2nd and 3rd
>>             tests. The memory info is from the option '-log_summary'.
>>             I tried to use '-momery_info' as you suggested, but in my
>>             case petsc treated it as an unused option. It output
>>             nothing about the memory. Do I need to add sth to my code
>>             so I can use '-memory_info'?
>>             >>>>>>     Sorry, my mistake the option is -memory_view
>>             >>>>>>
>>             >>>>>>    Can you run the one case with -memory_view and
>>             -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't
>>             iterate forever) to see how much memory is used without
>>             the telescope? Also run case 2 the same way.
>>             >>>>>>
>>             >>>>>>    Barry
>>             >>>>>>
>>             >>>>>>
>>             >>>>>>
>>             >>>>>> In both tests the memory usage is not large.
>>             >>>>>>
>>             >>>>>> It seems to me that it might be the 'telescope' 
>>             preconditioner that allocated a lot of memory and caused
>>             the error in the 1st test.
>>             >>>>>> Is there is a way to show how much memory it
>>             allocated?
>>             >>>>>>
>>             >>>>>> Frank
>>             >>>>>>
>>             >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>             >>>>>>    Frank,
>>             >>>>>>
>>             >>>>>>      You can run with -ksp_view_pre to have it
>>             "view" the KSP before the solve so hopefully it gets that
>>             far.
>>             >>>>>>
>>             >>>>>>       Please run the problem that does fit with
>>             -memory_info when the problem completes it will show the
>>             "high water mark" for PETSc allocated memory and total
>>             memory used. We first want to look at these numbers to
>>             see if it is using more memory than you expect. You could
>>             also run with say half the grid spacing to see how the
>>             memory usage scaled with the increase in grid points.
>>             Make the runs also with -log_view and send all the output
>>             from these options.
>>             >>>>>>
>>             >>>>>>     Barry
>>             >>>>>>
>>             >>>>>> On Jul 5, 2016, at 5:23 PM, frank
>>             <hengjiew at uci.edu> wrote:
>>             >>>>>>
>>             >>>>>> Hi,
>>             >>>>>>
>>             >>>>>> I am using the CG ksp solver and Multigrid
>>             preconditioner  to solve a linear system in parallel.
>>             >>>>>> I chose to use the 'Telescope' as the
>>             preconditioner on the coarse mesh for its good performance.
>>             >>>>>> The petsc options file is attached.
>>             >>>>>>
>>             >>>>>> The domain is a 3d box.
>>             >>>>>> It works well when the grid is  1536*128*384 and
>>             the process mesh is 96*8*24. When I double the size of
>>             grid and                                keep the same
>>             process mesh and petsc options, I get an "out of memory"
>>             error from the super-cluster I am using.
>>             >>>>>> Each process has access to at least 8G memory,
>>             which should be more than enough for my application. I am
>>             sure that all the other parts of my code( except the
>>             linear solver ) do not use much memory. So I doubt if
>>             there is something wrong with the linear solver.
>>             >>>>>> The error occurs before the linear system is
>>             completely solved so I don't have the info from ksp view.
>>             I am not able to re-produce the error with a smaller
>>             problem either.
>>             >>>>>> In addition,  I tried to use the block jacobi as
>>             the preconditioner with the same grid and same
>>             decomposition. The linear solver runs extremely slow but
>>             there is no memory error.
>>             >>>>>>
>>             >>>>>> How can I diagnose what exactly cause the error?
>>             >>>>>> Thank you so much.
>>             >>>>>>
>>             >>>>>> Frank
>>             >>>>>> <petsc_options.txt>
>>             >>>>>>
>>             <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>             >>>>>>
>>             >>>>>
>>             >>>>
>>             >>>
>>             <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
>>             >
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20160914/0bb85b33/attachment-0001.html>
-------------- next part --------------
! Purpose: Test ksp solver with a 3d poisson eqn.
!
! Discrip: The domain is a periodic cube. The exact solution is sin(x)*sin(y)*sin(z).
!          The eqn is discretized with 2nd order central difference.
!          The rhs of the eqn would be -3*sin(x)*(y)*sin(z) * (-h**2), where h is the space step.
!          In the matrix, the diagram element is 6 and the other 6 non-zero elements in the row is -1.
!
! Usage:   The length of the cube is set by command line input.
!          The number of process in each dimension is the same and also set by command line input.
!          To run the code on 512^3 grid points and 16^3 cores: 
!                 mpirun -n 4096 ./test_ksp.exe -N 512 -P 16
!          Petsc options are passed in a file "pestc_options.txt" in the working directory.
!
! Author:  Frank
!          hengjiew at uci.edu
!
! Date:    09/14/2016

PROGRAM test_ksp

#include <petsc/finclude/petscdmdef.h> 
#include <petsc/finclude/petscmatdef.h>
#include <petsc/finclude/petscvecdef.h>
#include <petsc/finclude/petsckspdef.h>

  USE petscdmda
  USE petscsys

  IMPLICIT NONE

  INTEGER, PARAMETER :: pr = 8
  CHARACTER(*), PARAMETER :: petsc_options_file="petsc_options.txt"

  DM                :: decomp
  DMBoundaryType    :: BType(3)
  INTEGER           :: N                          ! length of cube 
  integer           :: P                          ! process # in each dimension
  INTEGER           :: is, js ,ks, nxl, nyl, nzl  ! subdomain index
  INTEGER           :: rank
  INTEGER           :: i, j, k
  REAL(pr)          :: h                          ! space step
  Vec               :: x,b                        ! solution and rhs
  REAL(pr), POINTER :: ptr_vec(:,:,:) => NULL()   ! pointer to petsc vector
  Mat               :: A
  MatNullSpace      :: nullspace
  MatStencil        :: row(4), col(4,7)
  PetscInt          :: i1 = 1, i7 = 7
  REAL(pr)          :: value(7)
  KSP               :: ksp
  PetscErrorCode    :: ierr
  REAL(pr)          :: q2 = 0.5_pr
  REAL(pr)          :: erri = 0, err1 = 0
  LOGICAL           :: is_converged, is_set

  CALL PetscInitialize( petsc_options_file, ierr ) 
  CALL MPI_Comm_rank( PETSC_COMM_WORLD, rank, ierr )

  ! default values
  N     = 64
  P     = 2
  BType = DM_BOUNDARY_PERIODIC

  ! read domain size and process# from command line
  CALL PetscOptionsGetInt( PETSC_NULL_OBJECT, PETSC_NULL_CHARACTER, '-N', N, is_set, ierr )
  CALL PetscOptionsGetInt( PETSC_NULL_OBJECT, PETSC_NULL_CHARACTER, '-P', P, is_set, ierr )

  CALL DMDACreate3d( PETSC_COMM_WORLD, BType(1), BType(2), BType(3), &
                   & DMDA_STENCIL_STAR, N, N, N, P, P, P, 1, 1, &
                   & PETSC_NULL_INTEGER,  PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
                   & decomp, ierr ) 
  CALL DMDAGetCorners( decomp, is, js, ks, nxl, nyl, nzl, ierr) 

  !WRITE(*,'(7I6)') rank, is, js, ks, nxl, nyl, nzl

  ! create vector for rhs and solution
  CALL DMCreateGlobalVector( decomp, b, ierr )
  CALL VecDuplicate( b, x, ierr )
  CALL VecSet( x, 0.0_pr, ierr )

  ! create matrix
  CALL DMSetMatType( decomp, MATAIJ, ierr )
  CALL DMCreateMatrix( decomp, A, ierr )

  ! create ksp solver
  CALL KSPCreate( PETSC_COMM_WORLD, ksp, ierr ) 
  CALL KSPSetDM( ksp, decomp, ierr )
  CALL KSPSetDMActive( ksp, PETSC_FALSE, ierr )
  CALL KSPSetFromOptions( ksp, ierr )

  ! create nullspace, periodic bc give the matrix a nullspace 
  CALL MatNullSpaceCreate( PETSC_COMM_WORLD, PETSC_TRUE, PETSC_NULL_INTEGER, &
                         & PETSC_NULL_INTEGER, nullspace, ierr )

  ! set rhs
  CALL DMDAVecGetArrayF90( decomp, b, ptr_vec, ierr )
  DO i = is, is+nxl-1
    DO j = js, js+nyl-1
      DO k = ks, ks+nzl-1
        ptr_vec(i,j,k) = -3 * SIN((i+q2)*h) * SIN((j+q2)*h) * SIN((k+q2)*h)
      END DO
    END DO
  END DO
  h       = 1.0_pr / N
  ptr_vec = - h**2 * ptr_vec

  ! assembly rhs
  CALL VecAssemblyBegin( b, ierr )
  CALL VecAssemblyEnd( b, ierr )
  CALL DMDAVecRestoreArrayF90( decomp, b, ptr_vec, ierr )

  ! set matrix
  DO i = is, is+nxl-1
    DO j = js, js+nyl-1
      DO k = ks, ks+nzl-1

        ! row index of current point
        row(MatStencil_i) = i
        row(MatStencil_j) = j
        row(MatStencil_k) = k   

        ! column index of current point and its neighbors
        col(MatStencil_i,1) = i;    col(MatStencil_j,1) = j;    col(MatStencil_k,1) = k
        col(MatStencil_i,2) = i-1;  col(MatStencil_j,2) = j;    col(MatStencil_k,2) = k     
        col(MatStencil_i,3) = i+1;  col(MatStencil_j,3) = j;    col(MatStencil_k,3) = k 
        col(MatStencil_i,4) = i;    col(MatStencil_j,4) = j-1;  col(MatStencil_k,4) = k 
        col(MatStencil_i,5) = i;    col(MatStencil_j,5) = j+1;  col(MatStencil_k,5) = k 
        col(MatStencil_i,6) = i;    col(MatStencil_j,6) = j;    col(MatStencil_k,6) = k-1 
        col(MatStencil_i,7) = i;    col(MatStencil_j,7) = j;    col(MatStencil_k,7) = k+1 

        ! set values at current point and its neighbors
        value(1 ) =  6.0_pr
        value(2:) = -1.0_pr

        ! set the matrix's elements
         CALL MatSetValuesStencil( A, i1, row, i7, col, value, INSERT_VALUES, ierr ) 

      END DO
    END DO
  END DO

  ! assembly the matrix
  CALL MatAssemblyBegin( A, MAT_FINAL_ASSEMBLY, ierr )
  CALL MatAssemblyEnd( A, MAT_FINAL_ASSEMBLY, ierr )

  ! remove the nullspace
  CALL MatSetNullSpace( A, nullspace, ierr )
  CALL MatNullSpaceRemove( nullspace, b, PETSC_NULL_OBJECT, ierr )

  ! Solve system
  CALL KSPSetOperators( ksp, A, A, ierr )
  !CALL KSPSetReusePreconditioner( ksp, reuse_pc, ierr )
  CALL KSPSolve( ksp, b, x, ierr )
  !CALL KSPGetIterationNumber( ksp, niter, ierr ) 
  !CALL KSPGetConvergedReason( ksp, is_converged, ierr )

  ! get solution
  CALL VecAssemblyBegin( x, ierr )
  CALL VecAssemblyEnd( x, ierr )
  CALL DMDAVecGetArrayF90( decomp, x, ptr_vec, ierr )

  ! check the error
  DO i = is, is+nxl-1
    DO j = js, js+nyl-1
      DO k = ks, ks+nzl-1
        erri = MAX(erri, ABS(ptr_vec(i,j,k) - SIN((i+q2)*h)*SIN((j+q2)*h)*SIN((k+q2)*h)))
        err1 = err1 + ABS(ptr_vec(i,j,k) - SIN((i+q2)*h)*SIN((j+q2)*h)*SIN((k+q2)*h))
      END DO
    END DO
  END DO
  CALL DMDAVecRestoreArrayF90( decomp, x, ptr_vec, ierr )

  CALL MPI_Allreduce( erri, MPI_IN_PLACE, 1, MPI_REAL8, MPI_MAX, PETSC_COMM_WORLD, ierr )
  CALL MPI_Allreduce( err1, MPI_IN_PLACE, 1, MPI_REAL8, MPI_SUM, PETSC_COMM_WORLD, ierr )

  IF( rank == 0 ) THEN 
    PRINT*, 'norm1 error: ', err1 / N**3
    PRINT*, 'norm inf error:', erri
  END IF

  CALL VecDestroy( x, ierr )
  CALL VecDestroy( b, ierr ) 
  CALL MatDestroy( A, ierr )
  CALL KSPDestroy( ksp, ierr )

END PROGRAM test_ksp