[petsc-users] snes/ex19 issue with nvprof

Smith, Barry F. bsmith at mcs.anl.gov
Sat Jul 6 23:29:40 CDT 2019


   err = cudaMemcpy(((Vec_Seq*)v->data)->array,varray,v->map->n*sizeof(PetscScalar),cudaMemcpyDeviceToHost);CHKERRCUDA(err);

Can you run in the debugger at the same time as using this tool? (It seems like you should be able to.) You could then check the values of v->map->n, ((Vec_Seq*)v->data)->array, and varray to see whether they are all reasonable. Is it, for example, choking on v->map->n being zero, or is one of the arrays a garbage value, or all three? Even if n is zero, cudaMemcpy may still be checking that the two arrays are valid, and they may not be, since no data needs to be copied.
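
   One possibility (a sketch, untested under the profiler): PETSc's -start_in_debugger option attaches gdb to the running process, so something like

     nvprof --profile-child-processes mpiexec -np 1 ./extest -start_in_debugger -da_refine 5 -mat_type aijcusparse -dm_vec_type cuda

   should, if nvprof tolerates the extra debugger processes, let you break in VecCUDACopyFromGPU and inspect those three values at the failure.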

  Barry

  If cudaMemcpy() cannot handle an n of 0 gracefully, for example, we may need to protect it and call it only when n > 0.
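
  A minimal sketch of such a guard (assuming nothing else in the routine depends on the copy having run):

    if (v->map->n > 0) { /* skip the copy entirely when there is nothing to move */
      err = cudaMemcpy(((Vec_Seq*)v->data)->array,varray,v->map->n*sizeof(PetscScalar),cudaMemcpyDeviceToHost);CHKERRCUDA(err);
    }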




> On Jul 6, 2019, at 10:35 PM, Xiangdong via petsc-users <petsc-users at mcs.anl.gov> wrote:
> 
> Hi Mark,
> 
> For your latest commit 9ab9e86e1e, I can get the normal run working by either commenting out all PetscCheckTypeNames in veccuda2.cu or using "-dm_vec_type cuda" instead of "-vec_type cuda".
> 
> However, even with your latest commit, I still got an error when using nvprof (different lines reported). I checked that I used the same CUDA driver and library (cuda_10.1.168_418.67_linux).
> 
>  eps at new:~/MyCodes/petsctest$ nvprof --profile-child-processes mpiexec -np 1 ./extest -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -dm_vec_type cuda
> ==8902== NVPROF is profiling process 8902, command: ./extest -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -dm_vec_type cuda
> lid velocity = 0.000106281, prandtl # = 1., grashof # = 1.
> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [0]PETSC ERROR: Error in external library
> [0]PETSC ERROR: CUDA error 700
> [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1103-g9ab9e86e1e  GIT Date: 2019-07-06 17:40:27 -0400
> [0]PETSC ERROR: ./extest on a arch-debug-origin named new by eps Sat Jul  6 23:24:58 2019
> [0]PETSC ERROR: Configure options PETSC_ARCH=arch-debug-origin --with-debugging=1 --with-mpi-dir=/home/eps/MyLocal/mpi/mpich --with-cuda=1 --download-fblaslapack
> [0]PETSC ERROR: #1 VecCUDACopyFromGPU() line 131 in /home/eps/MyLocal/petsc-latest/src/vec/vec/impls/seq/seqcuda/veccuda2.cu
> [0]PETSC ERROR: #2 VecGetArray() line 1556 in /home/eps/MyLocal/petsc-latest/src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: #3 VecGetArray2d() line 2185 in /home/eps/MyLocal/petsc-latest/src/vec/vec/interface/rvector.c
> [0]PETSC ERROR: #4 DMDAVecGetArray() line 68 in /home/eps/MyLocal/petsc-latest/src/dm/impls/da/dagetarray.c
> [0]PETSC ERROR: #5 FormInitialGuess() line 223 in /home/eps/MyCodes/petsctest/extest.c
> [0]PETSC ERROR: #6 main() line 159 in /home/eps/MyCodes/petsctest/extest.c
> [0]PETSC ERROR: PETSc Option Table entries:
> [0]PETSC ERROR: -da_refine 5
> [0]PETSC ERROR: -dm_vec_type cuda
> [0]PETSC ERROR: -ksp_monitor
> [0]PETSC ERROR: -mat_type aijcusparse
> [0]PETSC ERROR: -snes_monitor
> [0]PETSC ERROR: -snes_view
> [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
> application called MPI_Abort(MPI_COMM_WORLD, 76) - process 0
> ==8902== Profiling application: ./extest -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -dm_vec_type cuda
> ==8902== Profiling result:
> No kernels were profiled.
> No API activities were profiled.
> ==8902== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
> ======== Error: Application returned non-zero code 76
> 
> Thank you.
> 
> Best,
> Xiangdong
> 
> On Sat, Jul 6, 2019 at 5:51 PM Mark Adams <mfadams at lbl.gov> wrote:
> I'm using CUDA 10.1.105 and this looks like a logic bug and not a CUDA problem.
> 
> I added two MatSetType calls that I thought were the right thing to do. Your error is a type problem, and I test with DMPlex, so the matrix creation goes through a different code path.
> 
> You could try removing them and see if it works.
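> 
> For reference, the calls I added are along these lines (a sketch from memory, not the exact diff):
> 
>   ierr = MatSetType(mat, MATAIJCUSPARSE);CHKERRQ(ierr); /* force the CUSPARSE matrix type */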
> 
> That is really all there is in this branch at this point other than some ViennaCL stuff.
> 
> I would verify that a fresh pull from master works; I would also reconfigure. I have just pulled from master and pushed, so you can do a fresh pull.
> 
> 
> On Sat, Jul 6, 2019 at 4:07 PM Xiangdong <epscodes at gmail.com> wrote:
> Hi Mark,
> 
> Yes, I am using your branch (mark/gamg-fix-viennacl-rebased) with Petsc Development GIT revision: v3.11.3-1064-g8b74309384. I compiled PETSc with CUDA 10.1.
> 
> Which version of CUDA are you using?
> 
> Thanks.
> 
> Xiangdong
> 
> On Sat, Jul 6, 2019 at 1:43 PM Mark Adams <mfadams at lbl.gov> wrote:
> I am not able to reproduce this error. This is what I get. Are you pulling from mark/gamg-fix-viennacl-rebased?
> 
> 13:38  /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 4 -a 4 -g 1 ./ex19 -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -vec_type cuda -log_view
> lid velocity = 0.000106281, prandtl # = 1., grashof # = 1.
>   0 SNES Function norm 1.036007954337e-02 
>     0 KSP Residual norm 9.144944502871e-02 
>     1 KSP Residual norm 2.593759906204e-02 
>     2 KSP Residual norm 1.669815200495e-02 
>     3 KSP Residual norm 1.510777951698e-02 
>     4 KSP Residual norm 1.458401237884e-02 
>     5 KSP Residual norm 1.418635322926e-02 
>     6 KSP Residual norm 1.377436725003e-02 
>     7 KSP Residual norm 1.332236907186e-02 
>     8 KSP Residual norm 1.288602527920e-02 
>     9 KSP Residual norm 1.240018288138e-02 
>    10 KSP Residual norm 1.186798872492e-02 
> 
> 
> On Fri, Jul 5, 2019 at 11:07 PM Xiangdong <epscodes at gmail.com> wrote:
> Hi Mark,
> 
> After pulling your new commit, I got an error when running ex19 (even without nvprof). I get the same error with version 3.11.3, but ex19 ran fine with the old commit 64edb9194a9.
> 
> Below is the run-time error log file. By the way, which versions of CUDA does PETSc support?
> 
> Thank you.
> 
> Xiangdong
> 
> lid velocity = 0.000106281, prandtl # = 1., grashof # = 1.
>   0 SNES Function norm 1.036007954337e-02 
>     0 KSP Residual norm 9.158537890035e-02 
> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [0]PETSC ERROR: Invalid argument
> [0]PETSC ERROR: Object (seq) is not seqcuda or mpicuda
> [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [0]PETSC ERROR: Petsc Development GIT revision: v3.11.3-1064-g8b74309384  GIT Date: 2019-07-03 10:24:59 -0400
> [0]PETSC ERROR: ./extest on a arch-debug2 named new by eps Fri Jul  5 22:57:02 2019
> [0]PETSC ERROR: Configure options PETSC_ARCH=arch-debug2 --with-debugging=1 --with-mpi-dir=/home/eps/MyLocal/mpi/mpich --with-blaslapack-dir=/home/eps/MyLocal/intel/mkl --with-cuda=1
> [0]PETSC ERROR: #1 VecCUDAGetArrayRead() line 1304 in /home/eps/MyLocal/petsc-mark/src/vec/vec/impls/seq/seqcuda/veccuda2.cu
> [0]PETSC ERROR: #2 MatMultAdd_SeqAIJCUSPARSE() line 1415 in /home/eps/MyLocal/petsc-mark/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu
> [0]PETSC ERROR: #3 MatMult_SeqAIJCUSPARSE() line 1347 in /home/eps/MyLocal/petsc-mark/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu
> [0]PETSC ERROR: #4 MatMult() line 2368 in /home/eps/MyLocal/petsc-mark/src/mat/interface/matrix.c
> [0]PETSC ERROR: #5 PCApplyBAorAB() line 662 in /home/eps/MyLocal/petsc-mark/src/ksp/pc/interface/precon.c
> [0]PETSC ERROR: #6 KSP_PCApplyBAorAB() line 309 in /home/eps/MyLocal/petsc-mark/include/petsc/private/kspimpl.h
> [0]PETSC ERROR: #7 KSPGMRESCycle() line 152 in /home/eps/MyLocal/petsc-mark/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: #8 KSPSolve_GMRES() line 237 in /home/eps/MyLocal/petsc-mark/src/ksp/ksp/impls/gmres/gmres.c
> [0]PETSC ERROR: #9 KSPSolve() line 764 in /home/eps/MyLocal/petsc-mark/src/ksp/ksp/interface/itfunc.c
> [0]PETSC ERROR: #10 SNESSolve_NEWTONLS() line 225 in /home/eps/MyLocal/petsc-mark/src/snes/impls/ls/ls.c
> [0]PETSC ERROR: #11 SNESSolve() line 4430 in /home/eps/MyLocal/petsc-mark/src/snes/interface/snes.c
> [0]PETSC ERROR: #12 main() line 161 in /home/eps/MyCodes/petsctest/extest.c
> [0]PETSC ERROR: PETSc Option Table entries:
> [0]PETSC ERROR: -da_refine 5
> [0]PETSC ERROR: -ksp_monitor
> [0]PETSC ERROR: -mat_type aijcusparse
> [0]PETSC ERROR: -snes_monitor
> [0]PETSC ERROR: -snes_view
> [0]PETSC ERROR: -vec_type cuda
> [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
> application called MPI_Abort(MPI_COMM_WORLD, 62) - process 0
> 
> On Wed, Jul 3, 2019 at 10:27 AM Mark Adams <mfadams at lbl.gov> wrote:
> My branch mark/gamg-fix-viennacl-rebased fixes this problem for me. Please give it a try.
> Thanks,
> Mark
> 
> On Tue, Jul 2, 2019 at 10:41 PM Xiangdong via petsc-users <petsc-users at mcs.anl.gov> wrote:
> Hello everyone,
> 
> When I run ex19 with CUDA like this:
>  mpiexec -np 4 ./ex19 -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -vec_type cuda  -log_view
> 
> it worked fine and produced correct results.
> 
> However, when I tried to run this example with nvprof:
> nvprof --profile-child-processes mpiexec -np 4 ./extest -da_refine 5 -snes_view -snes_monitor -ksp_monitor -mat_type aijcusparse -vec_type cuda  -log_view
> 
> I got errors like:
> [3]PETSC ERROR: Error in external library
> [3]PETSC ERROR: CUDA error 700
> [3]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [3]PETSC ERROR: Petsc Release Version 3.11.2, unknown 
> [3]PETSC ERROR: ./ex19 on a arch-opt named new by eps Tue Jul  2 22:26:01 2019
> [3]PETSC ERROR: Configure options PETSC_ARCH=arch-opt --with-debugging=0 --with-mpi-dir=/home/eps/MyLocal/mpi/mpich --with-blaslapack-dir=/home/eps/MyLocal/intel/mkl --with-cuda-dir=/home/eps/MyLocal/cuda --with-hypre-dir=/home/eps/MyLocal/hypre-2.15.1/hypre-install --download-hdf5=1
> [3]PETSC ERROR: #1 VecSet_SeqCUDA() line 785 in /home/eps/MyLocal/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu
> [3]PETSC ERROR: #2 VecSet() line 547 in /home/eps/MyLocal/petsc/src/vec/vec/interface/rvector.c
> [3]PETSC ERROR: #3 VecCreate_MPICUDA() line 178 in /home/eps/MyLocal/petsc/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu
> [3]PETSC ERROR: #4 VecSetType() line 51 in /home/eps/MyLocal/petsc/src/vec/vec/interface/vecreg.c
> [3]PETSC ERROR: #5 VecCreate_CUDA() line 192 in /home/eps/MyLocal/petsc/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu
> [3]PETSC ERROR: #6 VecSetType() line 51 in /home/eps/MyLocal/petsc/src/vec/vec/interface/vecreg.c
> [3]PETSC ERROR: #7 MatCreateVecs() line 8996 in /home/eps/MyLocal/petsc/src/mat/interface/matrix.c
> [3]PETSC ERROR: #8 MatFDColoringCreate() line 482 in /home/eps/MyLocal/petsc/src/mat/matfd/fdmatrix.c
> [3]PETSC ERROR: #9 SNESComputeJacobian_DMDA() line 175 in /home/eps/MyLocal/petsc/src/snes/utils/dmdasnes.c
> [3]PETSC ERROR: #10 SNESComputeJacobian() line 2718 in /home/eps/MyLocal/petsc/src/snes/interface/snes.c
> [3]PETSC ERROR: #11 SNESSolve_NEWTONLS() line 222 in /home/eps/MyLocal/petsc/src/snes/impls/ls/ls.c
> [3]PETSC ERROR: #12 SNESSolve() line 4560 in /home/eps/MyLocal/petsc/src/snes/interface/snes.c
> [3]PETSC ERROR: #13 main() line 161 in /home/eps/MyCodes/petsctest/extest.c
> 
> The full run log is attached. 
> 
> I am using NVIDIA-SMI 418.56, Driver Version 418.56, CUDA Version 10.1.
> 
> I do not know why it runs fine without nvprof but crashes with nvprof. Any suggestions to fix this?
> 
> Thank you.
> 
> Best,
> Xiangdong
> 


