[petsc-dev] cuda with kokkos-cuda build fail

Mark Adams mfadams at lbl.gov
Fri Jan 7 14:19:53 CST 2022


And the nvidia build (also with Kokkos and CUDA) is OK:

12:16 nid002872 main= perlmutter:~/petsc$ make
PETSC_DIR=/global/homes/m/madams/petsc
PETSC_ARCH=arch-perlmutter-opt-nvidia-kokkos-cuda -f gmakefile test
search='snes_tutorials-ex19_cuda'
Using MAKEFLAGS: -- search=snes_tutorials-ex19_cuda
PETSC_ARCH=arch-perlmutter-opt-nvidia-kokkos-cuda
PETSC_DIR=/global/homes/m/madams/petsc
          CC
arch-perlmutter-opt-nvidia-kokkos-cuda/tests/snes/tutorials/ex19.o
     CLINKER
arch-perlmutter-opt-nvidia-kokkos-cuda/tests/snes/tutorials/ex19
        TEST
arch-perlmutter-opt-nvidia-kokkos-cuda/tests/counts/snes_tutorials-ex19_cuda.counts
 ok snes_tutorials-ex19_cuda
 ok diff-snes_tutorials-ex19_cuda
12:17 nid002872 main= perlmutter:~/petsc$
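
For comparing runs like this across builds, the harness lines can be tallied with a small script. This is only a sketch, not part of PETSc: the function name `summarize` and the pass/fail/skip labels are made up for illustration, and it assumes the result lines look like the ones in this thread (`ok <name>`, `not ok <name> # ...`, `ok <name> # SKIP ...`). A `not ok` is treated as final, because a failing run in this thread also prints a later `ok ... # SKIP` line for the same test name.

```python
import re

# Matches harness result lines such as
#   " ok snes_tutorials-ex19_cuda"
#   "not ok snes_tutorials-ex19_cuda # Error code: 97"
# Leading whitespace and mail quote markers ("> ") are tolerated.
_RESULT = re.compile(r"^[>\s]*(not ok|ok)\s+(\S+)(?:\s+#\s*(.*?))?\s*$")


def summarize(log_text):
    """Return {test_name: 'pass' | 'fail' | 'skip'} for harness output."""
    results = {}
    for line in log_text.splitlines():
        m = _RESULT.match(line)
        if not m:
            continue  # compiler output, error text, "# retrying ..." lines
        status, name, note = m.groups()
        if status == "not ok":
            # A failure overrides anything recorded earlier for this test.
            results[name] = "fail"
        elif note and note.upper().startswith("SKIP"):
            # Do not let a later SKIP line hide an earlier failure.
            results.setdefault(name, "skip")
        else:
            results.setdefault(name, "pass")
    return results


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize.py < make_test.log
    for name, verdict in sorted(summarize(sys.stdin.read()).items()):
        print(f"{verdict:5s} {name}")
```

Piping the output of the `make ... test` command above into the script prints one verdict per test name.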



On Fri, Jan 7, 2022 at 2:23 PM Mark Adams <mfadams at lbl.gov> wrote:

> And it looks universal (ex19 fails the same way, so it is not specific to
> the Landau test):
>
> 11:21 nid001544 main= perlmutter:~/petsc$ make
> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda -f gmakefile test
> search='snes_tutorials-ex19_cuda'
> Using MAKEFLAGS: -- search=snes_tutorials-ex19_cuda
> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>           CC
> arch-perlmutter-opt-gcc-kokkos-cuda/tests/snes/tutorials/ex19.o
>      CLINKER arch-perlmutter-opt-gcc-kokkos-cuda/tests/snes/tutorials/ex19
>         TEST
> arch-perlmutter-opt-gcc-kokkos-cuda/tests/counts/snes_tutorials-ex19_cuda.counts
> # retrying snes_tutorials-ex19_cuda
> not ok snes_tutorials-ex19_cuda # Error code: 97
> # lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
> # [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> # [0]PETSC ERROR: GPU error
> # [0]PETSC ERROR: cuBLAS error 13 (CUBLAS_STATUS_EXECUTION_FAILED)
> # [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> # [0]PETSC ERROR: Petsc Development GIT revision: v3.16.3-511-g96172674f3
>  GIT Date: 2022-01-06 23:44:32 +0000
> # [0]PETSC ERROR:
> /global/u2/m/madams/petsc/arch-perlmutter-opt-gcc-kokkos-cuda/tests/snes/tutorials/runex19_cuda/../ex19
> on a arch-perlmutter-opt-gcc-kokkos-cuda named nid001544 by madams Fri Jan
>  7 11:22:35 2022
> # [0]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
> -rdynamic -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4"
> --with-cc=cc --with-cxx=CC --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
> --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
> --with-zlib=1 --download-kokkos --download-kokkos-kernels
> --with-kokkos-kernels-tpl=0 --with-make-np=8
> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
> # [0]PETSC ERROR: #1 VecNorm_SeqCUDA() at
> /global/u2/m/madams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:994
> # [0]PETSC ERROR: #2 VecNorm() at
> /global/u2/m/madams/petsc/src/vec/vec/interface/rvector.c:228
> # [0]PETSC ERROR: #3 SNESSolve_NEWTONLS() at
> /global/u2/m/madams/petsc/src/snes/impls/ls/ls.c:179
> # [0]PETSC ERROR: #4 SNESSolve() at
> /global/u2/m/madams/petsc/src/snes/interface/snes.c:4810
> # [0]PETSC ERROR: #5 main() at
> /global/u2/m/madams/petsc/src/snes/tutorials/ex19.c:159
> # [0]PETSC ERROR: PETSc Option Table entries:
> # [0]PETSC ERROR: -check_pointer_intensity 0
> # [0]PETSC ERROR: -dm_mat_type aijcusparse
> # [0]PETSC ERROR: -dm_vec_type cuda
> # [0]PETSC ERROR: -error_output_stdout
> # [0]PETSC ERROR: -ksp_type fgmres
> # [0]PETSC ERROR: -malloc_dump
> # [0]PETSC ERROR: -nox
> # [0]PETSC ERROR: -nox_warning
> # [0]PETSC ERROR: -pc_type none
> # [0]PETSC ERROR: -snes_monitor_short
> # [0]PETSC ERROR: -snes_rtol 1.e-5
> # [0]PETSC ERROR: -use_gpu_aware_mpi 0
> # [0]PETSC ERROR: ----------------End of Error Message -------send entire
> error message to petsc-maint at mcs.anl.gov----------
> # MPICH Notice [Rank 0] [job id 1041592.1] [Fri Jan  7 11:22:36 2022]
> [nid001544] - Abort(97) (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 97) - process 0
> #
> # Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
> # srun: error: nid001544: task 0: Exited with exit code 97
> # srun: launch/slurm: _step_signal: Terminating StepId=1041592.1
>  ok snes_tutorials-ex19_cuda # SKIP Command failed so no diff
>
>
>
> On Fri, Jan 7, 2022 at 2:20 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Well, this sure looks like it is deterministic:
>>
>> 11:00 130 nid002645 main= perlmutter:~/petsc$ make
>> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda -f gmakefile test
>> search='ts_utils_dmplexlandau_tutorials-ex1_cuda'
>> Using MAKEFLAGS: -- search=ts_utils_dmplexlandau_tutorials-ex1_cuda
>> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>>           CC
>> arch-perlmutter-opt-gcc-kokkos-cuda/tests/ts/utils/dmplexlandau/tutorials/ex1.o
>>      CLINKER
>> arch-perlmutter-opt-gcc-kokkos-cuda/tests/ts/utils/dmplexlandau/tutorials/ex1
>>         TEST
>> arch-perlmutter-opt-gcc-kokkos-cuda/tests/counts/ts_utils_dmplexlandau_tutorials-ex1_cuda.counts
>> # retrying ts_utils_dmplexlandau_tutorials-ex1_cuda
>> not ok ts_utils_dmplexlandau_tutorials-ex1_cuda # Error code: 97
>> # masses:        e= 9.109e-31; ions in proton mass units:    2.000e+00
>>  4.000e+00 ...
>> # charges:       e=-1.602e-19; charges in elementary units:  1.000e+00
>>  1.800e+01
>> # n:             e:  1.000e+00                           i:  1.000e+00
>>  1.000e-05
>> # thermal T (K): e= 5.802e+07 i= 5.802e+07  5.802e+06. v_0= 2.965e+07 (
>> 9.892e-02c) n_0= 1.000e+20 t_0= 6.470e-05, classical, Intuitive, 1 batched
>> # Domain radius (AMR levels) grid 0: 5. (2) , 1:  8.252e-02 (1)
>> # 0) FormLandau 352 IPs, 22 cells total, Nb=16, Nq=16, dim=2, Tab: Nb=16
>> Nf=3 Np=16 cdim=2 N=324
>> #  0) species-0: charge density= -1.6024538233648e+01 z-momentum=
>> -1.7133689250463e-19 energy=  1.2009868166183e+05
>> #  0) species-1: charge density=  1.6068752193414e+01 z-momentum=
>> -5.3757901114069e-19 energy=  1.1752123333205e+05
>> #  0) species-2: charge density=  2.7152642046328e-04 z-momentum=
>>  3.4016005887213e-21 energy=  1.1701658725855e+00
>> #  0) Total: charge density=  4.4485486187274e-02, momentum=
>> -7.0551430305659e-19, energy=  2.3762108515975e+05 (m_i[0]/m_e = 3670.94,
>> 14 cells on electron grid)
>> # 0 TS dt 0.1 time 0.
>> # [0]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> # [0]PETSC ERROR: GPU error
>> # [0]PETSC ERROR: cuBLAS error 13 (CUBLAS_STATUS_EXECUTION_FAILED)
>> # [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>> shooting.
>> # [0]PETSC ERROR: Petsc Development GIT revision: v3.16.3-511-g96172674f3
>>  GIT Date: 2022-01-06 23:44:32 +0000
>> # [0]PETSC ERROR:
>> /global/u2/m/madams/petsc/arch-perlmutter-opt-gcc-kokkos-cuda/tests/ts/utils/dmplexlandau/tutorials/runex1_cuda/../ex1
>> on a arch-perlmutter-opt-gcc-kokkos-cuda named nid002645 by madams Fri Jan
>>  7 11:15:42 2022
>> # [0]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
>> -rdynamic -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4"
>> --with-cc=cc --with-cxx=CC --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>> --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
>> --with-zlib=1 --download-kokkos --download-kokkos-kernels
>> --with-kokkos-kernels-tpl=0 --with-make-np=8
>> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>> # [0]PETSC ERROR: #1 VecNorm_SeqCUDA() at
>> /global/u2/m/madams/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:994
>> # [0]PETSC ERROR: #2 VecNorm() at
>> /global/u2/m/madams/petsc/src/vec/vec/interface/rvector.c:228
>> # [0]PETSC ERROR: #3 SNESSolve_NEWTONLS() at
>> /global/u2/m/madams/petsc/src/snes/impls/ls/ls.c:179
>> # [0]PETSC ERROR: #4 SNESSolve() at
>> /global/u2/m/madams/petsc/src/snes/interface/snes.c:4810
>> # [0]PETSC ERROR: #5 TSStep_ARKIMEX() at
>> /global/u2/m/madams/petsc/src/ts/impls/arkimex/arkimex.c:845
>> # [0]PETSC ERROR: #6 TSStep() at
>> /global/u2/m/madams/petsc/src/ts/interface/ts.c:3572
>> # [0]PETSC ERROR: #7 TSSolve() at
>> /global/u2/m/madams/petsc/src/ts/interface/ts.c:3971
>> # [0]PETSC ERROR: #8 main() at
>> /global/u2/m/madams/petsc/src/ts/utils/dmplexlandau/tutorials/ex1.c:45
>> # [0]PETSC ERROR: PETSc Option Table entries:
>> # [0]PETSC ERROR: -check_pointer_intensity 0
>> # [0]PETSC ERROR: -dm_landau_amr_levels_max 2,1
>> # [0]PETSC ERROR: -dm_landau_device_type cuda
>> # [0]PETSC ERROR: -dm_landau_ion_charges 1,18
>>
>> On Fri, Jan 7, 2022 at 1:52 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Fri, Jan 7, 2022 at 11:17 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> These are cuda/cusparse tests. The Kokkos versions are fine and
>>>> cusparse w/o a Kokkos build is fine.
>>>>
>>>> I do have some #ifdefs in the code. Maybe something snuck into the
>>>> #ifdef KOKKOS, but I can't imagine what that could even be.
>>>>
>>>> I have had problems with very large "cuda" jobs (on Summit with 21 MPI
>>>> processes per GPU) running out of "resources" with a Kokkos build, which
>>>> went away with a pure CUDA build (i.e., w/o Kokkos), but these are tiny tests.
>>>>
>>> If Kokkos is initialized on MPI ranks, then each rank will consume
>>> resources on GPU.
>>>
>>>>
>>>> I will try it again.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Fri, Jan 7, 2022 at 12:06 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>> wrote:
>>>>
>>>>> It failed when you did not even pass any vec/mat kokkos options?  It
>>>>> does not make sense and you need to double check that.
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>> On Thu, Jan 6, 2022 at 9:33 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>> I seem to have a regression when using aijcusparse in a Kokkos build.
>>>>>> It's OK with a straight CUDA build.
>>>>>>
>>>>>> # [0]PETSC ERROR: --------------------- Error Message
>>>>>> --------------------------------------------------------------
>>>>>> # [0]PETSC ERROR: GPU error
>>>>>> # [0]PETSC ERROR: cuBLAS error 13 (CUBLAS_STATUS_EXECUTION_FAILED)
>>>>>> # [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble
>>>>>> shooting.
>>>>>> # [0]PETSC ERROR: Petsc Development GIT revision:
>>>>>> v3.16.3-511-g96172674f3  GIT Date: 2022-01-06 23:44:32 +0000
>>>>>> # [0]PETSC ERROR:
>>>>>> /global/u2/m/madams/petsc_install/petsc/arch-perlmutter-opt-gcc-kokkos-cuda/tests/ts/utils/dmplexlandau/tutorials/runex1_cuda/../ex1
>>>>>> on a arch-perlmutter-opt-gcc-kokkos-cuda named nid003188 by madams Thu Jan
>>>>>>  6 19:29:06 2022
>>>>>> # [0]PETSC ERROR: Configure options --CFLAGS="   -g -DLANDAU_DIM=2
>>>>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CXXFLAGS=" -g -DLANDAU_DIM=2
>>>>>> -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4" --CUDAFLAGS="-g -Xcompiler
>>>>>> -rdynamic -DLANDAU_DIM=2 -DLANDAU_MAX_SPECIES=10 -DLANDAU_MAX_Q=4"
>>>>>> --with-cc=cc --with-cxx=CC --with-fc=ftn --LDFLAGS=-lmpifort_gnu_91
>>>>>> --with-cudac=/global/common/software/nersc/cos1.3/cuda/11.3.0/bin/nvcc
>>>>>> --COPTFLAGS="   -O3" --CXXOPTFLAGS=" -O3" --FOPTFLAGS="   -O3"
>>>>>> --with-debugging=0 --download-metis --download-parmetis --with-cuda=1
>>>>>> --with-cuda-arch=80 --with-mpiexec=srun --with-batch=0 --download-p4est=1
>>>>>> --with-zlib=1 --download-kokkos --download-kokkos-kernels
>>>>>> --with-kokkos-kernels-tpl=0 --with-make-np=8
>>>>>> PETSC_DIR=/global/homes/m/madams/petsc_install/petsc
>>>>>> PETSC_ARCH=arch-perlmutter-opt-gcc-kokkos-cuda
>>>>>> # [0]PETSC ERROR: #1 VecNorm_SeqCUDA() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:994
>>>>>> # [0]PETSC ERROR: #2 VecNorm() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/vec/vec/interface/rvector.c:228
>>>>>> # [0]PETSC ERROR: #3 SNESSolve_NEWTONLS() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/snes/impls/ls/ls.c:179
>>>>>> # [0]PETSC ERROR: #4 SNESSolve() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/snes/interface/snes.c:4810
>>>>>> # [0]PETSC ERROR: #5 TSStep_ARKIMEX() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/ts/impls/arkimex/arkimex.c:845
>>>>>> # [0]PETSC ERROR: #6 TSStep() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/ts/interface/ts.c:3572
>>>>>> # [0]PETSC ERROR: #7 TSSolve() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/ts/interface/ts.c:3971
>>>>>> # [0]PETSC ERROR: #8 main() at
>>>>>> /global/u2/m/madams/petsc_install/petsc/src/ts/utils/dmplexlandau/tutorials/ex1.c:45
>>>>>> # [0]PETSC ERROR: PETSc Option Table entries:
>>>>>> # [0]PETSC ERROR: -check_pointer_intensity 0
>>>>>> # [0]PETSC ERROR: -dm_landau_amr_levels_max 2,1
>>>>>> # [0]PETSC ERROR: -dm_landau_device_type cuda
>>>>>> # [0]PETSC ERROR: -dm_landau_ion_charges 1,18
>>>>>> # [0]PETSC ERROR: -dm_landau_ion_masses 2,4
>>>>>> # [0]PETSC ERROR: -dm_landau_n 1.00018,1,1e-5
>>>>>> # [0]PETSC ERROR: -dm_landau_n_0 1e20
>>>>>> # [0]PETSC ERROR: -dm_landau_num_species_grid 1,2
>>>>>> # [0]PETSC ERROR: -dm_landau_thermal_temps 5,5,.5
>>>>>> # [0]PETSC ERROR: -dm_landau_type p4est
>>>>>> # [0]PETSC ERROR: -dm_mat_type aijcusparse
>>>>>> # [0]PETSC ERROR: -dm_preallocate_only false
>>>>>> # [0]PETSC ERROR: -dm_vec_type cuda
>>>>>> # [0]PETSC ERROR: -error_output_stdout
>>>>>> # [0]PETSC ERROR: -ksp_type preonly
>>>>>> # [0]PETSC ERROR: -malloc_dump
>>>>>> # [0]PETSC ERROR: -mat_cusparse_use_cpu_solve
>>>>>> # [0]PETSC ERROR: -nox
>>>>>> # [0]PETSC ERROR: -nox_warning
>>>>>> # [0]PETSC ERROR: -pc_type lu
>>>>>> # [0]PETSC ERROR: -petscspace_degree 3
>>>>>> # [0]PETSC ERROR: -petscspace_poly_tensor 1
>>>>>> # [0]PETSC ERROR: -snes_converged_reason
>>>>>> # [0]PETSC ERROR: -snes_monitor
>>>>>> # [0]PETSC ERROR: -snes_rtol 1.e-14
>>>>>> # [0]PETSC ERROR: -snes_stol 1.e-14
>>>>>> # [0]PETSC ERROR: -ts_adapt_clip .5,1.25
>>>>>> # [0]PETSC ERROR: -ts_adapt_scale_solve_failed 0.75
>>>>>> # [0]PETSC ERROR: -ts_adapt_time_step_increase_delay 5
>>>>>> # [0]PETSC ERROR: -ts_arkimex_type 1bee
>>>>>> # [0]PETSC ERROR: -ts_dt 1.e-1
>>>>>> # [0]PETSC ERROR: -ts_max_snes_failures -1
>>>>>> # [0]PETSC ERROR: -ts_max_steps 1
>>>>>> # [0]PETSC ERROR: -ts_max_time 1
>>>>>> # [0]PETSC ERROR: -ts_monitor
>>>>>> # [0]PETSC ERROR: -ts_rtol 1e-1
>>>>>> # [0]PETSC ERROR: -ts_type arkimex
>>>>>> # [0]PETSC ERROR: -use_gpu_aware_mpi 0
>>>>>> # [0]PETSC ERROR: ----------------End of Error Message -------send
>>>>>> entire error message to petsc-maint at mcs.anl.gov----------
>>>>>>
>>>>>