[petsc-users] [EXTERNAL] Re: Using multiple MPI ranks with COO interface crashes in some cases

Junchao Zhang junchao.zhang at gmail.com
Sun Nov 20 12:31:28 CST 2022


Hi, Mark,
 On Perlmutter, I have:

export MPICH_GPU_SUPPORT_ENABLED=1
module load cudatoolkit
module load PrgEnv-gnu
module load craype-accel-nvidia80

$ module list
Currently Loaded Modules:
  1) craype-x86-milan
  2) libfabric/1.15.0.0
  3) craype-network-ofi
  4) perftools-base/22.06.0
  5) xpmem/2.4.4-2.3_13.8__gff0e1d9.shasta
  6) gcc/11.2.0
  7) xalt/2.10.2
  8) gpu/1.0
  9) Nsight-Compute/2022.1.1
 10) Nsight-Systems/2022.2.1
 11) cudatoolkit/11.7
 12) craype/2.7.16
 13) cray-dsmml/0.2.2
 14) cray-mpich/8.1.17
 15) cray-libsci/21.08.1.2
 16) PrgEnv-gnu/8.3.3
 17) craype-accel-nvidia80
 18) cmake/3.22.0

My petsc configure options are as follows, and I can build petsc with them.
As far as I know, the only problem on Perlmutter is that Kokkos Kernels (KK)
fails to find its TPLs (third-party libraries); turning TPLs off, as in the
last option below, is a workaround. A sample configure invocation follows the
option list.

   '--with-debugging',
    '--with-cc=cc',
    '--with-cxx=CC',
    '--with-fc=ftn',
    '--download-sowing-cc=cc', # cc might be nvc
    '--CFLAGS=-g -O0',
    '--FFLAGS=-g -O0',
    '--CXXFLAGS=-g -O0',
    '--with-cuda',
    '--with-cudac=nvcc',
    '--download-kokkos',
    '--download-kokkos-kernels',
    '--download-kokkos-commit=origin/develop',
    '--download-kokkos-kernels-commit=origin/develop',
    '--with-kokkos-kernels-tpl=0',
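
A sample configure invocation with these options might look like the
following (the arch name is just an example, and this assumes the modules
above are already loaded):

$ ./configure PETSC_ARCH=arch-perlmutter-kokkos-debug \
    --with-debugging --with-cc=cc --with-cxx=CC --with-fc=ftn \
    --download-sowing-cc=cc \
    --CFLAGS='-g -O0' --FFLAGS='-g -O0' --CXXFLAGS='-g -O0' \
    --with-cuda --with-cudac=nvcc \
    --download-kokkos --download-kokkos-kernels \
    --download-kokkos-commit=origin/develop \
    --download-kokkos-kernels-commit=origin/develop \
    --with-kokkos-kernels-tpl=0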

--Junchao Zhang


On Wed, Nov 16, 2022 at 7:05 AM Mark Adams <mfadams at lbl.gov> wrote:

> I cannot build right now on Crusher or Perlmutter, but I saw this on both.
>
> Here is an example output using src/snes/tests/ex13.c with the appended
> .petscrc (a sample launch command follows the options below).
> This run uses 64 processors; the 8-processor case worked. The failure has
> been semi-nondeterministic for me.
>
> (and I have attached my current Perlmutter problem)
>
> Hope this helps,
> Mark
>
> -dm_plex_simplex 0
> -dm_plex_dim 3
> -dm_plex_box_lower 0,0,0
> -dm_plex_box_upper 1,1,1
> -petscpartitioner_simple_process_grid 2,2,2
> -potential_petscspace_degree 2
> -snes_max_it 1
> -ksp_max_it 200
> -ksp_type cg
> -ksp_rtol 1.e-12
> -ksp_norm_type unpreconditioned
> -snes_rtol 1.e-8
> #-pc_type gamg
> #-pc_gamg_type agg
> #-pc_gamg_agg_nsmooths 1
> -pc_gamg_coarse_eq_limit 100
> -pc_gamg_process_eq_limit 400
> -pc_gamg_reuse_interpolation true
> #-snes_monitor
> #-ksp_monitor_short
> -ksp_converged_reason
> #-ksp_view
> #-snes_converged_reason
> #-mg_levels_ksp_max_it 2
> -mg_levels_ksp_type chebyshev
> #-mg_levels_ksp_type richardson
> #-mg_levels_ksp_richardson_scale 0.8
> -mg_levels_pc_type jacobi
> -pc_gamg_esteig_ksp_type cg
> -pc_gamg_esteig_ksp_max_it 10
> -mg_levels_ksp_chebyshev_esteig 0,0.05,0,1.05
> -dm_distribute
> -petscpartitioner_type simple
> -pc_gamg_repartition false
> -pc_gamg_coarse_grid_layout_type compact
> -pc_gamg_threshold 0.01
> #-pc_gamg_threshold_scale .5
> -pc_gamg_aggressive_coarsening 1
> #-check_pointer_intensity 0
> -snes_type ksponly
> #-mg_coarse_sub_pc_factor_mat_solver_type cusparse
> #-info :pc
> #-use_gpu_aware_mpi 1
> -options_left
> #-malloc_debug
> -benchmark_it 10
> #-pc_gamg_use_parallel_coarse_grid_solver
> #-mg_coarse_pc_type jacobi
> #-mg_coarse_ksp_type cg
> #-mg_coarse_ksp_rtol 1.e-2
> #-mat_cusparse_transgen
> -snes_lag_jacobian -2
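>
> A typical launch with these options might look like the following (a sketch,
> not my exact command; srun flags vary by machine):
>
> $ srun -n 64 ./ex13
>
> with the .petscrc above in the working directory, from which PETSc reads
> options automatically (or pass the file explicitly with -options_file).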
>
>
> On Tue, Nov 15, 2022 at 3:42 PM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> Mark,
>> Do you have a reproducer using petsc examples?
>>
>> On Tue, Nov 15, 2022, 12:49 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> Junchao, this is the same problem that I have been having right?
>>>
>>> On Tue, Nov 15, 2022 at 11:56 AM Fackler, Philip via petsc-users <
>>> petsc-users at mcs.anl.gov> wrote:
>>>
>>>> I built petsc with:
>>>>
>>>> $ ./configure PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug
>>>> --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --with-debugging=0
>>>> --prefix=$HOME/build/petsc/debug/install --with-64-bit-indices
>>>> --with-shared-libraries --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --download-kokkos
>>>> --download-kokkos-kernels
>>>>
>>>> $ make PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug all
>>>>
>>>> $ make PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug install
>>>>
>>>>
>>>> Then I build xolotl in a separate build directory (after checking out
>>>> the "feature-petsc-kokkos" branch) with:
>>>>
>>>> $ cmake -DCMAKE_BUILD_TYPE=Debug
>>>> -DKokkos_DIR=$HOME/build/petsc/debug/install
>>>> -DPETSC_DIR=$HOME/build/petsc/debug/install <xolotl-src>
>>>>
>>>> $ make -j4 SystemTester
>>>>
>>>>
>>>> Then, from the xolotl build directory, run (for example):
>>>>
>>>> $ mpirun -n 2 ./test/system/SystemTester -t System/NE_4 -- -v
>>>>
>>>> Note that this test case will use the parameter file
>>>> '<xolotl-src>/benchmarks/params_system_NE_4.txt', which has the
>>>> command-line arguments for petsc on its "petscArgs=..." line (a purely
>>>> hypothetical example of such a line follows below). If you look at
>>>> '<xolotl-src>/test/system/SystemTester.cpp', you'll see that all the
>>>> system test cases follow the same naming convention, with their
>>>> corresponding parameter files under '<xolotl-src>/benchmarks'.
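>>>>
>>>> Purely as a hypothetical illustration (the actual file in the repo is
>>>> the real reference), such a line might look like:
>>>>
>>>> petscArgs=-ts_monitor -snes_monitor -pc_type fieldsplit
>>>>
>>>> Xolotl forwards whatever appears there to PETSc as its options.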
>>>>
>>>> The failure happens with the NE_4 case (which is 2D) and the PSI_3 case
>>>> (which is 1D).
>>>>
>>>> Let me know if this is still unclear.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> *Philip Fackler *
>>>> Research Software Engineer, Application Engineering Group
>>>> Advanced Computing Systems Research Section
>>>> Computer Science and Mathematics Division
>>>> *Oak Ridge National Laboratory*
>>>> ------------------------------
>>>> *From:* Junchao Zhang <junchao.zhang at gmail.com>
>>>> *Sent:* Tuesday, November 15, 2022 00:16
>>>> *To:* Fackler, Philip <facklerpw at ornl.gov>
>>>> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel,
>>>> Sophie <sblondel at utk.edu>
>>>> *Subject:* [EXTERNAL] Re: [petsc-users] Using multiple MPI ranks with
>>>> COO interface crashes in some cases
>>>>
>>>> Hi, Philip,
>>>>   Could you give me instructions for building Xolotl so I can reproduce
>>>> the error?
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Mon, Nov 14, 2022 at 12:24 PM Fackler, Philip via petsc-users <
>>>> petsc-users at mcs.anl.gov> wrote:
>>>>
>>>> In Xolotl's "feature-petsc-kokkos" branch, I have moved our code to use
>>>> the COO interface for preallocating and setting values in the Jacobian
>>>> matrix (a minimal sketch of that interface follows below). I have found
>>>> that, with some of our test cases, using more than one MPI rank results
>>>> in a crash. Way down in the preconditioner code in petsc, a Mat gets
>>>> computed that has "null" for the "productsymbolic" member of its "ops".
>>>> That is pretty far removed from where we compute the Jacobian entries, so
>>>> I haven't (so far) been able to track it back to an error in my code. I'd
>>>> appreciate some help with this from someone who is more familiar with the
>>>> petsc guts, so we can figure out what I'm doing wrong. (I'm assuming it's
>>>> a bug in Xolotl.)
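>>>>
>>>> For context, the COO usage boils down to calls like the following (a
>>>> minimal sketch with made-up indices and a hypothetical Mat J, not
>>>> Xolotl's actual code):
>>>>
>>>> /* Once, up front: declare every nonzero location (i,j) of the Jacobian */
>>>> PetscInt    coo_i[] = {0, 0, 1};   /* row index of each entry    */
>>>> PetscInt    coo_j[] = {0, 1, 1};   /* column index of each entry */
>>>> PetscCall(MatSetPreallocationCOO(J, 3, coo_i, coo_j));
>>>>
>>>> /* On each Jacobian evaluation: supply the values in the same order */
>>>> PetscScalar coo_v[] = {2.0, -1.0, 2.0};
>>>> PetscCall(MatSetValuesCOO(J, coo_v, INSERT_VALUES));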
>>>>
>>>> Note that this is using the kokkos backend for Mat and Vec in petsc,
>>>> but with a serial-only build of kokkos and kokkos-kernels. So it's a
>>>> CPU-only run with multiple MPI ranks.
>>>>
>>>> Here's a paste of the error output showing the relevant parts of the
>>>> call stack:
>>>>
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] --------------------- Error Message
>>>> --------------------------------------------------------------
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] No support for this operation for this object type
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] No support for this operation for this object type
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] No method productsymbolic for Mat of type (null)
>>>> [ERROR] No method productsymbolic for Mat of type (null)
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] See https://petsc.org/release/faq/ for trouble shooting.
>>>> [ERROR] See https://petsc.org/release/faq/ for trouble shooting.
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT
>>>> Date: 2022-10-28 14:39:41 +0000
>>>> [ERROR] Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT
>>>> Date: 2022-10-28 14:39:41 +0000
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] Unknown Name on a  named PC0115427 by 4pf Mon Nov 14 13:22:01
>>>> 2022
>>>> [ERROR] Unknown Name on a  named PC0115427 by 4pf Mon Nov 14 13:22:01
>>>> 2022
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] Configure options PETSC_DIR=/home/4pf/repos/petsc
>>>> PETSC_ARCH=arch-kokkos-serial-debug --with-debugging=1 --with-cc=mpicc
>>>> --with-cxx=mpicxx --with-fc=0 --with-cudac=0
>>>> --prefix=/home/4pf/build/petsc/serial-debug/install --with-64-bit-indices
>>>> --with-shared-libraries
>>>> --with-kokkos-dir=/home/4pf/build/kokkos/serial/install
>>>> --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/serial/install
>>>> [ERROR] Configure options PETSC_DIR=/home/4pf/repos/petsc
>>>> PETSC_ARCH=arch-kokkos-serial-debug --with-debugging=1 --with-cc=mpicc
>>>> --with-cxx=mpicxx --with-fc=0 --with-cudac=0
>>>> --prefix=/home/4pf/build/petsc/serial-debug/install --with-64-bit-indices
>>>> --with-shared-libraries
>>>> --with-kokkos-dir=/home/4pf/build/kokkos/serial/install
>>>> --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/serial/install
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #1 MatProductSymbolic_MPIAIJKokkos_AB() at
>>>> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:918
>>>> [ERROR] #1 MatProductSymbolic_MPIAIJKokkos_AB() at
>>>> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:918
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #2 MatProductSymbolic_MPIAIJKokkos() at
>>>> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1138
>>>> [ERROR] #2 MatProductSymbolic_MPIAIJKokkos() at
>>>> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1138
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #3 MatProductSymbolic() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matproduct.c:793
>>>> [ERROR] #3 MatProductSymbolic() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matproduct.c:793
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #4 MatProduct_Private() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9820
>>>> [ERROR] #4 MatProduct_Private() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9820
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #5 MatMatMult() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9897
>>>> [ERROR] #5 MatMatMult() at
>>>> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9897
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #6 PCGAMGOptProlongator_AGG() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/agg.c:769
>>>> [ERROR] #6 PCGAMGOptProlongator_AGG() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/agg.c:769
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #7 PCSetUp_GAMG() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/gamg.c:639
>>>> [ERROR] #7 PCSetUp_GAMG() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/gamg.c:639
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #8 PCSetUp() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:994
>>>> [ERROR] #8 PCSetUp() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:994
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #9 KSPSetUp() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:406
>>>> [ERROR] #9 KSPSetUp() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:406
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #10 KSPSolve_Private() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:825
>>>> [ERROR] #10 KSPSolve_Private() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:825
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #11 KSPSolve() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
>>>> [ERROR] #11 KSPSolve() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #12 PCApply_FieldSplit() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1246
>>>> [ERROR] #12 PCApply_FieldSplit() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1246
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #13 PCApply() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:441
>>>> [ERROR] #13 PCApply() at
>>>> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:441
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #14 KSP_PCApply() at
>>>> /home/4pf/repos/petsc/include/petsc/private/kspimpl.h:380
>>>> [ERROR] #14 KSP_PCApply() at
>>>> /home/4pf/repos/petsc/include/petsc/private/kspimpl.h:380
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #15 KSPFGMRESCycle() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:152
>>>> [ERROR] #15 KSPFGMRESCycle() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:152
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #16 KSPSolve_FGMRES() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:273
>>>> [ERROR] #16 KSPSolve_FGMRES() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:273
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #17 KSPSolve_Private() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:899
>>>> [ERROR] #17 KSPSolve_Private() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:899
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #18 KSPSolve() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
>>>> [ERROR] #18 KSPSolve() at
>>>> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] #19 SNESSolve_NEWTONLS() at
>>>> /home/4pf/repos/petsc/src/snes/impls/ls/ls.c:210
>>>> [ERROR] #19 SNESSolve_NEWTONLS() at
>>>> /home/4pf/repos/petsc/src/snes/impls/ls/ls.c:210
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #20 SNESSolve() at
>>>> /home/4pf/repos/petsc/src/snes/interface/snes.c:4689
>>>> [ERROR] #20 SNESSolve() at
>>>> /home/4pf/repos/petsc/src/snes/interface/snes.c:4689
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #21 TSStep_ARKIMEX() at
>>>> /home/4pf/repos/petsc/src/ts/impls/arkimex/arkimex.c:791
>>>> [ERROR] #21 TSStep_ARKIMEX() at
>>>> /home/4pf/repos/petsc/src/ts/impls/arkimex/arkimex.c:791
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #22 TSStep() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3445
>>>> [ERROR] #22 TSStep() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3445
>>>> [ERROR] [1]PETSC ERROR:
>>>> [ERROR] [0]PETSC ERROR:
>>>> [ERROR] #23 TSSolve() at
>>>> /home/4pf/repos/petsc/src/ts/interface/ts.c:3836
>>>> [ERROR] #23 TSSolve() at
>>>> /home/4pf/repos/petsc/src/ts/interface/ts.c:3836
>>>> [ERROR] PetscSolver::solve: TSSolve failed.
>>>> [ERROR] PetscSolver::solve: TSSolve failed.
>>>> Aborting.
>>>> Aborting.
>>>>
>>>>
>>>>
>>>> Thanks for the help,
>>>>
>>>>
>>>> *Philip Fackler *
>>>> Research Software Engineer, Application Engineering Group
>>>> Advanced Computing Systems Research Section
>>>> Computer Science and Mathematics Division
>>>> *Oak Ridge National Laboratory*
>>>>
>>>>