[petsc-users] [EXTERNAL] Re: Using multiple MPI ranks with COO interface crashes in some cases

Junchao Zhang junchao.zhang at gmail.com
Thu Dec 1 13:54:30 CST 2022


Hi, Philip,
  The petsc bug is fixed in
https://gitlab.com/petsc/petsc/-/merge_requests/5892, which is now in
petsc/release and will be merged into petsc/main.
  Thanks.

--Junchao Zhang


On Tue, Nov 22, 2022 at 11:56 AM Fackler, Philip <facklerpw at ornl.gov> wrote:

> Great! Thank you!
>
>
> *Philip Fackler *
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> *Oak Ridge National Laboratory*
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Tuesday, November 22, 2022 12:02
> *To:* Fackler, Philip <facklerpw at ornl.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel, Sophie <
> sblondel at utk.edu>
> *Subject:* Re: [EXTERNAL] Re: [petsc-users] Using multiple MPI ranks with
> COO interface crashes in some cases
>
>
>
>
> On Tue, Nov 22, 2022 at 10:14 AM Fackler, Philip <facklerpw at ornl.gov>
> wrote:
>
> Yes, that one is. I haven't updated the tests. So just build the
> SystemTester target or the xolotl target.
>
> OK, I see. I reproduced the petsc error and am looking into it.  Thanks a
> lot.
>
>
>
>
> *Philip Fackler *
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> *Oak Ridge National Laboratory*
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Monday, November 21, 2022 15:36
> *To:* Fackler, Philip <facklerpw at ornl.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel, Sophie <
> sblondel at utk.edu>
> *Subject:* Re: [EXTERNAL] Re: [petsc-users] Using multiple MPI ranks with
> COO interface crashes in some cases
>
>
>
>
> On Mon, Nov 21, 2022 at 9:31 AM Fackler, Philip <facklerpw at ornl.gov>
> wrote:
>
> Not sure why. I'm using the same compiler. But you can try constructing
> the object explicitly on that line:
>
> idPairs.push_back(core::RowColPair{i, i});
>
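> For context: the braced-list call push_back({i, i}) is standard C++, but
> some compiler front ends reject it during overload resolution, which matches
> the error further down in this thread. Here is a minimal sketch contrasting
> the two forms, with a hypothetical two-field aggregate standing in for
> core::RowColPair (the real definition is not shown in this thread):
>
> #include <vector>
>
> // Hypothetical stand-in for xolotl's core::RowColPair
> struct RowColPair {
>     long row;
>     long col;
> };
>
> int main() {
>     std::vector<RowColPair> idPairs;
>     long i = 3;
>     // Standard C++, but rejected by some front ends:
>     // idPairs.push_back({i, i});
>     // Naming the type explicitly avoids braced-list deduction entirely:
>     idPairs.push_back(RowColPair{i, i});
> }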
>
> With your change, I continued but hit another error:
>
>  /home/jczhang/xolotl/test/core/diffusion/Diffusion2DHandlerTester.cpp(79):
> error: class "xolotl::core::diffusion::Diffusion2DHandler" has no member
> "initializeOFill"
>
> It seems all these problems are related to the branch
> *feature-petsc-kokkos*, not to the compiler etc. When I switched to
> origin/stable, I could build xolotl.
>
>
>
>
> *Philip Fackler *
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> *Oak Ridge National Laboratory*
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Sunday, November 20, 2022 13:25
> *To:* Fackler, Philip <facklerpw at ornl.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel, Sophie <
> sblondel at utk.edu>
> *Subject:* Re: [EXTERNAL] Re: [petsc-users] Using multiple MPI ranks with
> COO interface crashes in some cases
>
>
>
> On Tue, Nov 15, 2022 at 10:55 AM Fackler, Philip <facklerpw at ornl.gov>
> wrote:
>
> I built petsc with:
>
> $ ./configure PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug
> --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --with-debugging=0
> --prefix=$HOME/build/petsc/debug/install --with-64-bit-indices
> --with-shared-libraries --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 --download-kokkos
> --download-kokkos-kernels
>
> $ make PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug all
>
> $ make PETSC_DIR=$PWD PETSC_ARCH=arch-kokkos-serial-debug install
>
>
> Then I build xolotl in a separate build directory (after checking out the
> "feature-petsc-kokkos" branch) with:
>
> $ cmake -DCMAKE_BUILD_TYPE=Debug
> -DKokkos_DIR=$HOME/build/petsc/debug/install
> -DPETSC_DIR=$HOME/build/petsc/debug/install <xolotl-src>
>
> $ make -j4 SystemTester
>
> Hi, Philip, I tried multiple times and still failed to build xolotl.
> I installed boost-1.74 and HDF5, and used gcc-11.3.
>
> make -j4 SystemTester
> ...
> [  9%] Building CXX object
> xolotl/core/CMakeFiles/xolotlCore.dir/src/diffusion/DiffusionHandler.cpp.o
> /home/jczhang/xolotl/xolotl/core/src/diffusion/DiffusionHandler.cpp(55):
> error: no instance of overloaded function "std::vector<_Tp,
> _Alloc>::push_back [with _Tp=xolotl::core::RowColPair,
> _Alloc=std::allocator<xolotl::core::RowColPair>]" matches the argument list
>             argument types are: ({...})
>             object type is: std::vector<xolotl::core::RowColPair,
> std::allocator<xolotl::core::RowColPair>>
>
> 1 error detected in the compilation of
> "/home/jczhang/xolotl/xolotl/core/src/diffusion/DiffusionHandler.cpp".
>
>
>
>
> Then, from the xolotl build directory, run (for example):
>
> $ mpirun -n 2 ./test/system/SystemTester -t System/NE_4 -- -v
>
> Note that this test case will use the parameter file
> '<xolotl-src>/benchmarks/params_system_NE_4.txt', which has the command-line
> arguments for petsc in its "petscArgs=..." line. If you look at
> '<xolotl-src>/test/system/SystemTester.cpp', you'll see that all the system
> test cases follow the same naming convention, with their corresponding
> parameter files under '<xolotl-src>/benchmarks'.
>
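> For illustration only (the actual file contents are not reproduced in this
> thread), a petscArgs line consistent with the solver stack visible in the
> error trace below might look like:
>
> petscArgs=-ts_type arkimex -snes_type newtonls -ksp_type fgmres
> -pc_type fieldsplit -dm_mat_type aijkokkos -dm_vec_type kokkos
>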
> The failure happens with the NE_4 case (which is 2D) and the PSI_3 case
> (which is 1D).
>
> Let me know if this is still unclear.
>
> Thanks,
>
>
> *Philip Fackler *
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> *Oak Ridge National Laboratory*
> ------------------------------
> *From:* Junchao Zhang <junchao.zhang at gmail.com>
> *Sent:* Tuesday, November 15, 2022 00:16
> *To:* Fackler, Philip <facklerpw at ornl.gov>
> *Cc:* petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>; Blondel, Sophie <
> sblondel at utk.edu>
> *Subject:* [EXTERNAL] Re: [petsc-users] Using multiple MPI ranks with COO
> interface crashes in some cases
>
> Hi, Philip,
>   Can you tell me instructions to build Xolotl to reproduce the error?
> --Junchao Zhang
>
>
> On Mon, Nov 14, 2022 at 12:24 PM Fackler, Philip via petsc-users <
> petsc-users at mcs.anl.gov> wrote:
>
> In Xolotl's "feature-petsc-kokkos" branch, I have moved our code to use
> the COO interface for preallocating and setting values in the Jacobian
> matrix. I have found that with some of our test cases, using more than one
> MPI rank results in a crash. Way down in the preconditioner code in petsc a
> Mat gets computed that has "null" for the "productsymbolic" member of its
> "ops". It's pretty far removed from where we compute the Jacobian entries,
> so I haven't been able (so far) to track it back to an error in my code.
> I'd appreciate some help with this from someone who is more familiar with
> the petsc guts so we can figure out what I'm doing wrong. (I'm assuming
> it's a bug in Xolotl.)
>
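> For readers unfamiliar with it, the COO interface referred to here consists
> of MatSetPreallocationCOO() and MatSetValuesCOO(): all (i, j) coordinates of
> the nonzeros are declared once up front, and the values are later pushed in
> a single call, in the same order. A minimal sketch of that pattern (not
> Xolotl's actual assembly; it fills only the diagonal of a matrix J over the
> local row range [rStart, rEnd) for illustration):
>
> #include <petscmat.h>
> #include <vector>
>
> PetscErrorCode assembleDiagonal(Mat J, PetscInt rStart, PetscInt rEnd)
> {
>   PetscFunctionBegin;
>   PetscCount ncoo = (PetscCount)(rEnd - rStart);
>   std::vector<PetscInt> cooRows(ncoo), cooCols(ncoo);
>   for (PetscCount k = 0; k < ncoo; ++k) {
>     cooRows[k] = rStart + (PetscInt)k; // global row index
>     cooCols[k] = rStart + (PetscInt)k; // global column index
>   }
>   // Preallocation: declare every coordinate the assembly will touch
>   PetscCall(MatSetPreallocationCOO(J, ncoo, cooRows.data(), cooCols.data()));
>   // Value setting: one scalar per declared coordinate, in the same order
>   std::vector<PetscScalar> cooVals(ncoo, 1.0);
>   PetscCall(MatSetValuesCOO(J, cooVals.data(), INSERT_VALUES));
>   PetscFunctionReturn(0);
> }
>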
> Note that this is using the kokkos backend for Mat and Vec in petsc, but
> with a serial-only build of kokkos and kokkos-kernels. So, it's a CPU-only
> multiple MPI rank run.
>
> Here's a paste of the error output showing the relevant parts of the call
> stack:
>
> [ERROR] [0]PETSC ERROR:
> [ERROR] --------------------- Error Message
> --------------------------------------------------------------
> [ERROR] [1]PETSC ERROR:
> [ERROR] --------------------- Error Message
> --------------------------------------------------------------
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] No support for this operation for this object type
> [ERROR] [1]PETSC ERROR:
> [ERROR] No support for this operation for this object type
> [ERROR] [0]PETSC ERROR:
> [ERROR] No method productsymbolic for Mat of type (null)
> [ERROR] No method productsymbolic for Mat of type (null)
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] See https://petsc.org/release/faq/ for trouble shooting.
> [ERROR] See https://petsc.org/release/faq/ for trouble shooting.
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT
> Date: 2022-10-28 14:39:41 +0000
> [ERROR] Petsc Development GIT revision: v3.18.1-115-gdca010e0e9a  GIT
> Date: 2022-10-28 14:39:41 +0000
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] Unknown Name on a  named PC0115427 by 4pf Mon Nov 14 13:22:01 2022
> [ERROR] Unknown Name on a  named PC0115427 by 4pf Mon Nov 14 13:22:01 2022
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] Configure options PETSC_DIR=/home/4pf/repos/petsc
> PETSC_ARCH=arch-kokkos-serial-debug --with-debugging=1 --with-cc=mpicc
> --with-cxx=mpicxx --with-fc=0 --with-cudac=0
> --prefix=/home/4pf/build/petsc/serial-debug/install --with-64-bit-indices
> --with-shared-libraries
> --with-kokkos-dir=/home/4pf/build/kokkos/serial/install
> --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/serial/install
> [ERROR] Configure options PETSC_DIR=/home/4pf/repos/petsc
> PETSC_ARCH=arch-kokkos-serial-debug --with-debugging=1 --with-cc=mpicc
> --with-cxx=mpicxx --with-fc=0 --with-cudac=0
> --prefix=/home/4pf/build/petsc/serial-debug/install --with-64-bit-indices
> --with-shared-libraries
> --with-kokkos-dir=/home/4pf/build/kokkos/serial/install
> --with-kokkos-kernels-dir=/home/4pf/build/kokkos-kernels/serial/install
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #1 MatProductSymbolic_MPIAIJKokkos_AB() at
> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:918
> [ERROR] #1 MatProductSymbolic_MPIAIJKokkos_AB() at
> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:918
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #2 MatProductSymbolic_MPIAIJKokkos() at
> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1138
> [ERROR] #2 MatProductSymbolic_MPIAIJKokkos() at
> /home/4pf/repos/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1138
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #3 MatProductSymbolic() at
> /home/4pf/repos/petsc/src/mat/interface/matproduct.c:793
> [ERROR] #3 MatProductSymbolic() at
> /home/4pf/repos/petsc/src/mat/interface/matproduct.c:793
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #4 MatProduct_Private() at
> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9820
> [ERROR] #4 MatProduct_Private() at
> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9820
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #5 MatMatMult() at
> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9897
> [ERROR] #5 MatMatMult() at
> /home/4pf/repos/petsc/src/mat/interface/matrix.c:9897
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #6 PCGAMGOptProlongator_AGG() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/agg.c:769
> [ERROR] #6 PCGAMGOptProlongator_AGG() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/agg.c:769
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #7 PCSetUp_GAMG() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/gamg.c:639
> [ERROR] #7 PCSetUp_GAMG() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/gamg/gamg.c:639
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #8 PCSetUp() at
> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:994
> [ERROR] #8 PCSetUp() at
> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:994
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #9 KSPSetUp() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:406
> [ERROR] #9 KSPSetUp() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:406
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #10 KSPSolve_Private() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:825
> [ERROR] #10 KSPSolve_Private() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:825
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #11 KSPSolve() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
> [ERROR] #11 KSPSolve() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #12 PCApply_FieldSplit() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1246
> [ERROR] #12 PCApply_FieldSplit() at
> /home/4pf/repos/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1246
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #13 PCApply() at
> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:441
> [ERROR] #13 PCApply() at
> /home/4pf/repos/petsc/src/ksp/pc/interface/precon.c:441
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #14 KSP_PCApply() at
> /home/4pf/repos/petsc/include/petsc/private/kspimpl.h:380
> [ERROR] #14 KSP_PCApply() at
> /home/4pf/repos/petsc/include/petsc/private/kspimpl.h:380
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #15 KSPFGMRESCycle() at
> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:152
> [ERROR] #15 KSPFGMRESCycle() at
> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:152
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #16 KSPSolve_FGMRES() at
> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:273
> [ERROR] #16 KSPSolve_FGMRES() at
> /home/4pf/repos/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c:273
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #17 KSPSolve_Private() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:899
> [ERROR] #17 KSPSolve_Private() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:899
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #18 KSPSolve() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
> [ERROR] #18 KSPSolve() at
> /home/4pf/repos/petsc/src/ksp/ksp/interface/itfunc.c:1071
> [ERROR] [0]PETSC ERROR:
> [ERROR] [1]PETSC ERROR:
> [ERROR] #19 SNESSolve_NEWTONLS() at
> /home/4pf/repos/petsc/src/snes/impls/ls/ls.c:210
> [ERROR] #19 SNESSolve_NEWTONLS() at
> /home/4pf/repos/petsc/src/snes/impls/ls/ls.c:210
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #20 SNESSolve() at
> /home/4pf/repos/petsc/src/snes/interface/snes.c:4689
> [ERROR] #20 SNESSolve() at
> /home/4pf/repos/petsc/src/snes/interface/snes.c:4689
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #21 TSStep_ARKIMEX() at
> /home/4pf/repos/petsc/src/ts/impls/arkimex/arkimex.c:791
> [ERROR] #21 TSStep_ARKIMEX() at
> /home/4pf/repos/petsc/src/ts/impls/arkimex/arkimex.c:791
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #22 TSStep() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3445
> [ERROR] #22 TSStep() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3445
> [ERROR] [1]PETSC ERROR:
> [ERROR] [0]PETSC ERROR:
> [ERROR] #23 TSSolve() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3836
> [ERROR] #23 TSSolve() at /home/4pf/repos/petsc/src/ts/interface/ts.c:3836
> [ERROR] PetscSolver::solve: TSSolve failed.
> [ERROR] PetscSolver::solve: TSSolve failed.
> Aborting.
> Aborting.
>
>
>
> Thanks for the help,
>
>
> *Philip Fackler *
> Research Software Engineer, Application Engineering Group
> Advanced Computing Systems Research Section
> Computer Science and Mathematics Division
> *Oak Ridge National Laboratory*
>
>