[petsc-users] Bug or misuse: 64-bit indices PETSc MPI linear solver server with more than 8 cores

Lin_Yuxiang linyx199071 at gmail.com
Wed Sep 4 14:52:59 CDT 2024


To whom it may concern:



I recently tried to use the 64-bit-indices build of PETSc with the MPI linear
solver server to replace the legacy code's solver. However, it gives an error
when I use more than 8 cores:



Get NNZ
MatsetPreallocation
MatsetValue
MatSetValue Time per kernel: 43.1147 s
Matassembly
VecsetValue
pestc_solve

Read -1, expected 1951397280, errno = 14
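
For context, geosimtrs_mpiserver.c (the driver in the backtraces below) follows
the usual MPI linear solver server pattern, where the code is written as if
sequential and rank 0 drives the solve. A stripped-down sketch of that pattern
(placeholder size and coefficients, not the actual legacy assembly) looks
roughly like this:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      b, x;
  KSP      ksp;
  PetscInt n = 100; /* placeholder size; the real system is far larger */

  /* With -mpi_linear_solver_server on the command line, my understanding is
     that the server ranks stay inside PetscInitialize() serving requests
     (PCMPIServerBegin() in the backtraces below). */
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Rank 0 assembles the matrix sequentially, as the legacy code does */
  PetscCall(MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A));
  for (PetscInt i = 0; i < n; i++) PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES)); /* placeholder coefficients */
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0)); /* placeholder right-hand side */

  PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp)); /* picks up the solver-server options; the PC becomes PCMPI, as I understand it */
  PetscCall(KSPSolve(ksp, b, x));    /* the CSR data is sent to the server ranks here (PCMPISetMat in the traces) */

  PetscCall(KSPDestroy(&ksp));
  PetscCall(MatDestroy(&A));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(PetscFinalize());
  return 0;
}

This is launched with something like mpirun -n <N> ./geosimtrs_mpiserver
-mpi_linear_solver_server plus the usual KSP/PC options.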



When I ran with -start_in_debugger, the error appears to come from MPI_Scatterv:



Rank 0:

#3  0x00001555512e4de5 in mca_pml_ob1_recv () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#4  0x0000155553e01e60 in PMPI_Scatterv () from /lib/x86_64-linux-gnu/libmpi.so.40
#5  0x0000155554b13eab in PCMPISetMat (pc=pc@entry=0x0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:230
#6  0x0000155554b17403 in PCMPIServerBegin () at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#7  0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b "geosimtrs_mpiserver", file=file@entry=0x0,
    help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n  -view_exact_sol   : write exact solution vector to stdout\n  -m <mesh_x>       : number of mesh points in x-direction\n  -n <mesh"..., ftn=ftn@entry=PETSC_FALSE, readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0)
    at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#8  0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c, args=args@entry=0x7fffffffda80, file=file@entry=0x0,
    help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n  -view_exact_sol   : write exact solution vector to stdout\n  -m <mesh_x>       : number of mesh points in x-direction\n  -n <mesh"...) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#9  0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>) at geosimtrs_mpiserver.c:29



Ranks 1-10:

0x0000155553e1f030 in ompi_coll_base_allgather_intra_bruck () from /lib/x86_64-linux-gnu/libmpi.so.40
#4  0x0000155550f62aaa in ompi_coll_tuned_allgather_intra_dec_fixed () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#5  0x0000155553ddb431 in PMPI_Allgather () from /lib/x86_64-linux-gnu/libmpi.so.40
#6  0x00001555541a2289 in PetscLayoutSetUp (map=0x555555721ed0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/vec/is/utils/pmap.c:248
#7  0x000015555442e06a in MatMPIAIJSetPreallocationCSR_MPIAIJ (B=0x55555572d850, Ii=0x15545a778010, J=0x15545beacb60, v=0x1554cff55e60)
    at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3885
#8  0x00001555544284e3 in MatMPIAIJSetPreallocationCSR (B=0x55555572d850, i=0x15545a778010, j=0x15545beacb60, v=0x1554cff55e60) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3998
#9  0x0000155554b1412f in PCMPISetMat (pc=pc@entry=0x0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:250
#10 0x0000155554b17403 in PCMPIServerBegin () at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#11 0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b "geosimtrs_mpiserver", file=file@entry=0x0,
    help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n  -view_exact_sol   : write exact solution vector to stdout\n  -m <mesh_x>       : number of mesh points in x-direction\n  -n <mesh"..., ftn=ftn@entry=PETSC_FALSE, readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#12 0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c, args=args@entry=0x7fffffffda80, file=file@entry=0x0,
    help=help@entry=0x55555555a1e0 <help> "Solves a linear system in parallel with KSP.\nInput parameters include:\n  -view_exact_sol   : write exact solution vector to stdout\n  -m <mesh_x>       : number of mesh points in x-direction\n  -n <mesh"...) at /auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#13 0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>) at geosimtrs_mpiserver.c:29



This does not happen with the 32-bit-indices PETSc: there, running with more
than 8 cores through the MPI linear solver server works smoothly. It also does
not happen with the 64-bit-indices build when the solve is run as an ordinary
parallel MPI solve (without mpi_linear_solver_server). The failure only occurs
with 64-bit indices combined with the MPI linear solver server, so I think it
may be a potential bug.



Any advice would be greatly appreciated. The matrix and the ia/ja arrays are
too big to upload, so if you need anything to debug this, please let me know.
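
If a PETSc binary dump of the assembled matrix (something MatLoad() could read
back) would help, I can write one out with a small helper along these lines
(the helper name and file name are just placeholders):

#include <petscmat.h>
#include <petscviewer.h>

/* Placeholder helper: write the assembled matrix to a PETSc binary file so
   it can be reloaded elsewhere with MatLoad(). Names are illustrative only. */
static PetscErrorCode DumpMatrixBinary(Mat A, const char filename[])
{
  PetscViewer viewer;

  PetscFunctionBeginUser;
  PetscCall(PetscViewerBinaryOpen(PetscObjectComm((PetscObject)A), filename, FILE_MODE_WRITE, &viewer));
  PetscCall(MatView(A, viewer));
  PetscCall(PetscViewerDestroy(&viewer));
  PetscFunctionReturn(PETSC_SUCCESS);
}

I would call it right after MatAssemblyEnd(), e.g. DumpMatrixBinary(A, "geosim_matrix.dat").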



   - Machine type: HPC
   - OS version and type: Linux houamd009 6.1.55-cggdb11-1 #1 SMP Fri Sep 29 10:09:13 UTC 2023 x86_64 GNU/Linux
   - PETSc version:
     #define PETSC_VERSION_RELEASE    1
     #define PETSC_VERSION_MAJOR      3
     #define PETSC_VERSION_MINOR      20
     #define PETSC_VERSION_SUBMINOR   4
     #define PETSC_RELEASE_DATE       "Sep 28, 2023"
     #define PETSC_VERSION_DATE       "Jan 29, 2024"
   - MPI implementation: OpenMPI
   - Compiler and version: GNU



Yuxiang Lin