[petsc-users] Bug or misuse of 64-bit-indices PETSc MPI linear solver server with more than 8 cores
Lin_Yuxiang
linyx199071 at gmail.com
Wed Sep 4 14:52:59 CDT 2024
To whom it may concern:
I recently tried to use the 64-bit-indices PETSc, with the MPI linear solver
server, to replace our legacy code's solver. However, it gives me an error
whenever I use more than 8 cores:
Get NNZ
MatsetPreallocation
MatsetValue
MatSetValue Time per kernel: 43.1147 s
Matassembly
VecsetValue
pestc_solve
Read -1, expected 1951397280, errno = 14
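For context, the wrapper follows the usual sequential PETSc pattern; below is a
minimal sketch of the steps named in the log (simplified, with placeholder names
rather than the actual legacy code, assuming CSR arrays ia/ja/a on one rank and a
--with-64-bit-indices build, so PetscInt is 64-bit):

#include <petscksp.h>

/* Simplified sketch of the solve path in the log above (names are placeholders,
   not the real wrapper). The objects live on PETSC_COMM_SELF; the
   -mpi_linear_solver_server option is what distributes the actual solve. */
static PetscErrorCode solve_csr(PetscInt n, const PetscInt *ia, const PetscInt *ja,
                                const PetscScalar *a, const PetscScalar *rhs, Vec *x)
{
  Mat      A;
  Vec      b;
  KSP      ksp;
  PetscInt row, k, *nnz;

  PetscFunctionBeginUser;
  /* "Get NNZ": per-row counts from the CSR row pointer */
  PetscCall(PetscMalloc1(n, &nnz));
  for (row = 0; row < n; row++) nnz[row] = ia[row + 1] - ia[row];

  /* "MatsetPreallocation" */
  PetscCall(MatCreate(PETSC_COMM_SELF, &A));
  PetscCall(MatSetSizes(A, n, n, n, n));
  PetscCall(MatSetType(A, MATSEQAIJ));
  PetscCall(MatSeqAIJSetPreallocation(A, 0, nnz));
  PetscCall(PetscFree(nnz));

  /* "MatsetValue": one insertion per nonzero */
  for (row = 0; row < n; row++)
    for (k = ia[row]; k < ia[row + 1]; k++)
      PetscCall(MatSetValue(A, row, ja[k], a[k], INSERT_VALUES));

  /* "Matassembly" */
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  /* "VecsetValue" */
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, n, &b));
  for (row = 0; row < n; row++) PetscCall(VecSetValue(b, row, rhs[row], INSERT_VALUES));
  PetscCall(VecAssemblyBegin(b));
  PetscCall(VecAssemblyEnd(b));
  PetscCall(VecDuplicate(b, x));

  /* "pestc_solve": KSPSetFromOptions picks up the command-line solver options */
  PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, *x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscFunctionReturn(PETSC_SUCCESS);
}

The executable is launched with something like
mpiexec -n <N> ./geosimtrs_mpiserver -mpi_linear_solver_server, so ranks other
than 0 sit in PCMPIServerBegin, as in the stack traces below.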
When I ran with -start_in_debugger, the error seems to come from MPI_Scatterv:
Rank 0:
#3 0x00001555512e4de5 in mca_pml_ob1_recv () from
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#4 0x0000155553e01e60 in PMPI_Scatterv () from
/lib/x86_64-linux-gnu/libmpi.so.40
#5 0x0000155554b13eab in PCMPISetMat (pc=pc@entry=0x0) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:230
#6 0x0000155554b17403 in PCMPIServerBegin () at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#7 0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b
"geosimtrs_mpiserver", file=file at entry=0x0,
help=help at entry=0x55555555a1e0 <help> "Solves a linear system in
parallel with KSP.\nInput parameters include:\n -view_exact_sol : write
exact solution vector to stdout\n -m <mesh_x> : number of mesh
points in x-direction\n -n <mesh"..., ftn=ftn@entry=PETSC_FALSE,
readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0)
at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#8 0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c,
args=args@entry=0x7fffffffda80, file=file@entry=0x0,
help=help@entry=0x55555555a1e0 <help> "Solves a linear system in
parallel with KSP.\nInput parameters include:\n -view_exact_sol : write
exact solution vector to stdout\n -m <mesh_x> : number of mesh
points in x-direction\n -n <mesh"...) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#9 0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>)
at geosimtrs_mpiserver.c:29
Ranks 1-10:
#3 0x0000155553e1f030 in ompi_coll_base_allgather_intra_bruck () from
/lib/x86_64-linux-gnu/libmpi.so.40
#4 0x0000155550f62aaa in ompi_coll_tuned_allgather_intra_dec_fixed () from
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#5 0x0000155553ddb431 in PMPI_Allgather () from
/lib/x86_64-linux-gnu/libmpi.so.40
#6 0x00001555541a2289 in PetscLayoutSetUp (map=0x555555721ed0) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/vec/is/utils/pmap.c:248
#7 0x000015555442e06a in MatMPIAIJSetPreallocationCSR_MPIAIJ
(B=0x55555572d850, Ii=0x15545a778010, J=0x15545beacb60, v=0x1554cff55e60)
at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3885
#8 0x00001555544284e3 in MatMPIAIJSetPreallocationCSR (B=0x55555572d850,
i=0x15545a778010, j=0x15545beacb60, v=0x1554cff55e60) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/mat/impls/aij/mpi/mpiaij.c:3998
#9 0x0000155554b1412f in PCMPISetMat (pc=pc@entry=0x0) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:250
#10 0x0000155554b17403 in PCMPIServerBegin () at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/ksp/pc/impls/mpi/pcmpi.c:464
#11 0x00001555540b9aa4 in PetscInitialize_Common (prog=0x7fffffffe27b
"geosimtrs_mpiserver", file=file at entry=0x0,
help=help at entry=0x55555555a1e0 <help> "Solves a linear system in
parallel with KSP.\nInput parameters include:\n -view_exact_sol : write
exact solution vector to stdout\n -m <mesh_x> : number of mesh
points in x-direction\n -n <mesh"..., ftn=ftn@entry=PETSC_FALSE,
readarguments=readarguments@entry=PETSC_FALSE, len=len@entry=0) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1109
#12 0x00001555540bba82 in PetscInitialize (argc=argc@entry=0x7fffffffda8c,
args=args@entry=0x7fffffffda80, file=file@entry=0x0,
help=help@entry=0x55555555a1e0 <help> "Solves a linear system in
parallel with KSP.\nInput parameters include:\n -view_exact_sol : write
exact solution vector to stdout\n -m <mesh_x> : number of mesh
points in x-direction\n -n <mesh"...) at
/auto/research/rdfs/home/lyuxiang/petsc-3.20.4/src/sys/objects/pinit.c:1274
#13 0x0000555555557673 in main (argc=<optimized out>, args=<optimized out>)
at geosimtrs_mpiserver.c:29
This does not happen with the 32-bit-indices PETSc: runs with more than 8
cores go through smoothly with the MPI linear solver server. Nor does it
happen with the 64-bit-indices build run in ordinary MPI mode (without
mpi_linear_solver_server). It only happens with the 64-bit-indices PETSc
MPI linear solver server, so I think it may be a potential bug?
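As a rough sanity check on the number in the error message (my own
back-of-the-envelope arithmetic, assuming the buffer being scattered holds
8-byte entries, i.e. 64-bit PetscInt or double):

#include <stdio.h>
#include <stdint.h>

/* Back-of-the-envelope check of the "expected" byte count in the error above
   (assumes 8-byte entries; purely illustrative). */
int main(void)
{
  const int64_t bytes   = 1951397280; /* "Read -1, expected 1951397280" */
  const int64_t entries = bytes / 8;  /* 8-byte entries with 64-bit indices */

  printf("%lld bytes ~= %.2f GiB = %lld 8-byte entries\n",
         (long long)bytes, bytes / 1073741824.0, (long long)entries);
  printf("same entry count with 4-byte (32-bit) indices: %lld bytes\n",
         (long long)(entries * 4));
  printf("signed 32-bit int limit: %lld\n", (long long)INT32_MAX);
  return 0;
}

That count is already within about 10% of the 2^31 - 1 signed-int limit, and
the ia/ja arrays are twice as large with 64-bit indices as with 32-bit ones,
which might be why only the 64-bit solver-server run trips over it; I may
well be misreading the message, though.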
Any advice would be greatly appreciated. The matrix and the ia, ja arrays are
too big to upload, so if you need anything else to debug this, please let me
know.
- Machine type: HPC
- OS version and type: Linux houamd009 6.1.55-cggdb11-1 #1 SMP Fri Sep 29
  10:09:13 UTC 2023 x86_64 GNU/Linux
- PETSc version:
  #define PETSC_VERSION_RELEASE 1
  #define PETSC_VERSION_MAJOR 3
  #define PETSC_VERSION_MINOR 20
  #define PETSC_VERSION_SUBMINOR 4
  #define PETSC_RELEASE_DATE "Sep 28, 2023"
  #define PETSC_VERSION_DATE "Jan 29, 2024"
- MPI implementation: OpenMPI
- Compiler and version: GNU
Yuxiang Lin