Is this the first call to the MKL solver?

Any chance the matrix entries or the layout across ranks (different number of local rows) are different between the current run and the previous run?

Can you save the matrix with MatView() using the binary viewer and get it to us?
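For example, something along these lines (a minimal sketch; the matrix name A and the file name A.bin are just placeholders, and error checking is omitted):

  PetscViewer viewer;
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_WRITE, &viewer);  /* open a binary viewer for writing */
  MatView(A, viewer);                                                          /* dump the parallel matrix in PETSc binary format */
  PetscViewerDestroy(&viewer);

Then send us the resulting file.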

On Aug 11, 2022, at 1:57 PM, Gong Ding <gongding@cn.cogenda.com> wrote:
> On 2022/8/12 01:41, Barry Smith wrote:
>>
>> -with-mpi-dir=/usr/local/mpich-3.4.2/
>>
<blockquote type="cite" class="">#0 0x00007fede65066b3 in
MPIR_Barrier.part.0 () from
/usr/local/mpich/lib/libmpi.so.12<br class="">
#1 0x00007fede65075fe in PMPI_Barrier () from
/usr/local/mpich/lib/libmpi.so.12<br class="">
#2 0x00007fede4629409 in MKLMPI_Barrier () from
/opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so</blockquote>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">There seem to be three MPI floating around? The
libmkl_blacs_intelmpi_lp64.so seems to indicate it expects
and Intel MPI? The /usr/local/mpich/lib/libmpi.so.12 seems
to indicate that the MPI is MPICH installed at
/usr/local/mpich but the
-with-mpi-dir=/usr/local/mpich-3.4.2/ seems to indicate you
are building PETSc with yet a different MPI.<br class="">
</div>
</div>
</div>
</blockquote><p class="">I have a symbolic link /usr/local/mpich to
/usr/local/mpich-3.4.2/, its the same <br class="">
</p><p class=""><br class="">
</p>
<blockquote type="cite" cite="mid:F3252EC7-ECCF-4A79-A0B0-BDD7F3112999@petsc.dev" class="">
<div class="">
<div class="">
<div class=""><br class="">
</div>
<div class="">Given the libmkl_blacs_intelmpi_lp64.so I think you need
to start by figuring out how to do the configure using the
Intel MPI only.</div>
<div class=""><br class="">
</div>
<div class="">It is hanging because of </div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">#2 0x00007fede4629409 in
MKLMPI_Barrier () from
/opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so</blockquote>
<br class="">
</div>
<div class="">versus </div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">#2 0x00007f3537829532 in
MKLMPI_Bcast () from
/opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so</blockquote>
<br class="">
</div>
<div class="">since this is inside the MKL code I don't see why PETSc
would cause impossible configuration. Nor do I see why the
same code should run with earlier PETSc versions, are the
configure options and the resulting libraries used (MKL, MPI
etc) exactly the same for all PETSc versions you tested
with?</div>
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
>
> I have used exactly the same configuration from 3.12 to 3.17. The halt happens from 3.15 on.
>
> Anyway, thanks for pointing out that libmkl_blacs_intelmpi_lp64 may have caused the problem. I will try to debug it.
>
<blockquote type="cite" cite="mid:F3252EC7-ECCF-4A79-A0B0-BDD7F3112999@petsc.dev" class="">
<div class="">
<div class="">
<div class=""><br class="">
</div>
<blockquote type="cite" class="">
<div class="">On Aug 11, 2022, at 12:42 PM, Gong Ding <<a href="mailto:gongding@cn.cogenda.com" class="moz-txt-link-freetext" moz-do-not-send="true">gongding@cn.cogenda.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">The petsc-3.17.4 is configured on unbuntu
2022.04<br class="">
<br class="">
with gcc-9 and icc version 19.1.3.304 (gcc version 9.4.0
compatibility)<br class="">
<br class="">
Here is the config and make script<br class="">
<br class="">
>>> source /opt/intel/bin/iccvars.sh intel64
>>> export PETSC_ARCH=arch-linux2-c-opt
>>> export MPICH_CC=icc
>>> export MPICH_CXX=icpc
>>> export MPICH_F77=ifort
>>> export MPICH_F90=ifort
>>> python configure \
>>>   --with-blaslapack-include=/opt/intel/mkl/include \
>>>   --with-blaslapack-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" \
>>>   --with-blacs-include=/opt/intel/mkl/include \
>>>   --with-blacs-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" \
>>>   --with-scalapack-include=/opt/intel/mkl/include \
>>>   --with-scalapack-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" \
>>>   --with-mkl_pardiso-include=/opt/intel/mkl/include \
>>>   --with-mkl_pardiso-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" \
>>>   --with-mkl_cpardiso-include=/opt/intel/mkl/include \
>>>   --with-mkl_cpardiso-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" \
>>>   --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-ptscotch=1 \
>>>   --with-shared-libraries=1 --with-debugging=0 --with-x=0 \
>>>   --with-mpi-dir=/usr/local/mpich-3.4.2/ \
>>>   --download-superlu=1 --download-superlu_dist=1 --download-suitesparse=1 \
>>>   --with-vendor-compilers=intel \
>>>   COPTFLAGS="-O3 -mavx2" CXXOPTFLAGS="-O3 -mavx2" FOPTFLAGS="-O3 -mavx2" --force
>>> make
>>>
>>> CPARDISO is configured as:
>>>
>>> ierr = KSPSetType(ksp, (char*) KSPPREONLY); assert(!ierr);
>>> ierr = PCSetType(pc, (char*) PCLU); assert(!ierr);
>>> ierr = PCFactorSetMatSolverType(pc, "mkl_cpardiso"); assert(!ierr);
<br class="">
"-mat_mkl_cpardiso_2" "2" // "Fill-in reducing
ordering for the input matrix", 2 for nested dissection<br class="">
"-mat_mkl_cpardiso_8" "20" // "Iterative refinement
step"<br class="">
"-mat_mkl_cpardiso_13" "1" // "Improved accuracy
using (non-) symmetric weighted matching"<br class="">
<br class="">
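>>>
>>> (These options go through the PETSc options database. As a minimal sketch only, assuming one wanted to set them in code rather than on the command line, the equivalent would be something like:)
>>>
>>> ierr = PetscOptionsSetValue(NULL, "-mat_mkl_cpardiso_2",  "2");  assert(!ierr);  // nested dissection ordering
>>> ierr = PetscOptionsSetValue(NULL, "-mat_mkl_cpardiso_8",  "20"); assert(!ierr);  // iterative refinement steps
>>> ierr = PetscOptionsSetValue(NULL, "-mat_mkl_cpardiso_13", "1");  assert(!ierr);  // (non-) symmetric weighted matching
>>> ierr = KSPSetFromOptions(ksp);                                   assert(!ierr);  // let the solver pick the options up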
>>>
>>> When I run the linear solver in parallel with the -info argument, the last line the code prints is
>>>
>>> [0] PCSetUp(): Setting up PC for first time
>>>
>>> and then it runs into an endless loop.
>>>
>>> If I run the code with -n 2, with two processes 411465 and 411466, gdb attach shows:
>>>
>>> (gdb) attach 411465
>>> Attaching to process 411465
>>> [New LWP 411469]
>>> [New LWP 411470]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> 0x00007fede65066b3 in MPIR_Barrier.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> (gdb) bt
>>> #0  0x00007fede65066b3 in MPIR_Barrier.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> #1  0x00007fede65075fe in PMPI_Barrier () from /usr/local/mpich/lib/libmpi.so.12
>>> #2  0x00007fede4629409 in MKLMPI_Barrier () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>>> #3  0x00007fede0732b65 in mkl_pds_lp64_cpardiso_mpi_barrier () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #4  0x00007feddfdb1a62 in mkl_pds_lp64_cluster_sparse_solver () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #5  0x00007fede7d7f689 in MatFactorNumeric_MKL_CPARDISO () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #6  0x00007fede7912f2d in MatLUFactorNumeric () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #7  0x00007fede839da46 in PCSetUp_LU () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #8  0x00007fede846fd82 in PCSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #9  0x00007fede84b1e5a in KSPSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #10 0x00007fede84adc89 in KSPSolve_Private () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #11 0x00007fede84b66c0 in KSPSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #12 0x00007fede86174a3 in SNESSolve_NEWTONLS () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #13 0x00007fede85c62d8 in SNESSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>>
>>> (gdb) attach 411466
>>> Attaching to process 411466
>>> [New LWP 411467]
>>> [New LWP 411468]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> 0x00007f3539638583 in MPIR_Bcast.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> (gdb) bt
>>> #0  0x00007f3539638583 in MPIR_Bcast.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> #1  0x00007f3539639b90 in PMPI_Bcast () from /usr/local/mpich/lib/libmpi.so.12
>>> #2  0x00007f3537829532 in MKLMPI_Bcast () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>>> #3  0x00007f3532faa561 in mkl_pds_lp64_factorize_slave () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #4  0x00007f3532fb16be in mkl_pds_lp64_cluster_sparse_solver () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #5  0x00007f353aeac689 in MatFactorNumeric_MKL_CPARDISO () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #6  0x00007f353aa3ff2d in MatLUFactorNumeric () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #7  0x00007f353b4caa46 in PCSetUp_LU () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #8  0x00007f353b59cd82 in PCSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #9  0x00007f353b5dee5a in KSPSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #10 0x00007f353b5dac89 in KSPSolve_Private () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #11 0x00007f353b5e36c0 in KSPSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #12 0x00007f353b7444a3 in SNESSolve_NEWTONLS () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #13 0x00007f353b6f32d8 in SNESSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>>
>>> I hope the above information is enough.
>>>
>>> Gong Ding
>>>
>>> On 2022/8/11 23:00, Barry Smith wrote:
>>>>
<blockquote type="cite" class=""> What do you mean
halt? Does it hang, seemingly running forever with no
output, does it crash and print an error message
(please send the entire error message; cut and paste).
Is the matrix something you can share so we can try to
reproduce?<br class="">
<br class="">
Barry<br class="">
<br class="">
<br class="">
<blockquote type="cite" class="">On Aug 11, 2022, at
5:42 AM, Gong Ding <<a href="mailto:gongding@cn.cogenda.com" class="moz-txt-link-freetext" moz-do-not-send="true">gongding@cn.cogenda.com</a>>
wrote:<br class="">
<br class="">
Hi petsc developer,<br class="">
<br class="">
MKL (version 20200004) cpardiso halt on parallel
environment since petsc 3.15.<br class="">
<br class="">
we tested that 3.12 and 3.14 works but 3.15~3.17
(latest) halt.<br class="">
<br class="">
Dose anyone meet the same trouble?<br class="">
<br class="">
Gong Ding<br class="">
<br class="">
<br class="">
<br class="">
</blockquote>