[petsc-users] CPARDISO halt on petsc3.17

Barry Smith bsmith at petsc.dev
Thu Aug 11 13:04:18 CDT 2022


  Is this the first call to the MKL solver? 

  Any chance the matrix entries or the layout across ranks (a different number of local rows) differ between the current run and the previous run?

  Can you save the matrix with MatView() using the binary viewer and get it to us?
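
Once the file has been written (for example through a viewer opened with PetscViewerBinaryOpen() and passed to MatView()), its header can be checked without PETSc before sending it. The following is only a minimal sketch, assuming the documented PETSc binary Mat layout (big-endian integers: class id 1211216, then rows, columns, total nonzeros) and a build with 32-bit indices. The function name read_petsc_mat_header is hypothetical and not part of PETSc.

```c
#include <stdint.h>
#include <stdio.h>

/* Class id PETSc writes at the start of a binary Mat file (MAT_FILE_CLASSID). */
#define PETSC_MAT_FILE_CLASSID 1211216

/* Decode one big-endian 32-bit integer, the byte order PETSc binary files use. */
static int32_t read_be32(const unsigned char b[4])
{
  uint32_t u = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
  return (int32_t)u;
}

/* Hypothetical helper: read the four-integer header of a binary Mat file.
 * Assumes 32-bit PetscInt; a 64-bit-index build writes wider integers.
 * Returns 0 on success, -1 on an I/O error or short file,
 * and -2 if the file does not start with the Mat class id. */
int read_petsc_mat_header(const char *path, int32_t *rows, int32_t *cols, int32_t *nnz)
{
  unsigned char buf[16];
  size_t        got;
  FILE         *fp = fopen(path, "rb");

  if (!fp) return -1;
  got = fread(buf, 1, sizeof buf, fp);
  fclose(fp);
  if (got != sizeof buf) return -1;

  if (read_be32(buf) != PETSC_MAT_FILE_CLASSID) return -2;
  *rows = read_be32(buf + 4);
  *cols = read_be32(buf + 8);
  *nnz  = read_be32(buf + 12);
  return 0;
}
```

If the reported sizes match what the application assembled, the file is probably the matrix intended; a return value of -2 suggests something other than a Mat was written to that file.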



> On Aug 11, 2022, at 1:57 PM, Gong Ding <gongding at cn.cogenda.com> wrote:
> 
> On 2022/8/12 01:41, Barry Smith wrote:
>> 
>> -with-mpi-dir=/usr/local/mpich-3.4.2/ 
>> 
>> 
>>> #0  0x00007fede65066b3 in MPIR_Barrier.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> #1  0x00007fede65075fe in PMPI_Barrier () from /usr/local/mpich/lib/libmpi.so.12
>>> #2  0x00007fede4629409 in MKLMPI_Barrier () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>> 
>> 
>> There seem to be three MPIs floating around. The libmkl_blacs_intelmpi_lp64.so seems to indicate that it expects an Intel MPI. The /usr/local/mpich/lib/libmpi.so.12 seems to indicate that the MPI is MPICH installed at /usr/local/mpich, but the -with-mpi-dir=/usr/local/mpich-3.4.2/ seems to indicate that you are building PETSc with yet a different MPI.
> I have a symbolic link from /usr/local/mpich to /usr/local/mpich-3.4.2/, so it's the same installation.
> 
> 
> 
>> 
>> Given the libmkl_blacs_intelmpi_lp64.so I think you need to start by figuring out how to do the configure using the Intel MPI only.
>> 
>> It is hanging because of 
>> 
>>> #2  0x00007fede4629409 in MKLMPI_Barrier () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>> 
>> versus 
>> 
>>> #2  0x00007f3537829532 in MKLMPI_Bcast () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>> 
>> Since this is inside the MKL code, I don't see how PETSc could cause this impossible state. Nor do I see why the same code should run with earlier PETSc versions. Are the configure options and the resulting libraries used (MKL, MPI, etc.) exactly the same for all the PETSc versions you tested with?
>> 
> I have used exactly the same configuration from 3.12 to 3.17. The hang has occurred since 3.15.
> Anyway, thanks for pointing out that libmkl_blacs_intelmpi_lp64 may have caused the problem. I will try to debug it.
> 
> 
> 
>> 
>>> On Aug 11, 2022, at 12:42 PM, Gong Ding <gongding at cn.cogenda.com> wrote:
>>> 
>>> petsc-3.17.4 is configured on Ubuntu 2022.04
>>> 
>>> with gcc-9 and icc version 19.1.3.304 (gcc version 9.4.0 compatibility)
>>> 
>>> Here is the config and make script
>>> 
>>> source /opt/intel/bin/iccvars.sh intel64
>>> export PETSC_ARCH=arch-linux2-c-opt
>>> export MPICH_CC=icc
>>> export MPICH_CXX=icpc
>>> export MPICH_F77=ifort
>>> export MPICH_F90=ifort
>>> python configure  --with-blaslapack-include=/opt/intel/mkl/include --with-blaslapack-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" --with-blacs-include=/opt/intel/mkl/include --with-blacs-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" --with-scalapack-include=/opt/intel/mkl/include --with-scalapack-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" --with-mkl_pardiso-include=/opt/intel/mkl/include --with-mkl_pardiso-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" --with-mkl_cpardiso-include=/opt/intel/mkl/include --with-mkl_cpardiso-lib="-L/opt/intel/mkl/lib/intel64/ -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl" --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-ptscotch=1 --with-shared-libraries=1 --with-debugging=0 --with-x=0 --with-mpi-dir=/usr/local/mpich-3.4.2/ --download-superlu=1 --download-superlu_dist=1 --download-suitesparse=1   --with-vendor-compilers=intel COPTFLAGS="-O3 -mavx2" CXXOPTFLAGS="-O3 -mavx2" FOPTFLAGS="-O3 -mavx2" --force
>>> make
>>> 
>>> 
>>> The CPARDISO is configured as
>>> 
>>> ierr = KSPSetType (ksp, (char*) KSPPREONLY); assert(!ierr);
>>> ierr = PCSetType  (pc, (char*) PCLU); assert(!ierr);
>>> ierr = PCFactorSetMatSolverType (pc, "mkl_cpardiso"); assert(!ierr);
>>> 
>>>  "-mat_mkl_cpardiso_2"  "2"     // "Fill-in reducing ordering for the input matrix", 2 for nested dissection
>>>  "-mat_mkl_cpardiso_8"  "20"  // "Iterative refinement step"
>>>  "-mat_mkl_cpardiso_13"  "1"  // "Improved accuracy using (non-) symmetric weighted matching"
>>> 
>>> 
>>> When the linear solver is run in parallel with the -info argument, the last line the code prints is
>>> 
>>> [0] PCSetUp(): Setting up PC for first time
>>> 
>>> and then it hangs, apparently in an endless loop.
>>> 
>>> Running the code with -n 2 gives two processes, 411465 and 411466. Attaching gdb to each of them shows
>>> 
>>> (gdb) attach 411465
>>> Attaching to process 411465
>>> [New LWP 411469]
>>> [New LWP 411470]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> 0x00007fede65066b3 in MPIR_Barrier.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> (gdb) bt
>>> #0  0x00007fede65066b3 in MPIR_Barrier.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> #1  0x00007fede65075fe in PMPI_Barrier () from /usr/local/mpich/lib/libmpi.so.12
>>> #2  0x00007fede4629409 in MKLMPI_Barrier () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>>> #3  0x00007fede0732b65 in mkl_pds_lp64_cpardiso_mpi_barrier () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #4  0x00007feddfdb1a62 in mkl_pds_lp64_cluster_sparse_solver () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #5  0x00007fede7d7f689 in MatFactorNumeric_MKL_CPARDISO () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #6  0x00007fede7912f2d in MatLUFactorNumeric () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #7  0x00007fede839da46 in PCSetUp_LU () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #8  0x00007fede846fd82 in PCSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #9  0x00007fede84b1e5a in KSPSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #10 0x00007fede84adc89 in KSPSolve_Private () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #11 0x00007fede84b66c0 in KSPSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #12 0x00007fede86174a3 in SNESSolve_NEWTONLS () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #13 0x00007fede85c62d8 in SNESSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> 
>>> 
>>> (gdb) attach 411466
>>> Attaching to process 411466
>>> [New LWP 411467]
>>> [New LWP 411468]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> 0x00007f3539638583 in MPIR_Bcast.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> (gdb) bt
>>> #0  0x00007f3539638583 in MPIR_Bcast.part.0 () from /usr/local/mpich/lib/libmpi.so.12
>>> #1  0x00007f3539639b90 in PMPI_Bcast () from /usr/local/mpich/lib/libmpi.so.12
>>> #2  0x00007f3537829532 in MKLMPI_Bcast () from /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so
>>> #3  0x00007f3532faa561 in mkl_pds_lp64_factorize_slave () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #4  0x00007f3532fb16be in mkl_pds_lp64_cluster_sparse_solver () from /opt/intel/mkl/lib/intel64/libmkl_core.so
>>> #5  0x00007f353aeac689 in MatFactorNumeric_MKL_CPARDISO () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #6  0x00007f353aa3ff2d in MatLUFactorNumeric () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #7  0x00007f353b4caa46 in PCSetUp_LU () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #8  0x00007f353b59cd82 in PCSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #9  0x00007f353b5dee5a in KSPSetUp () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #10 0x00007f353b5dac89 in KSPSolve_Private () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #11 0x00007f353b5e36c0 in KSPSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #12 0x00007f353b7444a3 in SNESSolve_NEWTONLS () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> #13 0x00007f353b6f32d8 in SNESSolve () from /usr/local/petsc-3.17.4/arch-linux2-c-opt/lib/libpetsc.so.3.17
>>> 
>>> 
>>> I hope the above information is enough.
>>> 
>>> 
>>> Gong Ding
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 2022/8/11 23:00, Barry Smith wrote:
>>>>   What do you mean by halt? Does it hang, seemingly running forever with no output, or does it crash and print an error message (please send the entire error message; cut and paste)? Is the matrix something you can share so we can try to reproduce?
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>>> On Aug 11, 2022, at 5:42 AM, Gong Ding <gongding at cn.cogenda.com> wrote:
>>>>> 
>>>>> Hi PETSc developers,
>>>>> 
>>>>> MKL (version 20200004) CPARDISO has halted in a parallel environment since PETSc 3.15.
>>>>> 
>>>>> We tested that 3.12 and 3.14 work, but 3.15 through 3.17 (the latest) halt.
>>>>> 
>>>>> Has anyone met the same trouble?
>>>>> 
>>>>> Gong Ding
>>>>> 
>>>>> 
>>>>> 
>> 
