[petsc-users] MPI_AllReduce error with -xcore-avx2 flags

Jose E. Roman jroman at dsic.upv.es
Thu Jan 28 04:18:25 CST 2016


> On 28 Jan 2016, at 9:13, Bikash Kanungo <bikash at umich.edu> wrote:
> 
> Hi Jose,
> 
> Here is the complete error message:
> 
> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [0]PETSC ERROR: Invalid argument
> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
> [0]PETSC ERROR: Unknown Name on a intel-openmpi_ib named comet-03-60.sdsc.edu by bikashk Thu Jan 28 00:09:17 2016
> [0]PETSC ERROR: Configure options CFLAGS="-fPIC -xcore-avx2" FFLAGS="-fPIC -xcore-avx2" CXXFLAGS="-fPIC -xcore-avx2" --prefix=/opt/petsc/intel/openmpi_ib --with-mpi=true --download-pastix=../pastix_5.2.2.12.tar.bz2 --download-ptscotch=../scotch_6.0.0_esmumps.tar.gz --with-blas-lib="-Wl,--start-group /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a            /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a            -Wl,--end-group -lpthread -lm" --with-lapack-lib="-Wl,--start-group /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a            /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a            -Wl,--end-group -lpthread -lm" --with-superlu_dist-include=/opt/superlu/intel/openmpi_ib/include --with-superlu_dist-lib="-L/opt/superlu/intel/openmpi_ib/lib -lsuperlu" --with-parmetis-dir=/opt/parmetis/intel/openmpi_ib --with-metis-dir=/opt/parmetis/intel/openmpi_ib --with-mpi-dir=/opt/openmpi/intel/ib --with-scalapack-dir=/opt/scalapack/intel/openmpi_ib --download-mumps=../MUMPS_4.10.0-p3.tar.gz --download-blacs=../blacs-dev.tar.gz --download-fblaslapack=../fblaslapack-3.4.2.tar.gz --with-pic=true --with-shared-libraries=1 --with-hdf5=true --with-hdf5-dir=/opt/hdf5/intel/openmpi_ib --with-debugging=false
> [0]PETSC ERROR: #1 BVScaleColumn() line 380 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvops.c
> [0]PETSC ERROR: #2 BVOrthogonalize_GS() line 474 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> [0]PETSC ERROR: #3 BVOrthogonalize() line 535 in /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> [comet-03-60:27927] *** Process received signal ***
> [comet-03-60:27927] Signal: Aborted (6)
> 
> 

Here are some comments:
- This kind of error appears only in debugging mode. I don't know why you are getting it, since you configured with --with-debugging=false.
- The flag -xcore-avx2 enables fused multiply-add (FMA) instructions, which round once instead of twice, so you get slightly more accurate floating-point results. This could explain why you get different behaviour with and without this flag.
- The argument of BVScaleColumn() is guaranteed to be the same in all processes, so the only explanation is that it has become a NaN. [Note that in petsc-master (and hence petsc-3.7) NaNs no longer trigger this error.]
- My conclusion is that the column vectors of your BV object are not linearly independent, so eventually the vector norm is (almost) zero. The error will appear only if the computed value is exactly zero.
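
To illustrate the FMA point above, here is a small Python sketch (not PETSc or SLEPc code). The helper fused_multiply_add is hypothetical: it emulates an FMA by computing a*b + c exactly with rationals and rounding once. The inputs are chosen so that the separate multiply and add give exactly zero while the fused operation does not; which way the difference goes in a real orthogonalization depends on the data.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate an FMA (hypothetical helper): exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Both factors are exactly representable as doubles.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Two roundings: a*b = 1 - 2**-60 rounds to 1.0, then 1.0 + c is 0.0.
unfused = a * b + c

# One rounding: the exact result -2**-60 is representable, so it survives.
fused = fused_multiply_add(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

The unfused result is exactly zero and the fused one is exactly -2**-60, so a test such as "is this value exactly zero?" can change its answer when FMA is enabled.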

In summary: BVOrthogonalize() is new in SLEPc and is not yet well tested. In particular, linearly dependent vectors are not handled well. For the next release I will add code to take rank-deficient BVs into account. In the meantime, you may want to try running with '-bv_orthog_block chol' (it uses a different orthogonalization algorithm).
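
The failure mode with linearly dependent columns can be sketched in a few lines of Python. This is a plain classical Gram-Schmidt without any rank check, an assumption made for illustration and not the SLEPc implementation; gram_schmidt and ieee_reciprocal are hypothetical names. The second column is a multiple of the first, so its residual norm is exactly zero and the scaling step produces NaNs.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def ieee_reciprocal(x):
    # Python raises ZeroDivisionError for 1.0 / 0.0, whereas C with
    # IEEE 754 arithmetic returns inf. Mimic the IEEE behaviour here.
    return math.copysign(math.inf, x) if x == 0.0 else 1.0 / x


def gram_schmidt(columns):
    """Classical Gram-Schmidt with no rank check (illustrative only)."""
    basis = []
    for v in columns:
        w = list(v)
        for q in basis:
            r = dot(q, v)
            w = [wi - r * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        alpha = ieee_reciprocal(norm)  # inf when the column is dependent
        basis.append([alpha * wi for wi in w])  # inf * 0.0 gives nan
    return basis


cols = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]  # second column = 2 * first
q = gram_schmidt(cols)
print(q[0])
print(q[1])
```

A rank-aware variant would compare the norm against a tolerance before scaling and skip or flag the dependent column instead.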

Jose


