[petsc-users] MPI_AllReduce error with -xcore-avx2 flags

Bikash Kanungo bikash at umich.edu
Thu Jan 28 04:45:44 CST 2016


Yeah, I suspected linear dependence. But I was puzzled that the error
occurred on one machine and not the other. Even on the machine where it
failed, it failed for some runs and passed for others. So it suggests
that the vector norm is almost zero in certain cases (i.e., the runs that
survive) and exactly zero in others (i.e., the runs that fail). I'll use
-bv_orthog_block chol to see if the error persists.
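
To illustrate the reasoning above, here is a minimal sketch in plain Python
(not SLEPc code; the helper names `dot` and `orthogonalize` are made up for
this example). It runs Gram-Schmidt on three columns where the third is a
linear combination of the first two, so its norm after projection is exactly
zero. The last lines show why a NaN scale factor would trip a "same value on
all processes" check, assuming that check is an equality comparison.

```python
import math

def dot(x, y):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))

def orthogonalize(columns):
    """Gram-Schmidt sketch: returns the orthonormal basis and the norm
    each column had after projecting out the previous basis vectors."""
    basis, norms = [], []
    for v in columns:
        w = list(v)
        for q in basis:
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        nrm = math.sqrt(dot(w, w))
        norms.append(nrm)
        if nrm > 0.0:
            # Only normalize when the column is not (numerically) dependent.
            basis.append([wi / nrm for wi in w])
    return basis, norms

# The third column equals 2*col1 + 3*col2, so it is linearly dependent.
cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
basis, norms = orthogonalize(cols)
print(norms)

# NaN never compares equal to itself, so an equality-based consistency
# check rejects it even when every process holds the same NaN.
nan = float("nan")
same_everywhere = (nan == nan)
print(same_everywhere)
```

With inexact data the third norm would typically come out tiny rather than
exactly zero, which matches the observation that only some runs fail.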

Thanks a ton, Jose.

Regards,
Bikash

On Thu, Jan 28, 2016 at 5:18 AM, Jose E. Roman <jroman at dsic.upv.es> wrote:

>
> > On 28 Jan 2016, at 9:13, Bikash Kanungo <bikash at umich.edu> wrote:
> >
> > Hi Jose,
> >
> > Here is the complete error message:
> >
> > [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> > [0]PETSC ERROR: Invalid argument
> > [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
> > [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> > [0]PETSC ERROR: Petsc Release Version 3.5.2, Sep, 08, 2014
> > [0]PETSC ERROR: Unknown Name on a intel-openmpi_ib named
> comet-03-60.sdsc.edu by bikashk Thu Jan 28 00:09:17 2016
> > [0]PETSC ERROR: Configure options CFLAGS="-fPIC -xcore-avx2"
> FFLAGS="-fPIC -xcore-avx2" CXXFLAGS="-fPIC -xcore-avx2"
> --prefix=/opt/petsc/intel/openmpi_ib --with-mpi=true
> --download-pastix=../pastix_5.2.2.12.tar.bz2
> --download-ptscotch=../scotch_6.0.0_esmumps.tar.gz
> --with-blas-lib="-Wl,--start-group
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a
>
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a
>     -Wl,--end-group -lpthread -lm" --with-lapack-lib="-Wl,--start-group
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_intel_lp64.a
>
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_sequential.a
> /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_core.a
>     -Wl,--end-group -lpthread -lm"
> --with-superlu_dist-include=/opt/superlu/intel/openmpi_ib/include
> --with-superlu_dist-lib="-L/opt/superlu/intel/openmpi_ib/lib -lsuperlu"
> --with-parmetis-dir=/opt/parmetis/intel/openmpi_ib
> --with-metis-dir=/opt/parmetis/intel/openmpi_ib
> --with-mpi-dir=/opt/openmpi/intel/ib
> --with-scalapack-dir=/opt/scalapack/intel/openmpi_ib
> --download-mumps=../MUMPS_4.10.0-p3.tar.gz
> --download-blacs=../blacs-dev.tar.gz
> --download-fblaslapack=../fblaslapack-3.4.2.tar.gz --with-pic=true
> --with-shared-libraries=1 --with-hdf5=true
> --with-hdf5-dir=/opt/hdf5/intel/openmpi_ib --with-debugging=false
> > [0]PETSC ERROR: #1 BVScaleColumn() line 380 in
> /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvops.c
> > [0]PETSC ERROR: #2 BVOrthogonalize_GS() line 474 in
> /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> > [0]PETSC ERROR: #3 BVOrthogonalize() line 535 in
> /scratch/build/git/math-roll/BUILD/sdsc-slepc_intel_openmpi_ib-3.5.3/slepc-3.5.3/src/sys/classes/bv/interface/bvorthog.c
> > [comet-03-60:27927] *** Process received signal ***
> > [comet-03-60:27927] Signal: Aborted (6)
> >
> >
>
> Here are some comments:
> - This kind of error appears only in debugging mode. I don't know why you
> are getting it, since you have --with-debugging=false
> - The flag -xcore-avx2 enables fused multiply-add (FMA) instructions,
> which means you get slightly more accurate floating-point results. This
> could explain why you get different behaviour with/without this flag.
> - The argument of BVScaleColumn() is guaranteed to be the same in all
> processes, so the only explanation is that it has become a NaN. [Note that
> in petsc-master (and hence petsc-3.7) NaN's no longer trigger this error.]
> - My conclusion is that your column vectors of the BV object are not
> linearly independent, so eventually the vector norm is (almost) zero. The
> error will appear only if the computed value is exactly zero.
>
> In summary: BVOrthogonalize() is new in SLEPc, and it is not very well
> tested. In particular, linearly dependent vectors are not handled well. For
> the next release I will add code to take rank-deficient BV's into account.
> In the meantime, you may want to try running with '-bv_orthog_block chol'
> (it uses a different orthogonalization algorithm).
>
> Jose
>
>
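
The point about FMA in the quoted reply can be illustrated with a small
sketch. It assumes nothing about the Intel compiler or SLEPc: the fused
operation is emulated with exact rational arithmetic (`fractions.Fraction`)
followed by a single rounding, and the function names `unfused` and `fused`
are made up for this example. The inputs are chosen so that the two-step
computation gives exactly zero while the single-rounding result is tiny but
nonzero, the same "exactly zero versus almost zero" distinction discussed in
this thread.

```python
from fractions import Fraction

def unfused(a, b, c):
    """a*b is rounded to double, then the sum is rounded again."""
    return a * b + c

def fused(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

eps = 2.0 ** -27
a, b, c = 1.0 + eps, 1.0 - eps, -1.0

# Exact product is 1 - 2**-54, which rounds to 1.0 in double precision,
# so the unfused result cancels to zero.
r_unfused = unfused(a, b, c)
# With one rounding the small term -2**-54 survives.
r_fused = fused(a, b, c)

print(r_unfused, r_fused)
```

Whether a real build hits such a case depends on the data and on which
operations the compiler actually fuses, so this only shows that the two
modes can disagree in the last bits.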


-- 
Bikash S. Kanungo
PhD Student
Computational Materials Physics Group
Mechanical Engineering
University of Michigan