[petsc-users] How to understand these error messages

Barry Smith bsmith at mcs.anl.gov
Sun Jun 23 23:27:55 CDT 2013


   Fande,

   We've seen trouble before with IBM MPI on large Intel systems at scale.

    From the previous configure.log you sent I see  

sh: mpicc -show
Executing: mpicc -show
sh: /ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/bin/intel64/icc   -I/glade/apps/el6/include  -I/glade/apps/el6/usr/include  -I/glade/apps/opt/netcdf/4.2/intel/default/include  -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64  -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/ia32  -L/glade/apps/el6/usr/lib  -L/glade/apps/el6/usr/lib64  -Wl,-rpath,/glade/apps/el6/usr/lib  -Wl,-rpath,/glade/apps/el6/usr/lib64  -L/glade/apps/opt/netcdf/4.2/intel/default/lib  -lnetcdf_c++4  -lnetcdff  -lnetcdf  -Wl,-rpath,/glade/apps/opt/netcdf/4.2/intel/default/lib  -m64 -D__64BIT__ -Wl,--allow-shlib-undefined -Wl,--enable-new-dtags -Wl,-rpath,/opt/ibmhpc/pe1209/mpich2/intel/lib64 -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -I/opt/ibmhpc/pe1209/mpich2/intel/include64 -I/opt/ibmhpc/pe1209/base/include -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi -ldl -L/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -lirc -lpthread -lrt 

  Note the -I/opt/ibmhpc/pe1209/base/include -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi, which is probably an IBM hack job of some ancient MPICH2.

  Now the page http://www2.cisl.ucar.edu/resources/yellowstone/software/modules-intel-dependent  has the modules


impi/4.0.3.008	This module loads the Intel MPI Library. See http://software.intel.com/en-us/intel-mpi-library/ for details.
impi/4.1.0.030	This module loads the Intel MPI Library. See http://software.intel.com/en-us/intel-mpi-library/ for details.

  Perhaps you could load one of those modules with the Intel compilers and avoid the IBM MPI? If that solves the problem, then we know the IBM MPI is to blame. We are interested in working with you to determine the problem.

   Barry




On Jun 23, 2013, at 9:14 PM, Fande Kong <fd.kong at siat.ac.cn> wrote:

> Thanks Barry,
> Thanks Jed,
> 
> The computer I am using is Yellowstone (http://en.wikipedia.org/wiki/Yellowstone_(supercomputer), or http://www2.cisl.ucar.edu/resources/yellowstone). The compiler is the Intel compiler. The MPI is IBM MPI, which is part of IBM PE.
> 
> With fewer unknowns (about 5 \times 10^7), the code runs correctly. With more unknowns (4 \times 10^8), the code produces the error messages. But with that many unknowns (4 \times 10^8), the code can also run correctly on fewer cores. This is very strange.
> 
> When I switch to the GNU compiler, I cannot install PETSc; I get the following errors:
> 
> *******************************************************************************
>          UNABLE to CONFIGURE with GIVEN OPTIONS    (see configure.log for details):
> -------------------------------------------------------------------------------
> Downloaded exodusii could not be used. Please check install in /glade/p/work/fandek/petsc/arch-linux2-cxx-opt_gnu
> *******************************************************************************
>   File "./config/configure.py", line 293, in petsc_configure
>     framework.configure(out = sys.stdout)
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/framework.py", line 933, in configure
>     child.configure()
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 556, in configure
>     self.executeTest(self.configureLibrary)
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/base.py", line 115, in executeTest
>     ret = apply(test, args,kargs)
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/packages/exodusii.py", line 36, in configureLibrary
>     config.package.Package.configureLibrary(self)
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 484, in configureLibrary
>     for location, directory, lib, incl in self.generateGuesses():
>   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 238, in generateGuesses
>     raise RuntimeError('Downloaded '+self.package+' could not be used. Please check install in '+d+'\n')
> 
> 
> The configure.log is attached.
>        
> Regards,
> On Mon, Jun 24, 2013 at 1:03 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> Barry Smith <bsmith at mcs.anl.gov> writes:
> 
> >    What kind of computer system are you running? What MPI does it use? These values are nonsense MPI_SOURCE=-32766 MPI_TAG=-32766
> 
> From configure.log, this is Intel MPI.  Can you ask their support what
> this error condition is supposed to mean?  It's not clear to me that
> MPI_SOURCE or MPI_TAG contain any meaningful information (though it
> could be indicative of an internal overflow), but this value of
> MPI_ERROR should mean something.
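> 
> A minimal standalone sketch (not PETSc code; the literal error-code value is simply the one copied from the trace above) of asking the MPI library itself what such an MPI_ERROR value means, via MPI_Error_class and MPI_Error_string:
> 
>   #include <mpi.h>
>   #include <stdio.h>
> 
>   int main(int argc, char **argv)
>   {
>     char msg[MPI_MAX_ERROR_STRING];
>     int  len = 0, errclass = 0;
>     int  errcode = 20613892;   /* value reported in recv_status.MPI_ERROR above */
> 
>     MPI_Init(&argc, &argv);
>     /* Return error codes instead of aborting, in case errcode is not a valid code. */
>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>     if (MPI_Error_class(errcode, &errclass) == MPI_SUCCESS) printf("error class:  %d\n", errclass);
>     if (MPI_Error_string(errcode, msg, &len) == MPI_SUCCESS) printf("error string: %.*s\n", len, msg);
>     MPI_Finalize();
>     return 0;
>   }
> 
> Compiled with the same mpicc wrapper, this reports what the MPI implementation in use thinks that code means, or that it is not a valid error code at all, which would be informative in its own right.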
> 
> >     Is it possible to run the code with valgrind?
> >
> >     Any chance of running the code with a different compiler?
> >
> >    Barry
> >
> >
> >
> > On Jun 23, 2013, at 4:12 AM, Fande Kong <fd.kong at siat.ac.cn> wrote:
> >
> >> Thanks Jed,
> >>
> >> I added your code into PETSc and ran my code with 10240 cores. I got the following error messages:
> >>
> >> [6724]PETSC ERROR: --------------------- Error Message ------------------------------------
> >> [6724]PETSC ERROR: Petsc has generated inconsistent data!
> >> [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> >> [6724]PETSC ERROR: See docs/changes/index.html for recent updates.
> >> [6724]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> >> [6724]PETSC ERROR: See docs/index.html for manual pages.
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug named ys4350 by fandek Sun Jun 23 02:58:23 2013
> >> [6724]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
> >> [6724]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
> >> [6724]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=yes
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
> >> [6724]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
> >> [6724]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
> >> [6724]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> >> [6724]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> >> [6724]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> >> [6724]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
> >> [6724]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
> >> Abort(77) on node 6724 (rank 6724 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 6724
> >> [2921]PETSC ERROR: --------------------- Error Message ------------------------------------
> >> [2921]PETSC ERROR: Petsc has generated inconsistent data!
> >> [2921]PETSC ERROR: Negative MPI source: stash->nrecvs=15 i=3 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=3825270!
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> >> [2921]PETSC ERROR: See docs/changes/index.html for recent updates.
> >> [2921]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> >> [2921]PETSC ERROR: See docs/index.html for manual pages.
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug named ys0270 by fandek Sun Jun 23 02:58:23 2013
> >> [2921]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
> >> [2921]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
> >> [2921]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=yes
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
> >> [2921]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
> >> [2921]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
> >> [2921]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> >> [2921]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> >> [2921]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> >> [2921]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
> >> [2921]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
> >> :
> >>
> >> On Fri, Jun 21, 2013 at 4:33 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> >> Fande Kong <fd.kong at siat.ac.cn> writes:
> >>
> >> > The code works well with fewer cores. It also works well with
> >> > petsc-3.3-p7, but it does not work with petsc-3.4.1. Thus, if you can check
> >> > the differences between petsc-3.3-p7 and petsc-3.4.1, you can figure out
> >> > the reason.
> >>
> >> That is one way to start debugging, but there are no changes to the core
> >> MatStash code, and many, many changes to PETSc in total.  The relevant
> >> snippet of code is here:
> >>
> >>     if (stash->reproduce) {
> >>       i    = stash->reproduce_count++;
> >>       ierr = MPI_Wait(stash->recv_waits+i,&recv_status);CHKERRQ(ierr);
> >>     } else {
> >>       ierr = MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);CHKERRQ(ierr);
> >>     }
> >>     if (recv_status.MPI_SOURCE < 0) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source!");
> >>
> >> So MPI returns correctly (stash->reproduce will be FALSE unless you
> >> changed it).  You could change the line above to the following:
> >>
> >>   if (recv_status.MPI_SOURCE < 0) SETERRQ5(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source: stash->nrecvs=%D i=%d MPI_SOURCE=%d MPI_TAG=%d MPI_ERROR=%d",
> >>                                           stash->nrecvs,i,recv_status.MPI_SOURCE,recv_status.MPI_TAG,recv_status.MPI_ERROR);
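> >>
> >> A possible further tweak (just a sketch, not code that is in PETSc): translate recv_status.MPI_ERROR into text at the failure site as well, so the error message is self-describing:
> >>
> >>   if (recv_status.MPI_SOURCE < 0) {
> >>     char errstr[MPI_MAX_ERROR_STRING] = "unknown";
> >>     int  errlen = 0;
> >>     /* This may itself fail if MPI_ERROR holds garbage, which is informative either way. */
> >>     MPI_Error_string(recv_status.MPI_ERROR,errstr,&errlen);
> >>     SETERRQ6(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source: stash->nrecvs=%D i=%d MPI_SOURCE=%d MPI_TAG=%d MPI_ERROR=%d (%s)",
> >>              stash->nrecvs,i,recv_status.MPI_SOURCE,recv_status.MPI_TAG,recv_status.MPI_ERROR,errstr);
> >>   }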
> >>
> >>
> >> It would help to debug with --with-debugging=1, so that more checks for
> >> corrupt data are performed.  You can still make the compiler optimize if
> >> it takes a long time to reach the error condition.
> >>
> >>
> >>
> >> --
> >> Fande Kong
> >> ShenZhen Institutes of Advanced Technology
> >> Chinese Academy of Sciences
> 
> 
> 
> -- 
> Fande Kong
> ShenZhen Institutes of Advanced Technology
> Chinese Academy of Sciences
> <configure.zip>


