[petsc-users] How to understand these error messages

Fande Kong fd.kong at siat.ac.cn
Mon Jun 24 21:43:10 CDT 2013


Hi Barry,

How can we use valgrind to debug a parallel program on a supercomputer with
many cores? If we follow the instruction "mpiexec -n NPROC valgrind
--tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p
PETSCPROGRAMNAME -malloc off PROGRAMOPTIONS", then for 10000 cores, 10000 log
files will be written. Maybe we need to put all the information into a single
file. How can we do this?
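
For example, would it be reasonable to simply merge the per-process logs after
the run? Just as a sketch, assuming all the valgrind.log.* files end up in the
working directory:

    # merge every per-rank log into one file
    cat valgrind.log.* > valgrind.log.all

    # or keep only the logs that actually report problems
    grep -l -e "Invalid" -e "uninitialised" valgrind.log.* | xargs -r cat > valgrind.log.errors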

On Mon, Jun 24, 2013 at 9:33 PM, Peter Lichtner <peter.lichtner at gmail.com> wrote:

> Just in case this helps: I use Yellowstone to run PFLOTRAN with both the
> gcc and Intel compilers, using the developer version of PETSc. For Intel, my
> configure script reads:
>
> ./config/configure.py --with-cc=mpicc --with-fc=mpif90 --with-cxx=mpicxx
> --with-clanguage=c --with-blas-lapack-dir=$BLAS_LAPACK_LIB_DIR
> --with-shared-libraries=0 --with-debugging=0 --download-hdf5=yes
> --download-parmetis=yes --download-metis=yes
>
> echo $BLAS_LAPACK_LIB_DIR
> /ncar/opt/intel/12.1.0.233/composer_xe_2013.1.117/mkl
>
> module load cmake/2.8.10.2
>
> Intel was a little faster compared to gcc.
>
> ...Peter
>
> On Jun 24, 2013, at 1:53 AM, Fande Kong <fd.kong at siat.ac.cn> wrote:
>
> Hi Barry,
>
> I switched to the GNU compiler and got similar results:
>
>
> [330]PETSC ERROR: --------------------- Error Message
> ------------------------------------
> [330]PETSC ERROR: Petsc has generated inconsistent data!
> [330]PETSC ERROR: Negative MPI source: stash->nrecvs=27 i=33
> MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=5243744!
> [330]PETSC ERROR:
> ------------------------------------------------------------------------
> [330]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> [330]PETSC ERROR: See docs/changes/index.html for recent updates.
> [330]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [330]PETSC ERROR: See docs/index.html for manual pages.
> [330]PETSC ERROR:
> ------------------------------------------------------------------------
> [330]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt_gnu named
> ys0554 by fandek Mon Jun 24 01:42:37 2013
> [330]PETSC ERROR: Libraries linked from
> /glade/p/work/fandek/petsc/arch-linux2-cxx-opt_gnu/lib
> [330]PETSC ERROR: Configure run at Mon Jun 24 00:34:40 2013
> [330]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx
> --with-shared-libraries=1 --with-dynamic-loading=1
> --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1
> --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1
> --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1
> --with-debugging=no
> [330]PETSC ERROR:
> ------------------------------------------------------------------------
> [330]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
> [330]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
> [330]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
> [330]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> [330]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in
> meshreorder.cpp
> [330]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> [330]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
> [330]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
>
>
>
> Thus, I think that it has nothing to do with the compiler.
>
>
> On Sun, Jun 23, 2013 at 11:45 PM, Fande Kong <fd.kong at siat.ac.cn> wrote:
>
>> Thanks Barry,
>>
>> I will try impi.
>>
>> I have another question. In the previous email, you asked if I could switch
>> to another compiler. Why do I need to change the compiler?
>>
>>
>> On Mon, Jun 24, 2013 at 12:27 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>>    Fande,
>>>
>>>    We've seen trouble before with IBM on large intel systems at scale.
>>>
>>>     From the previous configure.log you sent I see
>>>
>>> sh: mpicc -show
>>> Executing: mpicc -show
>>> sh: /ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/bin/intel64/icc
>>>   -I/glade/apps/el6/include -I/glade/apps/el6/usr/include
>>>   -I/glade/apps/opt/netcdf/4.2/intel/default/include
>>>   -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64
>>>   -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/ia32
>>>   -L/glade/apps/el6/usr/lib -L/glade/apps/el6/usr/lib64
>>>   -Wl,-rpath,/glade/apps/el6/usr/lib -Wl,-rpath,/glade/apps/el6/usr/lib64
>>>   -L/glade/apps/opt/netcdf/4.2/intel/default/lib -lnetcdf_c++4 -lnetcdff -lnetcdf
>>>   -Wl,-rpath,/glade/apps/opt/netcdf/4.2/intel/default/lib -m64 -D__64BIT__
>>>   -Wl,--allow-shlib-undefined -Wl,--enable-new-dtags
>>>   -Wl,-rpath,/opt/ibmhpc/pe1209/mpich2/intel/lib64
>>>   -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64
>>>   -I/opt/ibmhpc/pe1209/mpich2/intel/include64 -I/opt/ibmhpc/pe1209/base/include
>>>   -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi -ldl
>>>   -L/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -lirc -lpthread -lrt
>>>
>>>   Note the -I/opt/ibmhpc/pe1209/base/include
>>> -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi   which is probably some IBM
>>> hack job of some ancient mpich2
>>>
>>>   Now the page
>>> http://www2.cisl.ucar.edu/resources/yellowstone/software/modules-intel-dependent has the modules
>>>
>>>
>>> impi/4.0.3.008  This module loads the Intel MPI Library. See
>>> http://software.intel.com/en-us/intel-mpi-library/ for details.
>>> impi/4.1.0.030  This module loads the Intel MPI Library. See
>>> http://software.intel.com/en-us/intel-mpi-library/ for details.
>>>
>>>   Perhaps you could load one of those modules with the Intel compilers and
>>> avoid the IBM MPI? If that solves the problem then we know the IBM MPI is to
>>> blame. We are interested in working with you to determine the problem.
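>>>
>>>   For example, something along these lines (just a sketch; the exact module
>>> and wrapper names on Yellowstone may differ), keeping the rest of your
>>> configure options unchanged:
>>>
>>>     module load impi/4.1.0.030
>>>     ./configure --with-cc=mpiicc --with-cxx=mpiicpc --with-fc=mpiifort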
>>>
>>>    Barry
>>>
>>>
>>>
>>>
>>> On Jun 23, 2013, at 9:14 PM, Fande Kong <fd.kong at siat.ac.cn> wrote:
>>>
>>> > Thanks Barry,
>>> > Thanks Jed,
>>> >
>>> > The computer I am using is Yellowstone:
>>> > http://en.wikipedia.org/wiki/Yellowstone_(supercomputer), or
>>> > http://www2.cisl.ucar.edu/resources/yellowstone. The compiler is the Intel
>>> > compiler, and the MPI is IBM MPI, which is part of IBM PE.
>>> >
>>> > With fewer unknowns (about 5 \times 10^7), the code runs correctly. With
>>> > more unknowns (4 \times 10^8), the code produces the error messages. But
>>> > with that many unknowns (4 \times 10^8), the code also runs fine on fewer
>>> > cores. This is very strange.
>>> >
>>> > When I switch to the GNU compiler, I cannot install PETSc; I get the
>>> > following errors:
>>> >
>>> >
>>> *******************************************************************************
>>> >          UNABLE to CONFIGURE with GIVEN OPTIONS    (see configure.log
>>> for details):
>>> >
>>> -------------------------------------------------------------------------------
>>> > Downloaded exodusii could not be used. Please check install in
>>> /glade/p/work/fandek/petsc/arch-linux2-cxx-opt_gnu
>>> >
>>> *******************************************************************************
>>> >   File "./config/configure.py", line 293, in petsc_configure
>>> >     framework.configure(out = sys.stdout)
>>> >   File
>>> "/glade/p/work/fandek/petsc/config/BuildSystem/config/framework.py", line
>>> 933, in configure
>>> >     child.configure()
>>> >   File
>>> "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line
>>> 556, in configure
>>> >     self.executeTest(self.configureLibrary)
>>> >   File "/glade/p/work/fandek/petsc/config/BuildSystem/config/base.py",
>>> line 115, in executeTest
>>> >     ret = apply(test, args,kargs)
>>> >   File
>>> "/glade/p/work/fandek/petsc/config/BuildSystem/config/packages/exodusii.py",
>>> line 36, in configureLibrary
>>> >     config.package.Package.configureLibrary(self)
>>> >   File
>>> "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line
>>> 484, in configureLibrary
>>> >     for location, directory, lib, incl in self.generateGuesses():
>>> >   File
>>> "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line
>>> 238, in generateGuesses
>>> >     raise RuntimeError('Downloaded '+self.package+' could not be used.
>>> Please check install in '+d+'\n')
>>> >
>>> >
>>> > The configure.log is attached.
>>> >
>>> > Regards,
>>> > On Mon, Jun 24, 2013 at 1:03 AM, Jed Brown <jedbrown at mcs.anl.gov>
>>> wrote:
>>> > Barry Smith <bsmith at mcs.anl.gov> writes:
>>> >
>>> > >    What kind of computer system are you running? What MPI does it
>>> use? These values are nonsense MPI_SOURCE=-32766 MPI_TAG=-32766
>>> >
>>> > From configure.log, this is Intel MPI.  Can you ask their support what
>>> > this error condition is supposed to mean?  It's not clear to me that
>>> > MPI_SOURCE or MPI_TAG contain any meaningful information (though it
>>> > could be indicative of an internal overflow), but this value of
>>> > MPI_ERROR should mean something.
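>>> >
>>> > For instance, something along these lines (just a sketch in plain MPI, not
>>> > PETSc code) could translate the status fields into something readable:
>>> >
>>> >   #include <mpi.h>
>>> >   #include <stdio.h>
>>> >
>>> >   /* Print a readable description of a status's MPI_ERROR field.  If the
>>> >      field does not hold a valid error code, MPI_Error_string can itself
>>> >      fail, which would also be informative. */
>>> >   void report_status(const MPI_Status *status)
>>> >   {
>>> >     char msg[MPI_MAX_ERROR_STRING] = "unknown";
>>> >     int  len = 0, eclass = 0;
>>> >     MPI_Error_class(status->MPI_ERROR, &eclass);
>>> >     MPI_Error_string(status->MPI_ERROR, msg, &len);
>>> >     printf("MPI_SOURCE=%d MPI_TAG=%d MPI_ERROR=%d (class %d): %s\n",
>>> >            status->MPI_SOURCE, status->MPI_TAG, status->MPI_ERROR, eclass, msg);
>>> >   }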
>>> >
>>> > >     Is it possible to run the code with valgrind?
>>> > >
>>> > >     Any chance of running the code with a different compiler?
>>> > >
>>> > >    Barry
>>> > >
>>> > >
>>> > >
>>> > > On Jun 23, 2013, at 4:12 AM, Fande Kong <fd.kong at siat.ac.cn> wrote:
>>> > >
>>> > >> Thanks Jed,
>>> > >>
>>> > >> I added your code to PETSc and ran my code on 10240 cores. I got the
>>> > >> following error messages:
>>> > >>
>>> > >> [6724]PETSC ERROR: --------------------- Error Message
>>> ------------------------------------
>>> > >> [6724]PETSC ERROR: Petsc has generated inconsistent data!
>>> > >> [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11
>>> MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
>>> > >> [6724]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [6724]PETSC ERROR: Petsc Release Version 3.4.1, unknown
>>> > >> [6724]PETSC ERROR: See docs/changes/index.html for recent updates.
>>> > >> [6724]PETSC ERROR: See docs/faq.html for hints about trouble
>>> shooting.
>>> > >> [6724]PETSC ERROR: See docs/index.html for manual pages.
>>> > >> [6724]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [6724]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug
>>> named ys4350 by fandek Sun Jun 23 02:58:23 2013
>>> > >> [6724]PETSC ERROR: Libraries linked from
>>> /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
>>> > >> [6724]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
>>> > >> [6724]PETSC ERROR: Configure options --with-valgrind=1
>>> --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1
>>> --download-f-blas-lapack=1 --with-mpi=1 --d
>>> > >> ownload-parmetis=1 --download-metis=1 --with-64-bit-indices=1
>>> --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1
>>> --download-hdf5=1 --with-debugging=yes
>>> > >> [6724]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [6724]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
>>> > >> [6724]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
>>> > >> [6724]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
>>> > >> [6724]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in
>>> meshreorder.cpp
>>> > >> [6724]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in
>>> meshreorder.cpp
>>> > >> [6724]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
>>> > >> [6724]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
>>> > >> [6724]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
>>> > >> Abort(77) on node 6724 (rank 6724 in comm 1140850688): application
>>> called MPI_Abort(MPI_COMM_WORLD, 77) - process 6724
>>> > >> [2921]PETSC ERROR: --------------------- Error Message
>>> ------------------------------------
>>> > >> [2921]PETSC ERROR: Petsc has generated inconsistent data!
>>> > >> [2921]PETSC ERROR: Negative MPI source: stash->nrecvs=15 i=3
>>> MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=3825270!
>>> > >> [2921]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [2921]PETSC ERROR: Petsc Release Version 3.4.1, unknown
>>> > >> [2921]PETSC ERROR: See docs/changes/index.html for recent updates.
>>> > >> [2921]PETSC ERROR: See docs/faq.html for hints about trouble
>>> shooting.
>>> > >> [2921]PETSC ERROR: See docs/index.html for manual pages.
>>> > >> [2921]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [2921]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug
>>> named ys0270 by fandek Sun Jun 23 02:58:23 2013
>>> > >> [2921]PETSC ERROR: Libraries linked from
>>> /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
>>> > >> [2921]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
>>> > >> [2921]PETSC ERROR: Configure options --with-valgrind=1
>>> --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1
>>> --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1
>>> --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1
>>> --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1
>>> --with-debugging=yes
>>> > >> [2921]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> > >> [2921]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
>>> > >> [2921]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
>>> > >> [2921]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
>>> > >> [2921]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in
>>> meshreorder.cpp
>>> > >> [2921]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in
>>> meshreorder.cpp
>>> > >> [2921]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
>>> > >> [2921]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
>>> > >> [2921]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
>>> > >> :
>>> > >>
>>> > >> On Fri, Jun 21, 2013 at 4:33 AM, Jed Brown <jedbrown at mcs.anl.gov>
>>> wrote:
>>> > >> Fande Kong <fd.kong at siat.ac.cn> writes:
>>> > >>
>>> > >> > The code works well with less cores. And It also works well with
>>> > >> > petsc-3.3-p7. But it does not work with petsc-3.4.1. Thus, If you
>>> can check
>>> > >> > the differences between petsc-3.3-p7 and petsc-3.4.1, you can
>>> figure out
>>> > >> > the reason.
>>> > >>
>>> > >> That is one way to start debugging, but there are no changes to the
>>> core
>>> > >> MatStash code, and many, many changes to PETSc in total.  The
>>> relevant
>>> > >> snippet of code is here:
>>> > >>
>>> > >>     if (stash->reproduce) {
>>> > >>       i    = stash->reproduce_count++;
>>> > >>       ierr = MPI_Wait(stash->recv_waits+i,&recv_status);CHKERRQ(ierr);
>>> > >>     } else {
>>> > >>       ierr = MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);CHKERRQ(ierr);
>>> > >>     }
>>> > >>     if (recv_status.MPI_SOURCE < 0) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source!");
>>> > >>
>>> > >> So MPI returns correctly (stash->reproduce will be FALSE unless you
>>> > >> changed it).  You could change the line above to the following:
>>> > >>
>>> > >>   if (recv_status.MPI_SOURCE < 0) SETERRQ5(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source: stash->nrecvs=%D i=%d MPI_SOURCE=%d MPI_TAG=%d MPI_ERROR=%d",
>>> > >>                                            stash->nrecvs,i,recv_status.MPI_SOURCE,recv_status.MPI_TAG,recv_status.MPI_ERROR);
>>> > >>
>>> > >>
>>> > >> It would help to debug with --with-debugging=1, so that more checks for
>>> > >> corrupt data are performed.  You can still make the compiler optimize if
>>> > >> it takes a long time to reach the error condition.
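>>> > >>
>>> > >> For example (just a sketch; keep the rest of your configure options the
>>> > >> same, and adjust the flags to your compiler):
>>> > >>
>>> > >>   ./configure --with-debugging=1 COPTFLAGS='-g -O2' CXXOPTFLAGS='-g -O2'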
>>> > >>
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Fande Kong
>>> > >> ShenZhen Institutes of Advanced Technology
>>> > >> Chinese Academy of Sciences
>>> >
>>> >
>>> >
>>> > --
>>> > Fande Kong
>>> > ShenZhen Institutes of Advanced Technology
>>> > Chinese Academy of Sciences
>>> > <configure.zip>
>>>
>>>
>>>
>>
>>
>> --
>> Fande Kong
>> ShenZhen Institutes of Advanced Technology
>> Chinese Academy of Sciences
>>
>
>
>
> --
> Fande Kong
> ShenZhen Institutes of Advanced Technology
> Chinese Academy of Sciences
>
>
> ________________
> Peter Lichtner
> Santa Fe, NM 87507
> (505) 692-4029 (c)
> OFM Research/LANL Guest Scientist
>
>


-- 
Fande Kong
ShenZhen Institutes of Advanced Technology
Chinese Academy of Sciences