Hi Barry,

How can we use valgrind to debug a parallel program on a supercomputer with many cores? If we follow the instruction "mpiexec -n NPROC valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p PETSCPROGRAMNAME -malloc off PROGRAMOPTIONS", then for 10000 cores 10000 log files will be written. We probably need to gather all the information into a single file. How can we do this?
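One idea (just a sketch; I have not tried it at this scale, and "valgrind.all.log" is only an illustrative name) is to merge the per-rank logs after the job finishes, e.g.

    cat valgrind.log.* > valgrind.all.log
    # or, to keep track of which PID/rank each line came from:
    grep . valgrind.log.* > valgrind.all.log

But is there a better way, maybe one that avoids creating 10000 files in the first place?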
On Mon, Jun 24, 2013 at 9:33 PM, Peter Lichtner <peter.lichtner@gmail.com> wrote:
Just in case this helps: I use Yellowstone for running PFLOTRAN with both the gcc and Intel compilers, using the developer version of PETSc. My configure script for Intel reads:
./config/configure.py --with-cc=mpicc --with-fc=mpif90 --with-cxx=mpicxx --with-clanguage=c --with-blas-lapack-dir=$BLAS_LAPACK_LIB_DIR --with-shared-libraries=0 --with-debugging=0 --download-hdf5=yes --download-parmetis=yes --download-metis=yes

echo $BLAS_LAPACK_LIB_DIR
/ncar/opt/intel/12.1.0.233/composer_xe_2013.1.117/mkl

module load cmake/2.8.10.2

Intel was a little faster compared to gcc.

...Peter
On Jun 24, 2013, at 1:53 AM, Fande Kong <fd.kong@siat.ac.cn> wrote:

Hi Barry,

I switched to the gnu compiler and got similar results:
[330]PETSC ERROR: --------------------- Error Message ------------------------------------
[330]PETSC ERROR: Petsc has generated inconsistent data!
[330]PETSC ERROR: Negative MPI source: stash->nrecvs=27 i=33 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=5243744!
[330]PETSC ERROR: ------------------------------------------------------------------------
[330]PETSC ERROR: Petsc Release Version 3.4.1, unknown
[330]PETSC ERROR: See docs/changes/index.html for recent updates.
[330]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[330]PETSC ERROR: See docs/index.html for manual pages.
[330]PETSC ERROR: ------------------------------------------------------------------------
[330]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt_gnu named ys0554 by fandek Mon Jun 24 01:42:37 2013
[330]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-opt_gnu/lib
[330]PETSC ERROR: Configure run at Mon Jun 24 00:34:40 2013
[330]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=no
[330]PETSC ERROR: ------------------------------------------------------------------------
[330]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
[330]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
[330]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
[330]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
[330]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
[330]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
[330]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
[330]PETSC ERROR: main() line 71 in linearElasticity3d.cpp

Thus, I think it has nothing to do with the compiler.
On Sun, Jun 23, 2013 at 11:45 PM, Fande Kong <fd.kong@siat.ac.cn> wrote:
Thanks Barry,

I will try impi.

I have another question. In the previous email you asked whether I could change to another compiler. Why would I need to change the compiler?
On Mon, Jun 24, 2013 at 12:27 PM, Barry Smith <bsmith@mcs.anl.gov> wrote:
Fande,

We've seen trouble before with IBM on large Intel systems at scale.

From the previous configure.log you sent I see

sh: mpicc -show
Executing: mpicc -show
sh: /ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/bin/intel64/icc -I/glade/apps/el6/include -I/glade/apps/el6/usr/include -I/glade/apps/opt/netcdf/4.2/intel/default/include -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/ia32 -L/glade/apps/el6/usr/lib -L/glade/apps/el6/usr/lib64 -Wl,-rpath,/glade/apps/el6/usr/lib -Wl,-rpath,/glade/apps/el6/usr/lib64 -L/glade/apps/opt/netcdf/4.2/intel/default/lib -lnetcdf_c++4 -lnetcdff -lnetcdf -Wl,-rpath,/glade/apps/opt/netcdf/4.2/intel/default/lib -m64 -D__64BIT__ -Wl,--allow-shlib-undefined -Wl,--enable-new-dtags -Wl,-rpath,/opt/ibmhpc/pe1209/mpich2/intel/lib64 -Wl,-rpath,/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -I/opt/ibmhpc/pe1209/mpich2/intel/include64 -I/opt/ibmhpc/pe1209/base/include -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi -ldl -L/ncar/opt/intel/12.1.0.233/composer_xe_2011_sp1.11.339/compiler/lib/intel64 -lirc -lpthread -lrt

Note the -I/opt/ibmhpc/pe1209/base/include -L/opt/ibmhpc/pe1209/mpich2/intel/lib64 -lmpi, which is probably some IBM hack job of some ancient mpich2.

Now the page http://www2.cisl.ucar.edu/resources/yellowstone/software/modules-intel-dependent has the modules

impi/4.0.3.008   This module loads the Intel MPI Library. See http://software.intel.com/en-us/intel-mpi-library/ for details.
impi/4.1.0.030   This module loads the Intel MPI Library. See http://software.intel.com/en-us/intel-mpi-library/ for details.

Perhaps you could load those modules with the Intel compilers and avoid the IBM MPI? If that solves the problem then we know the IBM MPI is to blame. We are interested in working with you to determine the problem.
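For example, something along these lines (just a sketch; I'm guessing at the exact module incantation, so check module avail for what is really installed):

   module load intel impi/4.1.0.030
   mpicc -show

and then rerun configure once mpicc -show no longer points into /opt/ibmhpc. (With Intel MPI the compiler wrappers may instead be called mpiicc/mpiicpc/mpiifort.)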

Barry

On Jun 23, 2013, at 9:14 PM, Fande Kong <fd.kong@siat.ac.cn> wrote:

> Thanks Barry,
> Thanks Jed,
>
> The computer I am using is Yellowstone, http://en.wikipedia.org/wiki/Yellowstone_(supercomputer), or http://www2.cisl.ucar.edu/resources/yellowstone. The compiler is the Intel compiler. The MPI is IBM MPI, which is part of IBM PE.
>
> With fewer unknowns (about 5 \times 10^7), the code runs correctly. With more unknowns (4 \times 10^8), the code produces the error messages. But with that many unknowns (4 \times 10^8), the code can still run on fewer cores. This is very strange.
>
> When I switch to the gnu compiler, I cannot install PETSc; I get the following errors:
>
> *******************************************************************************
> UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
> -------------------------------------------------------------------------------
> Downloaded exodusii could not be used. Please check install in /glade/p/work/fandek/petsc/arch-linux2-cxx-opt_gnu
> *******************************************************************************
> File "./config/configure.py", line 293, in petsc_configure
> framework.configure(out = sys.stdout)
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/framework.py", line 933, in configure
> child.configure()
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 556, in configure
> self.executeTest(self.configureLibrary)
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/base.py", line 115, in executeTest
> ret = apply(test, args,kargs)
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/packages/exodusii.py", line 36, in configureLibrary
> config.package.Package.configureLibrary(self)
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 484, in configureLibrary
> for location, directory, lib, incl in self.generateGuesses():
> File "/glade/p/work/fandek/petsc/config/BuildSystem/config/package.py", line 238, in generateGuesses
> raise RuntimeError('Downloaded '+self.package+' could not be used. Please check install in '+d+'\n')
>
>
> The configure.log is attached.
>
> Regards,
> On Mon, Jun 24, 2013 at 1:03 AM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
> Barry Smith <bsmith@mcs.anl.gov> writes:
>
> > What kind of computer system are you running? What MPI does it use? These values are nonsense MPI_SOURCE=-32766 MPI_TAG=-32766
>
> From configure.log, this is Intel MPI. Can you ask their support what
> this error condition is supposed to mean? It's not clear to me that
> MPI_SOURCE or MPI_TAG contain any meaningful information (though it
> could be indicative of an internal overflow), but this value of
> MPI_ERROR should mean something.
>
> > Is it possible to run the code with valgrind?
> >
> > Any chance of running the code with a different compiler?
> >
> > Barry
> >
> >
> > On Jun 23, 2013, at 4:12 AM, Fande Kong <fd.kong@siat.ac.cn> wrote:
> >
> >> Thanks Jed,
> >>
> >> I added your code into PETSc and ran my code with 10240 cores. I got the following error messages:
> >>
> >> [6724]PETSC ERROR: --------------------- Error Message ------------------------------------
> >> [6724]PETSC ERROR: Petsc has generated inconsistent data!
> >> [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> >> [6724]PETSC ERROR: See docs/changes/index.html for recent updates.
> >> [6724]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> >> [6724]PETSC ERROR: See docs/index.html for manual pages.
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug named ys4350 by fandek Sun Jun 23 02:58:23 2013
> >> [6724]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
> >> [6724]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
> >> [6724]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=yes
> >> [6724]PETSC ERROR: ------------------------------------------------------------------------
> >> [6724]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
> >> [6724]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
> >> [6724]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
> >> [6724]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> >> [6724]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> >> [6724]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> >> [6724]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
> >> [6724]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
> >> Abort(77) on node 6724 (rank 6724 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 6724
> >> [2921]PETSC ERROR: --------------------- Error Message ------------------------------------
> >> [2921]PETSC ERROR: Petsc has generated inconsistent data!
> >> [2921]PETSC ERROR: Negative MPI source: stash->nrecvs=15 i=3 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=3825270!
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> >> [2921]PETSC ERROR: See docs/changes/index.html for recent updates.
> >> [2921]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> >> [2921]PETSC ERROR: See docs/index.html for manual pages.
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-debug named ys0270 by fandek Sun Jun 23 02:58:23 2013
> >> [2921]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-debug/lib
> >> [2921]PETSC ERROR: Configure run at Sun Jun 23 00:46:05 2013
> >> [2921]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=yes
> >> [2921]PETSC ERROR: ------------------------------------------------------------------------
> >> [2921]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /src/mat/utils/matstash.c
> >> [2921]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /src/mat/impls/aij/mpi/mpiaij.c
> >> [2921]PETSC ERROR: MatAssemblyEnd() line 4939 in /src/mat/interface/matrix.c
> >> [2921]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> >> [2921]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> >> [2921]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> >> [2921]PETSC ERROR: DMmeshInitialize() line 78 in mgInitialize.cpp
> >> [2921]PETSC ERROR: main() line 71 in linearElasticity3d.cpp
> >> :
> >>
> >> On Fri, Jun 21, 2013 at 4:33 AM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
> >> Fande Kong <fd.kong@siat.ac.cn> writes:
> >>
> >> > The code works well with fewer cores, and it also works well with
> >> > petsc-3.3-p7. But it does not work with petsc-3.4.1. Thus, if you can check
> >> > the differences between petsc-3.3-p7 and petsc-3.4.1, you can figure out
> >> > the reason.
> >>
> >> That is one way to start debugging, but there are no changes to the core
> >> MatStash code, and many, many changes to PETSc in total. The relevant
> >> snippet of code is here:
> >>
> >>   if (stash->reproduce) {
> >>     i = stash->reproduce_count++;
> >>     ierr = MPI_Wait(stash->recv_waits+i,&recv_status);CHKERRQ(ierr);
> >>   } else {
> >>     ierr = MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);CHKERRQ(ierr);
> >>   }
> >>   if (recv_status.MPI_SOURCE < 0) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source!");
> >>
> >> So MPI returns correctly (stash->reproduce will be FALSE unless you
> >> changed it). You could change the line above to the following:
> >>
> >>   if (recv_status.MPI_SOURCE < 0) SETERRQ5(PETSC_COMM_SELF,PETSC_ERR_PLIB,"Negative MPI source: stash->nrecvs=%D i=%d MPI_SOURCE=%d MPI_TAG=%d MPI_ERROR=%d",
> >>                                            stash->nrecvs,i,recv_status.MPI_SOURCE,recv_status.MPI_TAG,recv_status.MPI_ERROR);
> >>
> >> It would help to debug with --with-debugging=1, so that more checks for
> >> corrupt data are performed. You can still make the compiler optimize if
> >> it takes a long time to reach the error condition.
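> >>
> >> For example, something along the lines of (a sketch; COPTFLAGS/CXXOPTFLAGS/FOPTFLAGS are the
> >> usual configure variables for passing optimization flags, and <your other options> stands for
> >> the options you already use):
> >>
> >>   ./config/configure.py --with-debugging=1 COPTFLAGS='-g -O2' CXXOPTFLAGS='-g -O2' FOPTFLAGS='-g -O2' <your other options>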
> >>
> >>
> >> --
> >> Fande Kong
> >> ShenZhen Institutes of Advanced Technology
> >> Chinese Academy of Sciences
>
>
> --
> Fande Kong
> ShenZhen Institutes of Advanced Technology
> Chinese Academy of Sciences
>
> <configure.zip>

________________
Peter Lichtner
Santa Fe, NM 87507
(505) 692-4029 (c)
OFM Research/LANL Guest Scientist

--
Fande Kong
ShenZhen Institutes of Advanced Technology
Chinese Academy of Sciences