<div dir="ltr">I can use the Intel MPI and compilers but still have the same problem, since all the MPI programs need to run on the IBM Parallel Environment: <a href="http://www-03.ibm.com/systems/software/parallel/">http://www-03.ibm.com/systems/software/parallel/</a>. <br>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Oct 20, 2013 at 2:15 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
It is unfortunate IBM has perpetuated this error in their libraries and made it worse.<br>
<br>
You can, of course, work around it by making your application code far more complicated and managing the matrix assembly yourself, but that is not a good use of your time or anyone's time. Besides, how do you know that their buggy MPI won't bite you somewhere else, such as in the new code you would need to write?<br>
<br>
You need to report this bug to IBM, and they need to take it seriously. Unfortunately, if you are not the purchaser of the IBM machine you are running on, they may not care (companies only care about paying customers who complain).<br>
<br>
Can you just use the Intel MPI compilers/libraries? Or switch to some other system that is not from IBM? It would be better not to use this machine until IBM straightens it out.<br>
<span class="HOEnZb"><font color="#888888"><br>
Barry<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
<br>
On Oct 20, 2013, at 3:04 PM, Fande Kong <<a href="mailto:fd.kong@siat.ac.cn">fd.kong@siat.ac.cn</a>> wrote:<br>
<br>
> This behaviour is really, really strange.<br>
><br>
> The Yellowstone supercomputer updated the IBM PE to version 1.3.0.4 about two months ago (<a href="https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf" target="_blank">https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf</a>). I recompiled PETSc and my code as they suggested. Unfortunately, the problem recurs even with a small number of processors (512 cores).<br>
><br>
> Previously the problem only happened with a large number of processors and a large problem size, but now it occurs even with a small number of processors and a small problem, whether using IBM MPI or Intel MPI.<br>
><br>
> Exactly the same code runs on another supercomputer. I think the code in matstash.c is really sensitive to the IBM PE, and it is hard for me to fix. Can we disable the stash and have me send the off-process data myself? Or can we attach a scatter to the matrix to exchange off-process values?<br>
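> (For what it's worth, PETSc does expose an option for this: if the application restructures its assembly so that every rank calls MatSetValues() only on rows it owns — exchanging any off-process contributions itself beforehand — then MatAssembly can be told to skip the stash communication entirely. A minimal sketch, assuming such a restructuring is feasible; the hypothetical function name is mine:<br>
><br>

```c
#include <petscmat.h>

/* Sketch (untested): bypass the MatStash by guaranteeing each rank
   inserts only locally owned rows.  Off-process contributions must
   then be exchanged by the application itself (e.g. with MPI point-to-
   point sends or a VecScatter) before insertion. */
PetscErrorCode AssembleLocalRowsOnly(Mat A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* Promise PETSc that no off-process entries will be generated,
     so assembly can skip the stash send/receive phase. */
  ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
  /* ... each rank computes and inserts only its own rows here ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

> There is also MAT_IGNORE_OFF_PROC_ENTRIES, which silently drops any off-process values rather than communicating them; that only helps if the discretization genuinely never produces them.)<br>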
><br>
> error messages:<br>
><br>
> [76]PETSC ERROR: --------------------- Error Message ------------------------------------<br>
> [76]PETSC ERROR: Petsc has generated inconsistent data!<br>
> [76]PETSC ERROR: Negative MPI source: stash->nrecvs=5 i=7 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=371173!<br>
> [76]PETSC ERROR: ------------------------------------------------------------------------<br>
> [76]PETSC ERROR: Petsc Release Version 3.4.1, unknown<br>
> [76]PETSC ERROR: See docs/changes/index.html for recent updates.<br>
> [76]PETSC ERROR: See docs/faq.html for hints about trouble shooting.<br>
> [76]PETSC ERROR: See docs/index.html for manual pages.<br>
> [76]PETSC ERROR: ------------------------------------------------------------------------<br>
> [76]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt named ys0623 by fandek Sat Oct 19 00:26:16 2013<br>
> [76]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-opt/lib<br>
> [76]PETSC ERROR: Configure run at Fri Oct 18 23:57:35 2013<br>
> [76]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=no<br>
> [76]PETSC ERROR: ------------------------------------------------------------------------<br>
> [76]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /glade/p/work/fandek/petsc/src/mat/utils/matstash.c<br>
> [76]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /glade/p/work/fandek/petsc/src/mat/impls/aij/mpi/mpiaij.c<br>
> [76]PETSC ERROR: MatAssemblyEnd() line 4939 in /glade/p/work/fandek/petsc/src/mat/interface/matrix.c<br>
> [76]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp<br>
> [76]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp<br>
> [76]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp<br>
> [76]PETSC ERROR: DMmeshInitialize() line 95 in mgInitialize.cpp<br>
> [76]PETSC ERROR: main() line 69 in linearElasticity3d.cpp<br>
> Abort(77) on node 76 (rank 76 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 76<br>
> ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 76<br>
><br>
><br>
> Thanks<br>
><br>
><br>
><br>
> On Wed, Jun 26, 2013 at 8:56 AM, Jeff Hammond <<a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a>> wrote:<br>
> This concerns IBM PE-MPI on iDataPlex, which is likely based upon the cluster implementation of PAMI, which is a completely different code base from the PAMI Blue Gene implementation. If you can reproduce it on Blue Gene/Q, I will care.<br>
><br>
> As an IBM customer, NCAR is endowed with the ability to file bug reports directly with IBM related to the products they possess. There is a link to their support system on <a href="http://www2.cisl.ucar.edu/resources/yellowstone" target="_blank">http://www2.cisl.ucar.edu/resources/yellowstone</a>, which is the appropriate channel for users of Yellowstone who have issues with the system software installed there.<br>
><br>
> Jeff<br>
><br>
> ----- Original Message -----<br>
> From: "Jed Brown" <<a href="mailto:jedbrown@mcs.anl.gov">jedbrown@mcs.anl.gov</a>><br>
> To: "Fande Kong" <<a href="mailto:fd.kong@siat.ac.cn">fd.kong@siat.ac.cn</a>>, "petsc-users" <<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>><br>
> Cc: "Jeff Hammond" <<a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a>><br>
> Sent: Wednesday, June 26, 2013 9:21:48 AM<br>
> Subject: Re: [petsc-users] How to understand these error messages<br>
><br>
> Fande Kong <<a href="mailto:fd.kong@siat.ac.cn">fd.kong@siat.ac.cn</a>> writes:<br>
><br>
> > Hi Barry,<br>
> ><br>
> > If I use the Intel MPI, my code runs correctly and produces<br>
> > correct results. Yes, you are right. The IBM MPI has some bugs.<br>
><br>
> Fande, please report this issue to IBM.<br>
><br>
> Jeff, Fande has a reproducible case: when running on 10k cores with<br>
> problem sizes over 100M, this<br>
><br>
> MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);<br>
><br>
> returns<br>
><br>
> [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11<br>
> MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!<br>
><br>
> It runs correctly for smaller problem sizes, smaller core counts, or for<br>
> all sizes when using Intel MPI. This is on Yellowstone (iDataPlex, 4500<br>
> dx360 nodes). Do you know someone at IBM that should be notified?<br>
><br>
> --<br>
> Jeff Hammond<br>
> Argonne Leadership Computing Facility<br>
> University of Chicago Computation Institute<br>
> <a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a> / <a href="tel:%28630%29%20252-5381" value="+16302525381">(630) 252-5381</a><br>
> <a href="http://www.linkedin.com/in/jeffhammond" target="_blank">http://www.linkedin.com/in/jeffhammond</a><br>
> <a href="https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond" target="_blank">https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</a><br>
> ALCF docs: <a href="http://www.alcf.anl.gov/user-guides" target="_blank">http://www.alcf.anl.gov/user-guides</a><br>
><br>
><br>
<br>
<br>
</div></div></blockquote></div><br></div>