<div dir="ltr"><div><div><div><div>This behaviour is really strange.<br><br></div>The Yellowstone supercomputer updated the IBM PE to version 1.3.0.4 about two months ago (<a href="https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf">https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf</a>). I recompiled PETSc and my code as they suggested. Unfortunately, the problem recurs even with a small number of processors (512 cores).<br>
<br></div>Previously the problem only happened with a large number of processors and a large problem size, but now it occurs even with a small number of processors and a small problem, whether using IBM MPI or Intel MPI.<br>
<br></div>Exactly the same code runs correctly on another supercomputer. I think matstash.c is really sensitive to the IBM PE, and it is hard for me to fix. Can we disable the stash so that I send the off-process data myself? Or can we attach a scatter to the Mat to exchange off-process values?<br>
<br></div>error messages:<br><br>[76]PETSC ERROR: --------------------- Error Message ------------------------------------<br>[76]PETSC ERROR: Petsc has generated inconsistent data!<br>[76]PETSC ERROR: Negative MPI source: stash->nrecvs=5 i=7 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=371173!<br>
[76]PETSC ERROR: ------------------------------------------------------------------------<br>[76]PETSC ERROR: Petsc Release Version 3.4.1, unknown <br>[76]PETSC ERROR: See docs/changes/index.html for recent updates.<br>[76]PETSC ERROR: See docs/faq.html for hints about trouble shooting.<br>
[76]PETSC ERROR: See docs/index.html for manual pages.<br>[76]PETSC ERROR: ------------------------------------------------------------------------<br>[76]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt named ys0623 by fandek Sat Oct 19 00:26:16 2013<br>
[76]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-opt/lib<br>[76]PETSC ERROR: Configure run at Fri Oct 18 23:57:35 2013<br>[76]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=no<br>
[76]PETSC ERROR: ------------------------------------------------------------------------<br>[76]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /glade/p/work/fandek/petsc/src/mat/utils/matstash.c<br>[76]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /glade/p/work/fandek/petsc/src/mat/impls/aij/mpi/mpiaij.c<br>
[76]PETSC ERROR: MatAssemblyEnd() line 4939 in /glade/p/work/fandek/petsc/src/mat/interface/matrix.c<br>[76]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp<br>[76]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp<br>
[76]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp<br>[76]PETSC ERROR: DMmeshInitialize() line 95 in mgInitialize.cpp<br>[76]PETSC ERROR: main() line 69 in linearElasticity3d.cpp<br>Abort(77) on node 76 (rank 76 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 76<br>
ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 76<br><div><br><br><div><div><div><div><div class="gmail_extra">Thanks <br></div><div class="gmail_extra"><br><br><br><div class="gmail_quote">On Wed, Jun 26, 2013 at 8:56 AM, Jeff Hammond <span dir="ltr"><<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">This concerns IBM PE-MPI on iDataPlex, which is likely based upon the cluster implementation of PAMI, which is a completely different code base from the PAMI Blue Gene implementation. If you can reproduce it on Blue Gene/Q, I will care.<br>
<br>
As an IBM customer, NCAR is endowed with the ability to file bug reports directly with IBM related to the products they possess. There is a link to their support system on <a href="http://www2.cisl.ucar.edu/resources/yellowstone" target="_blank">http://www2.cisl.ucar.edu/resources/yellowstone</a>, which is the appropriate channel for users of Yellowstone that have issues with the system software installed there.<br>
<br>
Jeff<br>
<div><div><br>
----- Original Message -----<br>
From: "Jed Brown" <<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>><br>
To: "Fande Kong" <<a href="mailto:fd.kong@siat.ac.cn" target="_blank">fd.kong@siat.ac.cn</a>>, "petsc-users" <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
Cc: "Jeff Hammond" <<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>><br>
Sent: Wednesday, June 26, 2013 9:21:48 AM<br>
Subject: Re: [petsc-users] How to understand these error messages<br>
<br>
Fande Kong <<a href="mailto:fd.kong@siat.ac.cn" target="_blank">fd.kong@siat.ac.cn</a>> writes:<br>
<br>
> Hi Barry,<br>
><br>
> If I use the intel mpi, my code can correctly run and can produce some<br>
> correct results. Yes, you are right. The IBM MPI has some bugs.<br>
<br>
Fande, please report this issue to IBM.<br>
<br>
Jeff, Fande has a reproducible case where when running on 10k cores and<br>
problem sizes over 100M, this<br>
<br>
MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);<br>
<br>
returns<br>
<br>
[6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11<br>
MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!<br>
<br>
It runs correctly for smaller problem sizes, smaller core counts, or for<br>
all sizes when using Intel MPI. This is on Yellowstone (iDataPlex, 4500<br>
dx360 nodes). Do you know someone at IBM that should be notified?<br>
<br>
</div></div><span><font color="#888888">--<br>
Jeff Hammond<br>
Argonne Leadership Computing Facility<br>
University of Chicago Computation Institute<br>
<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a> / <a href="tel:%28630%29%20252-5381" value="+16302525381" target="_blank">(630) 252-5381</a><br>
<a href="http://www.linkedin.com/in/jeffhammond" target="_blank">http://www.linkedin.com/in/jeffhammond</a><br>
<a href="https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond" target="_blank">https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</a><br>
ALCF docs: <a href="http://www.alcf.anl.gov/user-guides" target="_blank">http://www.alcf.anl.gov/user-guides</a><br>
<br>
</font></span></blockquote></div><br></div></div></div></div></div></div></div>