[petsc-users] How to understand these error messages

Fande Kong fd.kong at siat.ac.cn
Sun Oct 20 15:04:39 CDT 2013


This behaviour is really, really strange.

The Yellowstone supercomputer updated the IBM PE to version 1.3.0.4 about
two months ago (
https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf).
I recompiled PETSc and my code as they suggested. Unfortunately, this
problem reoccurs even with a small number of processors (512 cores).

The problem previously happened only with a large number of processors and
a large problem size, but now it occurs even with a small number of
processors and a small problem, whether I use IBM MPI or Intel MPI.

The exact same code runs fine on another supercomputer. I think the code in
matstash.c is really sensitive to the IBM PE, and it is hard for me to fix.
Can we disable the stash so that I send the off-processor data myself? Or
can we attach a scatter to the matrix to exchange off-processor values?
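For example, if I generate every entry on its owning rank, I think a sketch
like the following (using MatSetOption with MAT_NO_OFF_PROC_ENTRIES, which,
as I understand it, lets MatAssemblyBegin/End skip the stash send/receive
path in matstash.c) might avoid the problematic code entirely; the matrix
size and values here are just placeholders, and error checking is omitted:

```c
/* Sketch: bypass the MatStash communication path by setting only
 * locally owned entries.  MAT_NO_OFF_PROC_ENTRIES promises PETSc that
 * no process will insert entries owned by another rank, so assembly
 * skips the stash exchange that fails here. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt rstart, rend, i;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 100, 100);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* Promise PETSc that every entry we set is locally owned. */
  MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);

  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    PetscScalar v = 1.0;                            /* placeholder value */
    MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}
```

Of course, this only works if I can restructure my assembly so that
off-processor contributions are communicated beforehand by my own code.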

error messages:

[76]PETSC ERROR: --------------------- Error Message
------------------------------------
[76]PETSC ERROR: Petsc has generated inconsistent data!
[76]PETSC ERROR: Negative MPI source: stash->nrecvs=5 i=7 MPI_SOURCE=-32766
MPI_TAG=-32766 MPI_ERROR=371173!
[76]PETSC ERROR:
------------------------------------------------------------------------
[76]PETSC ERROR: Petsc Release Version 3.4.1, unknown
[76]PETSC ERROR: See docs/changes/index.html for recent updates.
[76]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[76]PETSC ERROR: See docs/index.html for manual pages.
[76]PETSC ERROR:
------------------------------------------------------------------------
[76]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt named ys0623
by fandek Sat Oct 19 00:26:16 2013
[76]PETSC ERROR: Libraries linked from
/glade/p/work/fandek/petsc/arch-linux2-cxx-opt/lib
[76]PETSC ERROR: Configure run at Fri Oct 18 23:57:35 2013
[76]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx
--with-shared-libraries=1 --with-dynamic-loading=1
--download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1
--download-metis=1 --with-64-bit-indices=1 --download-netcdf=1
--download-exodusii=1 --download-ptscotch=1 --download-hdf5=1
--with-debugging=no
[76]PETSC ERROR:
------------------------------------------------------------------------
[76]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in
/glade/p/work/fandek/petsc/src/mat/utils/matstash.c
[76]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in
/glade/p/work/fandek/petsc/src/mat/impls/aij/mpi/mpiaij.c
[76]PETSC ERROR: MatAssemblyEnd() line 4939 in
/glade/p/work/fandek/petsc/src/mat/interface/matrix.c
[76]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
[76]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in
meshreorder.cpp
[76]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
[76]PETSC ERROR: DMmeshInitialize() line 95 in mgInitialize.cpp
[76]PETSC ERROR: main() line 69 in linearElasticity3d.cpp
Abort(77) on node 76 (rank 76 in comm 1140850688): application called
MPI_Abort(MPI_COMM_WORLD, 77) - process 76
ERROR: 0031-300  Forcing all remote tasks to exit due to exit code 1 in
task 76


Thanks



On Wed, Jun 26, 2013 at 8:56 AM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> This concerns IBM PE-MPI on iDataPlex, which is likely based upon the
> cluster implementation of PAMI, which is a completely different code base
> from the PAMI Blue Gene implementation.  If you can reproduce it on Blue
> Gene/Q, I will care.
>
> As an IBM customer, NCAR is endowed with the ability to file bug reports
> directly with IBM related to the products they possess.  There is a link to
> their support system on http://www2.cisl.ucar.edu/resources/yellowstone,
> which is the appropriate channel for users of Yellowstone that have issues
> with the system software installed there.
>
> Jeff
>
> ----- Original Message -----
> From: "Jed Brown" <jedbrown at mcs.anl.gov>
> To: "Fande Kong" <fd.kong at siat.ac.cn>, "petsc-users" <
> petsc-users at mcs.anl.gov>
> Cc: "Jeff Hammond" <jhammond at alcf.anl.gov>
> Sent: Wednesday, June 26, 2013 9:21:48 AM
> Subject: Re: [petsc-users] How to understand these error messages
>
> Fande Kong <fd.kong at siat.ac.cn> writes:
>
> > Hi Barry,
> >
> > If I use the intel mpi, my code can correctly run and can produce some
> > correct results. Yes, you are right. The IBM MPI has some bugs.
>
> Fande, please report this issue to the IBM.
>
> Jeff, Fande has a reproducible case where when running on 10k cores and
> problem sizes over 100M, this
>
>   MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);
>
> returns
>
>       [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11
>       MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
>
> It runs correctly for smaller problem sizes, smaller core counts, or for
> all sizes when using Intel MPI.  This is on Yellowstone (iDataPlex, 4500
> dx360 nodes).  Do you know someone at IBM that should be notified?
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> ALCF docs: http://www.alcf.anl.gov/user-guides
>
>

