[petsc-users] PetscSFReduceBegin does not work correctly on openmpi-1.4.3 with 64 integers

Satish Balay balay at mcs.anl.gov
Tue Sep 11 13:02:31 CDT 2012


What about using latest openmpi-1.6 series?

Satish

On Tue, 11 Sep 2012, Jed Brown wrote:

> Open MPI one-sided operations with datatypes still have known bugs. They
> have had bug reports with reduced test cases for several years now. They
> need to fix those bugs. Please let them know that you are also waiting...
> 
> To work around that, and for other reasons, I will write a new SF
> implementation using point-to-point.
> On Sep 11, 2012 12:44 PM, "fdkong" <fd.kong at foxmail.com> wrote:
> 
> > Hi Matt,
> >
> > Thanks. I guess there are two reasons:
> >
> > (1) The MPI function MPI_Accumulate with the operation MPI_REPLACE is not
> > supported by OpenMPI 1.4.3, or by other OpenMPI versions.
> >
> > (2) The MPI function does not accept the datatype MPIU_2INT when we use
> > 64-bit integers. But when we run on MPICH, it works well!
> >
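A minimal sketch of why point (2) is plausible, assuming (this is not stated in the thread) that a PETSc build with 64-bit indices describes MPIU_2INT as a contiguous pair of 64-bit integers rather than the predefined MPI_2INT pair of C ints. Python's ctypes is used only to compare the two memory layouts; `Pair32` and `Pair64` are hypothetical names, not PETSc or MPI types.

```python
import ctypes

# Hypothetical stand-in for a (rank, index) pair of 32-bit ints,
# the layout that the predefined MPI_2INT describes when int is 4 bytes.
class Pair32(ctypes.Structure):
    _fields_ = [("rank", ctypes.c_int32), ("index", ctypes.c_int32)]

# Hypothetical stand-in for the same pair with 64-bit indices
# (assumed layout of MPIU_2INT under --with-64-bit-indices=1).
class Pair64(ctypes.Structure):
    _fields_ = [("rank", ctypes.c_int64), ("index", ctypes.c_int64)]

# The two layouts differ in size, so one datatype cannot stand in for the other.
print(ctypes.sizeof(Pair32), ctypes.sizeof(Pair64))
```

If that assumption holds, the 64-bit build would pass a user-defined datatype to the one-sided call, which is the combination Jed describes as having known bugs in Open MPI.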
> > ------------------
> > Fande Kong
> > ShenZhen Institutes of Advanced Technology
> > Chinese Academy of Sciences
> >
> >
> >
> > ------------------ Original ------------------
> > *From: * "knepley"<knepley at gmail.com>;
> > *Date: * Tue, Sep 11, 2012 01:33 PM
> > *To: * "fdkong"<fd.kong at foxmail.com>;
> > *Cc: * "petsc-users"<petsc-users at mcs.anl.gov>;
> > *Subject: * Re: PetscSFReduceBegin does not work correctly on
> > openmpi-1.4.3 with 64 integers
> >
> > On Tue, Sep 11, 2012 at 12:05 AM, fdkong <fd.kong at foxmail.com> wrote:
> >
> >> Hi Matt,
> >>
> >> I tested src/sys/sf/examples/tutorials/ex1 on OpenMPI and MPICH
> >> separately. I found that the error comes from the function
> >> PetscSFReduceBegin, called by PetscSFCreateInverseSF. I used the command
> >> below:
> >>
> >
> > Thanks for testing this. I will run it myself and track down the bug.
> >
> > Matt
> >
> >> mpirun -n 2 ./ex1 -test_invert
> >>
> >> (1) On OpenMPI, got the result below:
> >>
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=3, leaves=2, remote ranks=1
> >> [0] 0 <- (1,1)
> >> [0] 1 <- (1,0)
> >> [1] Number of roots=2, leaves=3, remote ranks=1
> >> [1] 0 <- (0,1)
> >> [1] 1 <- (0,0)
> >> [1] 2 <- (0,2)
> >> ## Multi-SF
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=3, leaves=2, remote ranks=1
> >> [0] 0 <- (1,1)
> >> [0] 1 <- (1,0)
> >> [1] Number of roots=2, leaves=3, remote ranks=1
> >> [1] 0 <- (0,2)
> >> [1] 1 <- (0,0)
> >> [1] 2 <- (0,2)
> >> ## Inverse of Multi-SF
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=2, leaves=0, remote ranks=0
> >> [1] Number of roots=3, leaves=0, remote ranks=0
> >>
> >> (2) On MPICH, got the result below:
> >>
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=3, leaves=2, remote ranks=1
> >> [0] 0 <- (1,1)
> >> [0] 1 <- (1,0)
> >> [1] Number of roots=2, leaves=3, remote ranks=1
> >> [1] 0 <- (0,1)
> >> [1] 1 <- (0,0)
> >> [1] 2 <- (0,2)
> >> ## Multi-SF
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=3, leaves=2, remote ranks=1
> >> [0] 0 <- (1,1)
> >> [0] 1 <- (1,0)
> >> [1] Number of roots=2, leaves=3, remote ranks=1
> >> [1] 0 <- (0,1)
> >> [1] 1 <- (0,0)
> >> [1] 2 <- (0,2)
> >> ## Inverse of Multi-SF
> >> Star Forest Object: 2 MPI processes
> >> type not yet set
> >> synchronization=FENCE sort=rank-order
> >> [0] Number of roots=2, leaves=3, remote ranks=1
> >> [0] 0 <- (1,1)
> >> [0] 1 <- (1,0)
> >> [0] 2 <- (1,2)
> >> [1] Number of roots=3, leaves=2, remote ranks=1
> >> [1] 0 <- (0,1)
> >> [1] 1 <- (0,0)
> >>
> >> From the two results above, you can see that the inverse of the Multi-SF is
> >> incorrect on OpenMPI. Could you please debug this on OpenMPI (1.4.3)
> >> with 64-bit integers?
> >>
> >> In my code, I call DMComplexDistribute, which calls PetscSFCreateInverseSF,
> >> which calls PetscSFReduceBegin. I did a lot of debugging and found that the
> >> error comes from PetscSFReduceBegin.
> >>
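The call chain described above (PetscSFCreateInverseSF calling PetscSFReduceBegin) can be modeled serially. The following is a minimal sketch in plain Python, with no MPI, assuming the inverse is built by having each leaf send its own (rank, index) pair to its root with a replace operation, and that root entries start at (-1, -1). The function `invert` and its `deliver` flag are hypothetical, introduced only for illustration; they are not PETSc API.

```python
def invert(graph, deliver=True):
    """Invert a star forest given as {rank: (nroots, [(root_rank, root_index), ...])}.

    The list position of each pair is the local leaf index.
    deliver=False models a reduction whose writes never reach the roots.
    """
    # Root-side buffers start at (-1, -1), meaning "no leaf wrote here".
    roots = {r: [(-1, -1)] * nroots for r, (nroots, _) in graph.items()}

    # Reduce with "replace": every leaf writes its own (rank, leaf index)
    # into the root slot it points at.
    if deliver:
        for r, (_, leaves) in graph.items():
            for i, (root_rank, root_index) in enumerate(leaves):
                roots[root_rank][root_index] = (r, i)

    # In the inverse, each written root slot becomes a leaf, and the old
    # leaf count becomes the new root count.
    inverse = {}
    for r, (_, leaves) in graph.items():
        new_leaves = {j: e for j, e in enumerate(roots[r]) if e[0] >= 0}
        inverse[r] = (len(leaves), new_leaves)
    return inverse


# Multi-SF as printed in the MPICH run.
multi_sf = {
    0: (3, [(1, 1), (1, 0)]),
    1: (2, [(0, 1), (0, 0), (0, 2)]),
}

for rank, (nroots, leaves) in invert(multi_sf).items():
    print(f"[{rank}] Number of roots={nroots}, leaves={len(leaves)}")
    for j, (rrank, rindex) in leaves.items():
        print(f"[{rank}] {j} <- ({rrank},{rindex})")
```

With `deliver=True` the sketch reproduces the inverse listed in the MPICH output; with `deliver=False` every rank ends with zero leaves, the same shape as the OpenMPI output. That is consistent with the reduction never landing at the roots, but it does not show where the fault lies.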
> >> On Mon, Sep 10, 2012 at 10:47 PM, fdkong <fd.kong at foxmail.com> wrote:
> >>
> >>>
> >>> >> Hi all,
> >>> >>
> >>> >> The function PetscSFReduceBegin runs well on MPICH, but does not work
> >>> >> on openmpi-1.4.3 with 64-bit integers. Does anyone know why?
> >>> >>
> >>>
> >>> >1) What error are you seeing? There are no errors in the build.
> >>>
> >>> Yes, there are no errors in the build or configure. But when I ran my
> >>> code, which involves the function PetscSFReduceBegin, on a supercomputer,
> >>> I got the error below:
> >>>
> >>
> >> Can you run src/sys/sf/examples/tutorials/ex1? There are several tests in
> >> the makefile there. I suspect
> >> that your graph is not correctly specified.
> >>
> >> Matt
> >>
> >>> [0]PETSC ERROR:
> >>> ------------------------------------------------------------------------
> >>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> >>> probably memory access out of range
> >>> [0]PETSC ERROR: Try option -start_in_debugger or
> >>> -on_error_attach_debugger
> >>> [0]PETSC ERROR: or see
> >>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> >>> [0]PETSC ERROR: or try
> >>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
> >>> corruption errors
> >>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> >>> and run
> >>> [0]PETSC ERROR: to get more information on the crash.
> >>> [0]PETSC ERROR: --------------------- Error Message
> >>> ------------------------------------
> >>> [0]PETSC ERROR: Signal received!
> >>> [0]PETSC ERROR:
> >>> ------------------------------------------------------------------------
> >>> [0]PETSC ERROR: Petsc Release Version 3.3.0, Patch 3, Wed Aug 29
> >>> 11:26:24 CDT 2012
> >>> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
> >>> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> >>> [0]PETSC ERROR: See docs/index.html for manual pages.
> >>> [0]PETSC ERROR:
> >>> ------------------------------------------------------------------------
> >>> [0]PETSC ERROR: ./linearElasticity on a arch-linu named node0353 by
> >>> fako9399 Mon Sep 10 16:50:42 2012
> >>> [0]PETSC ERROR: Libraries linked from
> >>> /projects/fako9399/petsc-3.3-p3/arch-linux264-cxx-opt/lib
> >>> [0]PETSC ERROR: Configure run at Mon Sep 10 13:58:46 2012
> >>> [0]PETSC ERROR: Configure options --known-level1-dcache-size=32768
> >>> --known-level1-dcache-linesize=32 --known-level1-dcache-assoc=0
> >>> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> >>> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> >>> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> >>> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=8
> >>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-clanguage=cxx
> >>> --with-shared-libraries=1 --with-dynamic-loading=1
> >>> --download-f-blas-lapack=1 --with-batch=1 --known-mpi-shared-libraries=0
> >>> --with-mpi-shared=1 --download-parmetis=1 --download-metis=1
> >>> --with-64-bit-indices=1
> >>> --with-netcdf-dir=/projects/fako9399/petsc-3.3-p3/externalpackage/netcdf-4.1.3install
> >>> --download-exodusii=1 --with-debugging=no --download-ptscotch=1
> >>> [0]PETSC ERROR:
> >>> ------------------------------------------------------------------------
> >>> [0]PETSC ERROR: User provided function() line 0 in unknown directory
> >>> unknown file
> >>>
> >>> --------------------------------------------------------------------------
> >>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >>> with errorcode 59.
> >>>
> >>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >>> You may or may not see output from other processes, depending on
> >>> exactly when Open MPI kills them.
> >>>
> >>> --------------------------------------------------------------------------
> >>>
> >>> --------------------------------------------------------------------------
> >>> mpirun has exited due to process rank 0 with PID 1517 on
> >>> node node0353 exiting without calling "finalize". This may
> >>> have caused other processes in the application to be
> >>> terminated by signals sent by mpirun (as reported here).
> >>>
> >>> --------------------------------------------------------------------------
> >>>
> >>> I did some debugging and found that the error came from the function
> >>> PetscSFReduceBegin.
> >>>
> >>> >2) Please do not send logs to petsc-users, send them to
> >>> >petsc-maint at mcs.anl.gov
> >>>
> >>> Ok, Thanks.
> >>>
> >>> > Matt
> >>>
> >>>
> >>> >> Maybe this link could help us guess why?
> >>> >> http://www.open-mpi.org/community/lists/devel/2005/11/0517.php
> >>> >>
> >>> >> I attached the configure.log and make.log files.
> >>> >> ------------------
> >>> >> Fande Kong
> >>> >> ShenZhen Institutes of Advanced Technology
> >>> >> Chinese Academy of Sciences
> >>> >>
> >>> >>
> >>>
> >>
> >>
> >>
> >> --
> >> What most experimenters take for granted before they begin their
> >> experiments is infinitely more interesting than any results to which their
> >> experiments lead.
> >> -- Norbert Wiener
> >>
> >
> >
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which their
> > experiments lead.
> > -- Norbert Wiener
> >
> 


