[petsc-users] PetscSFReduceBegin does not work correctly on openmpi-1.4.3 with 64-bit integers

Matthew Knepley knepley at gmail.com
Tue Sep 11 00:33:16 CDT 2012


On Tue, Sep 11, 2012 at 12:05 AM, fdkong <fd.kong at foxmail.com> wrote:

> Hi Matt,
>
> I tested src/sys/sf/examples/tutorials/ex1 on OpenMPI and MPICH
> separately. I found that the error comes from the function
> PetscSFReduceBegin, called by PetscSFCreateInverseSF. I used the
> script below:
>

Thanks for testing this. I will run it myself and track down the bug.

   Matt


> mpirun -n 2 ./ex1  -test_invert
>
> (1) On OpenMPI, got the result below:
>
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=3, leaves=2, remote ranks=1
>   [0] 0 <- (1,1)
>   [0] 1 <- (1,0)
>   [1] Number of roots=2, leaves=3, remote ranks=1
>   [1] 0 <- (0,1)
>   [1] 1 <- (0,0)
>   [1] 2 <- (0,2)
> ## Multi-SF
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=3, leaves=2, remote ranks=1
>   [0] 0 <- (1,1)
>   [0] 1 <- (1,0)
>   [1] Number of roots=2, leaves=3, remote ranks=1
>   [1] 0 <- (0,2)
>   [1] 1 <- (0,0)
>   [1] 2 <- (0,2)
> ## Inverse of Multi-SF
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=2, leaves=0, remote ranks=0
>   [1] Number of roots=3, leaves=0, remote ranks=0
>
> (2) On MPICH, got the result below:
>
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=3, leaves=2, remote ranks=1
>   [0] 0 <- (1,1)
>   [0] 1 <- (1,0)
>   [1] Number of roots=2, leaves=3, remote ranks=1
>   [1] 0 <- (0,1)
>   [1] 1 <- (0,0)
>   [1] 2 <- (0,2)
> ## Multi-SF
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=3, leaves=2, remote ranks=1
>   [0] 0 <- (1,1)
>   [0] 1 <- (1,0)
>   [1] Number of roots=2, leaves=3, remote ranks=1
>   [1] 0 <- (0,1)
>   [1] 1 <- (0,0)
>   [1] 2 <- (0,2)
> ## Inverse of Multi-SF
> Star Forest Object: 2 MPI processes
>   type not yet set
>   synchronization=FENCE sort=rank-order
>   [0] Number of roots=2, leaves=3, remote ranks=1
>   [0] 0 <- (1,1)
>   [0] 1 <- (1,0)
>   [0] 2 <- (1,2)
>   [1] Number of roots=3, leaves=2, remote ranks=1
>   [1] 0 <- (0,1)
>   [1] 1 <- (0,0)
>
> From the two results above, you can see that the inverse of the Multi-SF is
> incorrect on OpenMPI. Could you please do some debugging on OpenMPI (1.4.3)
> with 64-bit integers?
>
> In my code, I call DMComplexDistribute, which calls PetscSFCreateInverseSF,
> which in turn calls PetscSFReduceBegin. After a lot of debugging, I found
> that the error comes from PetscSFReduceBegin.
>
> On Mon, Sep 10, 2012 at 10:47 PM, fdkong <fd.kong at foxmail.com> wrote:
>
>>
>> >> Hi all,
>> >>
>> >> The function PetscSFReduceBegin runs well on MPICH, but does not work
>> >> on openmpi-1.4.3 with 64-bit integers. Does anyone know why?
>> >>
>>
>> >1) What error are you seeing? There are no errors in the build.
>>
>> Yes, there are no errors in the build or configure. But when I ran my
>> code, which involves the function PetscSFReduceBegin, on the supercomputer,
>> I got the error below:
>>
>
> Can you run src/sys/sf/examples/tutorials/ex1? There are several tests in
> the makefile there. I suspect
> that your graph is not correctly specified.
>
> Matt
>
>> [0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [0]PETSC ERROR: or try
>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>> corruption errors
>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>> and run
>> [0]PETSC ERROR: to get more information on the crash.
>> [0]PETSC ERROR: --------------------- Error Message
>> ------------------------------------
>> [0]PETSC ERROR: Signal received!
>> [0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Petsc Release Version 3.3.0, Patch 3, Wed Aug 29 11:26:24
>> CDT 2012
>> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
>> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>> [0]PETSC ERROR: See docs/index.html for manual pages.
>> [0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: ./linearElasticity on a arch-linu named node0353 by
>> fako9399 Mon Sep 10 16:50:42 2012
>> [0]PETSC ERROR: Libraries linked from
>> /projects/fako9399/petsc-3.3-p3/arch-linux264-cxx-opt/lib
>> [0]PETSC ERROR: Configure run at Mon Sep 10 13:58:46 2012
>> [0]PETSC ERROR: Configure options --known-level1-dcache-size=32768
>> --known-level1-dcache-linesize=32 --known-level1-dcache-assoc=0
>> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
>> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
>> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
>> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=8
>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-clanguage=cxx
>> --with-shared-libraries=1 --with-dynamic-loading=1
>> --download-f-blas-lapack=1 --with-batch=1 --known-mpi-shared-libraries=0
>> --with-mpi-shared=1 --download-parmetis=1 --download-metis=1
>> --with-64-bit-indices=1
>> --with-netcdf-dir=/projects/fako9399/petsc-3.3-p3/externalpackage/netcdf-4.1.3install
>> --download-exodusii=1 --with-debugging=no --download-ptscotch=1
>> [0]PETSC ERROR:
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: User provided function() line 0 in unknown directory
>> unknown file
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 59.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 1517 on
>> node node0353 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>
>> I did some debugging and found that the error comes from the function
>> PetscSFReduceBegin.
>>
>> >2) Please do not send logs to petsc-users, send them to
>> >petsc-maint at mcs.anl.gov
>>
>> OK, thanks.
>>
>> > Matt
>>
>>
>> >> Maybe this link could help us guess why?
>> >> http://www.open-mpi.org/community/lists/devel/2005/11/0517.php
>> >>
>> >> I attached the configure.log and make.log files.
>> >> ------------------
>> >> Fande Kong
>> >> ShenZhen Institutes of Advanced Technology
>> >> Chinese Academy of Sciences
>> >>
>> >>
>>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

