[petsc-users] PetscSFReduceBegin does not work correctly on openmpi-1.4.3 with 64-bit integers

Jed Brown jedbrown at mcs.anl.gov
Tue Sep 11 12:12:23 CDT 2012


Open MPI one-sided operations with datatypes still have known bugs. They
have had bug reports with reduced test cases for several years now. They
need to fix those bugs. Please let them know that you are also waiting...

To work around that, and for other reasons, I will write a new SF
implementation using point-to-point.
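
In the meantime, a quick way to see the difference is to compare against the
point-to-point equivalent of the failing operation. A minimal sketch (the
two-rank layout and values here are made up for illustration; this is not the
actual PetscSF internals):

  /* One rank sends a pair of 64-bit ints; the other receives and overwrites
   * its local pair. This is the effect of MPI_Accumulate with MPI_REPLACE,
   * done with point-to-point calls instead of one-sided ones. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    long data[2] = {0, 0};
    int  rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {          /* "leaf" owner: send its contribution */
      long update[2] = {42, 7};
      MPI_Send(update, 2, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {   /* "root" owner: receive and replace */
      MPI_Recv(data, 2, MPI_LONG, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("root pair = (%ld,%ld)\n", data[0], data[1]);
    }
    MPI_Finalize();
    return 0;
  }
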
On Sep 11, 2012 12:44 PM, "fdkong" <fd.kong at foxmail.com> wrote:

> Hi Matt,
>
> Thanks. I guess there are two reasons:
>
> (1) The MPI function MPI_Accumulate with the operation MPI_REPLACE is not
> supported in the OpenMPI 1.4.3 implementation, or in other OpenMPI versions.
>
> (2) The MPI function does not accept the datatype MPIU_2INT when we use
> 64-bit integers. But when we run on MPICH, it works well!
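>
> To test those two guesses outside of PETSc, a standalone reproducer of
> exactly that pattern would look something like the sketch below (assuming
> MPIU_2INT reduces to a committed pair of 64-bit integers when PETSc is
> configured --with-64-bit-indices; the layout is only illustrative):
>
>   /* Rank 1 replaces rank 0's pair through a window using MPI_Accumulate
>    * with MPI_REPLACE on a committed 2 x long datatype. If this crashes or
>    * corrupts data under Open MPI 1.4.3 but works under MPICH, the bug is
>    * in the MPI library, not in PETSc. */
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>     MPI_Datatype two_long;
>     MPI_Win      win;
>     long         root[2] = {0, 0};
>     int          rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Type_contiguous(2, MPI_LONG, &two_long);
>     MPI_Type_commit(&two_long);
>
>     MPI_Win_create(root, 2 * sizeof(long), sizeof(long), MPI_INFO_NULL,
>                    MPI_COMM_WORLD, &win);
>     MPI_Win_fence(0, win);
>     if (rank == 1) {
>       long leaf[2] = {42, 7};
>       MPI_Accumulate(leaf, 1, two_long, 0, 0, 1, two_long, MPI_REPLACE, win);
>     }
>     MPI_Win_fence(0, win);
>     if (rank == 0) printf("root = (%ld,%ld)\n", root[0], root[1]);
>
>     MPI_Win_free(&win);
>     MPI_Type_free(&two_long);
>     MPI_Finalize();
>     return 0;
>   }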
>
> ------------------
> Fande Kong
> ShenZhen Institutes of Advanced Technology
> Chinese Academy of Sciences
>
> ------------------ Original ------------------
> *From: * "knepley"<knepley at gmail.com>;
> *Date: * Tue, Sep 11, 2012 01:33 PM
> *To: * "fdkong"<fd.kong at foxmail.com>;
> *Cc: * "petsc-users"<petsc-users at mcs.anl.gov>;
> *Subject: * Re: PetscSFReduceBegin does not work correctly on
> openmpi-1.4.3 with 64-bit integers
>
> On Tue, Sep 11, 2012 at 12:05 AM, fdkong <fd.kong at foxmail.com> wrote:
>
>> Hi Matt,
>>
>> I tested src/sys/sf/examples/tutorials/ex1 on OpenMPI and MPICH
>> separately. I found the error comes from the function
>> PetscSFReduceBegin called by PetscSFCreateInverseSF. I used the command
>> below:
>>
>
> Thanks for testing this. I will run it myself and track down the bug.
>
> Matt
>
>> mpirun -n 2 ./ex1 -test_invert
>>
>> (1) On OpenMPI, got the result below:
>>
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=3, leaves=2, remote ranks=1
>> [0] 0 <- (1,1)
>> [0] 1 <- (1,0)
>> [1] Number of roots=2, leaves=3, remote ranks=1
>> [1] 0 <- (0,1)
>> [1] 1 <- (0,0)
>> [1] 2 <- (0,2)
>> ## Multi-SF
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=3, leaves=2, remote ranks=1
>> [0] 0 <- (1,1)
>> [0] 1 <- (1,0)
>> [1] Number of roots=2, leaves=3, remote ranks=1
>> [1] 0 <- (0,2)
>> [1] 1 <- (0,0)
>> [1] 2 <- (0,2)
>> ## Inverse of Multi-SF
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=2, leaves=0, remote ranks=0
>> [1] Number of roots=3, leaves=0, remote ranks=0
>>
>> (2) On MPICH, got the result below:
>>
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=3, leaves=2, remote ranks=1
>> [0] 0 <- (1,1)
>> [0] 1 <- (1,0)
>> [1] Number of roots=2, leaves=3, remote ranks=1
>> [1] 0 <- (0,1)
>> [1] 1 <- (0,0)
>> [1] 2 <- (0,2)
>> ## Multi-SF
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=3, leaves=2, remote ranks=1
>> [0] 0 <- (1,1)
>> [0] 1 <- (1,0)
>> [1] Number of roots=2, leaves=3, remote ranks=1
>> [1] 0 <- (0,1)
>> [1] 1 <- (0,0)
>> [1] 2 <- (0,2)
>> ## Inverse of Multi-SF
>> Star Forest Object: 2 MPI processes
>> type not yet set
>> synchronization=FENCE sort=rank-order
>> [0] Number of roots=2, leaves=3, remote ranks=1
>> [0] 0 <- (1,1)
>> [0] 1 <- (1,0)
>> [0] 2 <- (1,2)
>> [1] Number of roots=3, leaves=2, remote ranks=1
>> [1] 0 <- (0,1)
>> [1] 1 <- (0,0)
>>
>> From the two results above, you can see that the inverse of the Multi-SF is
>> incorrect on OpenMPI. Could you please do some debugging on OpenMPI (1.4.3)
>> with 64-bit integers?
>>
>> In my code, I call DMComplexDistribute, which calls PetscSFCreateInverseSF,
>> which calls PetscSFReduceBegin. I have done a lot of debugging, and found
>> that the error comes from PetscSFReduceBegin.
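>>
>> If it helps, even a minimal two-rank program that drives PetscSFReduceBegin
>> directly exercises the same one-sided machinery in isolation. A sketch
>> against the petsc-3.3 PetscSF API (this is a toy graph of my own, not the
>> DMComplexDistribute one):
>>
>>   #include <petscsf.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>     PetscSF        sf;
>>     PetscSFNode    remote;
>>     PetscInt       leafdata[1], rootdata[2] = {0, 0};
>>     PetscMPIInt    rank;
>>     PetscErrorCode ierr;
>>
>>     ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
>>     ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
>>     /* run with exactly 2 ranks: each rank owns 2 roots and has one leaf
>>        that points at root 0 on the other rank */
>>     remote.rank  = 1 - rank;
>>     remote.index = 0;
>>     ierr = PetscSFCreate(PETSC_COMM_WORLD, &sf);CHKERRQ(ierr);
>>     ierr = PetscSFSetGraph(sf, 2, 1, PETSC_NULL, PETSC_COPY_VALUES,
>>                            &remote, PETSC_COPY_VALUES);CHKERRQ(ierr);
>>     leafdata[0] = 100 + rank;
>>     ierr = PetscSFReduceBegin(sf, MPIU_INT, leafdata, rootdata, MPI_SUM);CHKERRQ(ierr);
>>     ierr = PetscSFReduceEnd(sf, MPIU_INT, leafdata, rootdata, MPI_SUM);CHKERRQ(ierr);
>>     ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] root 0 = %D\n",
>>                                    rank, rootdata[0]);CHKERRQ(ierr);
>>     ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>     ierr = PetscSFDestroy(&sf);CHKERRQ(ierr);
>>     ierr = PetscFinalize();
>>     return 0;
>>   }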
>>
>> On Mon, Sep 10, 2012 at 10:47 PM, fdkong <fd.kong at foxmail.com> wrote:
>>
>>>
>>> >> Hi all,
>>> >>
>>> >> The function PetscSFReduceBegin runs well on MPICH, but does not work
>>> >> on openmpi-1.4.3 with 64-bit integers. Does anyone know why?
>>> >>
>>>
>>> >1) What error are you seeing? There are no errors in the build.
>>>
>>> Yes, there are no errors in the build or configure. But when I ran my
>>> code, which involves the function PetscSFReduceBegin, on the supercomputer,
>>> I got the error below:
>>>
>>
>> Can you run src/sys/sf/examples/tutorials/ex1? There are several tests in
>> the makefile there. I suspect
>> that your graph is not correctly specified.
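>>
>> (If the targets follow the usual PETSc makefile convention, something like
>> the following should run them; the exact target names there may differ:)
>>
>>   cd src/sys/sf/examples/tutorials
>>   make ex1
>>   make runex1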
>>
>> Matt
>>
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>> [0]PETSC ERROR: or see
>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [0]PETSC ERROR: or try
>>> http://valgrind.org on GNU/linux and Apple Mac OS X to find memory
>>> corruption errors
>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>>> and run
>>> [0]PETSC ERROR: to get more information on the crash.
>>> [0]PETSC ERROR: --------------------- Error Message
>>> ------------------------------------
>>> [0]PETSC ERROR: Signal received!
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Petsc Release Version 3.3.0, Patch 3, Wed Aug 29
>>> 11:26:24 CDT 2012
>>> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
>>> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
>>> [0]PETSC ERROR: See docs/index.html for manual pages.
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: ./linearElasticity on a arch-linu named node0353 by
>>> fako9399 Mon Sep 10 16:50:42 2012
>>> [0]PETSC ERROR: Libraries linked from
>>> /projects/fako9399/petsc-3.3-p3/arch-linux264-cxx-opt/lib
>>> [0]PETSC ERROR: Configure run at Mon Sep 10 13:58:46 2012
>>> [0]PETSC ERROR: Configure options --known-level1-dcache-size=32768
>>> --known-level1-dcache-linesize=32 --known-level1-dcache-assoc=0
>>> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
>>> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
>>> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
>>> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=8
>>> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-clanguage=cxx
>>> --with-shared-libraries=1 --with-dynamic-loading=1
>>> --download-f-blas-lapack=1 --with-batch=1 --known-mpi-shared-libraries=0
>>> --with-mpi-shared=1 --download-parmetis=1 --download-metis=1
>>> --with-64-bit-indices=1
>>> --with-netcdf-dir=/projects/fako9399/petsc-3.3-p3/externalpackage/netcdf-4.1.3install
>>> --download-exodusii=1 --with-debugging=no --download-ptscotch=1
>>> [0]PETSC ERROR:
>>> ------------------------------------------------------------------------
>>> [0]PETSC ERROR: User provided function() line 0 in unknown directory
>>> unknown file
>>>
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun has exited due to process rank 0 with PID 1517 on
>>> node node0353 exiting without calling "finalize". This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>>
>>> --------------------------------------------------------------------------
>>>
>>> I did some debugging, and found that the error came from the function
>>> PetscSFReduceBegin.
>>>
>>> >2) Please do not send logs to petsc-users, send them to
>>> >petsc-maint at mcs.anl.gov
>>>
>>> OK, thanks.
>>>
>>> > Matt
>>>
>>>
>>> >> Maybe this link could help us guess why?
>>> >> http://www.open-mpi.org/community/lists/devel/2005/11/0517.php
>>> >>
>>> >> I attached the configure.log and make.log files.
>>> >> ------------------
>>> >> Fande Kong
>>> >> ShenZhen Institutes of Advanced Technology
>>> >> Chinese Academy of Sciences
>>> >>
>>> >>
>>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>

