[petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4

Satish Balay balay at mcs.anl.gov
Wed Apr 29 23:50:40 CDT 2015


OpenMPI developers,

We've had issues (memory errors) with OpenMPI - and code in PETSc
library that uses MPI_Win_fence().

Vagrind shows memory corruption deep inside OpenMPI function stack.

I'm attaching a potential patch that appears to fix this issue for us.
[the corresponding valgrind trace is listed in the patch header]

Perhaps there is a more appropriate fix for this memory corruption. Could
you check on this?

[Sorry I don't have a pure MPI test code to demonstrate this error -
but a PETSc test example code consistantly reproduces this issue]

Thanks,
Satish
-------------- next part --------------
commit ffdd25d6f4beef42a50d34f70bfe75bde077370d
Author: Satish Balay <balay at mcs.anl.gov>
Date:   Wed Apr 29 22:33:06 2015 -0500

    openmpi: potential bugfix for  PETSc sf example
    
    balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
    $ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2
    -sf_type window
    PetscSF Object: 2 MPI processes
      type: window
        synchronization=FENCE sort=rank-order
      [0] Number of roots=1, leaves=2, remote ranks=2
      [0] 0 <- (0,0)
      [0] 1 <- (1,0)
      [1] Number of roots=1, leaves=2, remote ranks=2
      [1] 0 <- (1,0)
      [1] 1 <- (0,0)
    ==14815== Invalid write of size 2
    ==14815==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14815==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
    ==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
    ==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
    ==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
    ==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
    ==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
    ==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
    ==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
    ==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
    ==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
    ==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
    ==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
    ==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
    ==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
    ==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
    ==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
    ==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
    ==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
    ==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
    ==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
    ==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
    ==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
    ==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
    ==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
    ==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
    ==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
    ==14815==    by 0x401DD3: main (ex2.c:81)
    ==14815==  Address 0x101c3b98 is 0 bytes after a block of size 72 alloc'd
    ==14815==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==14815==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
    ==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
    ==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
    ==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
    ==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
    ==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
    ==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
    ==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
    ==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
    ==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
    ==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
    ==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
    ==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
    ==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
    ==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
    ==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
    ==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
    ==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
    ==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
    ==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
    ==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
    ==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
    ==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
    ==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
    ==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
    ==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
    ==14815==    by 0x401DD3: main (ex2.c:81)

diff --git a/ompi/datatype/ompi_datatype_args.c b/ompi/datatype/ompi_datatype_args.c
index eb14965..e9fe937 100644
--- a/ompi/datatype/ompi_datatype_args.c
+++ b/ompi/datatype/ompi_datatype_args.c
@@ -715,7 +715,7 @@ static ompi_datatype_t* __ompi_datatype_create_from_args( int32_t* i, MPI_Aint*
         ompi_datatype_create_indexed_block( i[0], i[1], &(i[2]), d[0], &datatype );
         {
             const int* a_i[3] = {&i[0], &i[1], &i[2]};
-            ompi_datatype_set_args( datatype, 2 * i[0], a_i, 0, NULL, 1, d, MPI_COMBINER_INDEXED_BLOCK );
+            ompi_datatype_set_args( datatype, 2 * i[0] + 1, a_i, 0, NULL, 1, d, MPI_COMBINER_INDEXED_BLOCK );
         }
         break;
         /******************************************************************/


More information about the petsc-dev mailing list