[petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4
Satish Balay
balay at mcs.anl.gov
Wed Apr 29 23:50:40 CDT 2015
OpenMPI developers,
We've been seeing memory errors with OpenMPI (1.8.4) in PETSc library
code that uses MPI_Win_fence().
Valgrind shows memory corruption deep inside the OpenMPI function stack.
I'm attaching a potential patch that appears to fix this issue for us.
[the corresponding valgrind trace is listed in the patch header]
Perhaps there is a more appropriate fix for this memory corruption. Could
you check on this?
[Sorry, I don't have a pure MPI test code to demonstrate this error,
but a PETSc test example code consistently reproduces this issue.]
Thanks,
Satish
-------------- next part --------------
commit ffdd25d6f4beef42a50d34f70bfe75bde077370d
Author: Satish Balay <balay at mcs.anl.gov>
Date: Wed Apr 29 22:33:06 2015 -0500
openmpi: potential bugfix for PETSc sf example
balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
$ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2 -sf_type window
PetscSF Object: 2 MPI processes
type: window
synchronization=FENCE sort=rank-order
[0] Number of roots=1, leaves=2, remote ranks=2
[0] 0 <- (0,0)
[0] 1 <- (1,0)
[1] Number of roots=1, leaves=2, remote ranks=2
[1] 0 <- (1,0)
[1] 1 <- (0,0)
==14815== Invalid write of size 2
==14815== at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815== by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
==14815== by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815== by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815== by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815== by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815== by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815== by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815== by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815== by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815== by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815== by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815== by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815== by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815== by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
==14815== by 0x9A9E9F2: opal_progress (opal_progress.c:187)
==14815== by 0xECCA70A: opal_condition_wait (condition.h:78)
==14815== by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
==14815== by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
==14815== by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
==14815== by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
==14815== by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
==14815== by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
==14815== by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
==14815== by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
==14815== by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
==14815== by 0x51303D6: PetscSFBcastEnd (sf.c:957)
==14815== by 0x401DD3: main (ex2.c:81)
==14815== Address 0x101c3b98 is 0 bytes after a block of size 72 alloc'd
==14815== at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815== by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
==14815== by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815== by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815== by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815== by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815== by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815== by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815== by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815== by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815== by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815== by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815== by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815== by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815== by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
==14815== by 0x9A9E9F2: opal_progress (opal_progress.c:187)
==14815== by 0xECCA70A: opal_condition_wait (condition.h:78)
==14815== by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
==14815== by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
==14815== by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
==14815== by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
==14815== by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
==14815== by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
==14815== by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
==14815== by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
==14815== by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
==14815== by 0x51303D6: PetscSFBcastEnd (sf.c:957)
==14815== by 0x401DD3: main (ex2.c:81)
diff --git a/ompi/datatype/ompi_datatype_args.c b/ompi/datatype/ompi_datatype_args.c
index eb14965..e9fe937 100644
--- a/ompi/datatype/ompi_datatype_args.c
+++ b/ompi/datatype/ompi_datatype_args.c
@@ -715,7 +715,7 @@ static ompi_datatype_t* __ompi_datatype_create_from_args( int32_t* i, MPI_Aint*
ompi_datatype_create_indexed_block( i[0], i[1], &(i[2]), d[0], &datatype );
{
const int* a_i[3] = {&i[0], &i[1], &i[2]};
- ompi_datatype_set_args( datatype, 2 * i[0], a_i, 0, NULL, 1, d, MPI_COMBINER_INDEXED_BLOCK );
+ ompi_datatype_set_args( datatype, 2 * i[0] + 1, a_i, 0, NULL, 1, d, MPI_COMBINER_INDEXED_BLOCK );
}
break;
/******************************************************************/