[petsc-dev] failed nightly

Barry Smith bsmith at mcs.anl.gov
Fri Apr 10 13:06:43 CDT 2015


  Disabling a test because the code it exercises is broken, just so the test suite shows no error, doesn't seem right to me. Shouldn't the window stuff be fixed or removed rather than left in as buggy code?

  barry

> On Apr 10, 2015, at 2:45 AM, Lawrence Mitchell <lawrence.mitchell at imperial.ac.uk> wrote:
> 
> (cc'ing petsc-dev as well)
> 
>> On 10 Apr 2015, at 00:51, Satish Balay <balay at mcs.anl.gov> wrote:
>> 
>> It's likely a code bug somewhere. The MPICH build also gives a valgrind trace.
> 
> It's probable that the OMPI implementation is buggy enough that it doesn't work at all.  I'm a little confused by the MPICH issue.  I don't understand enough about the datatype handling in the window SF type to know whether this is a PETSc issue or an MPICH one.  I note in passing that all the ex1 tests exhibit a similar valgrind trace.
> 
> For ex2 at least, maybe the simplest option is to turn off the window test entirely.  Like this:
> 
> diff --git a/src/vec/is/sf/examples/tutorials/makefile b/src/vec/is/sf/examples/tutorials/makefile
> index aeaf1e4..e7774c5 100644
> --- a/src/vec/is/sf/examples/tutorials/makefile
> +++ b/src/vec/is/sf/examples/tutorials/makefile
> @@ -86,7 +86,7 @@ runex2_window:
>           ${RM} -f ex2.tmp
> 
> TESTEXAMPLES_C             = ex1.PETSc runex1_basic runex1_2_basic runex1_3_basic runex1 \
> -                                ex2.PETSc runex2_basic runex2_window ex2.rm
> +                                ex2.PETSc runex2_basic ex2.rm
> TESTEXAMPLES_C_X           =
> TESTEXAMPLES_FORTRAN       =
> TESTEXAMPLES_FORTRAN_MPIUNI =
> 
> Lawrence
> 
>> Satish
>> 
>> ----------
>> 
>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>> $ mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2 -sf_type window
>> PetscSF Object: 2 MPI processes
>> type: window
>>   synchronization=FENCE sort=rank-order
>> [0] Number of roots=1, leaves=2, remote ranks=2
>> [0] 0 <- (0,0)
>> [0] 1 <- (1,0)
>> [1] Number of roots=1, leaves=2, remote ranks=2
>> [1] 0 <- (1,0)
>> [1] 1 <- (0,0)
>> ==29265== Syscall param writev(vector[...]) points to uninitialised byte(s)
>> ==29265==    at 0x8F474E7: writev (in /usr/lib64/libc-2.20.so)
>> ==29265==    by 0x894AD87: MPL_large_writev (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x8941A48: MPIDU_Sock_writev (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x892AC7D: MPIDI_CH3_iStartMsgv (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x8911D08: recv_rma_msg (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x8913D46: MPIDI_Win_fence (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x88C91EB: PMPI_Win_fence (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x50FB025: PetscSFRestoreWindow (sfwindow.c:348)
>> ==29265==    by 0x50FD4BF: PetscSFBcastEnd_Window (sfwindow.c:510)
>> ==29265==    by 0x5123CD9: PetscSFBcastEnd (sf.c:957)
>> ==29265==    by 0x401CAF: main (ex2.c:81)
>> ==29265==  Address 0x99436ec is 108 bytes inside a block of size 208 alloc'd
>> ==29265==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==29265==    by 0x890DA35: MPIDI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x88C406A: PMPI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x50FD0DA: PetscSFBcastBegin_Window (sfwindow.c:495)
>> ==29265==    by 0x51235B5: PetscSFBcastBegin (sf.c:924)
>> ==29265==    by 0x401BD3: main (ex2.c:79)
>> ==29265==  Uninitialised value was created by a heap allocation
>> ==29265==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==29265==    by 0x890DA35: MPIDI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x88C406A: PMPI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29265==    by 0x50FD0DA: PetscSFBcastBegin_Window (sfwindow.c:495)
>> ==29265==    by 0x51235B5: PetscSFBcastBegin (sf.c:924)
>> ==29265==    by 0x401BD3: main (ex2.c:79)
>> ==29265==
>> ==29266== Syscall param writev(vector[...]) points to uninitialised byte(s)
>> ==29266==    at 0x8F474E7: writev (in /usr/lib64/libc-2.20.so)
>> ==29266==    by 0x894AD87: MPL_large_writev (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x8941A48: MPIDU_Sock_writev (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x892AC7D: MPIDI_CH3_iStartMsgv (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x8911D08: recv_rma_msg (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x8913D46: MPIDI_Win_fence (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x88C91EB: PMPI_Win_fence (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x50FB025: PetscSFRestoreWindow (sfwindow.c:348)
>> ==29266==    by 0x50FD4BF: PetscSFBcastEnd_Window (sfwindow.c:510)
>> ==29266==    by 0x5123CD9: PetscSFBcastEnd (sf.c:957)
>> ==29266==    by 0x401CAF: main (ex2.c:81)
>> ==29266==  Address 0x98d16dc is 108 bytes inside a block of size 208 alloc'd
>> ==29266==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==29266==    by 0x890DA35: MPIDI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x88C406A: PMPI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x50FD0DA: PetscSFBcastBegin_Window (sfwindow.c:495)
>> ==29266==    by 0x51235B5: PetscSFBcastBegin (sf.c:924)
>> ==29266==    by 0x401BD3: main (ex2.c:79)
>> ==29266==  Uninitialised value was created by a heap allocation
>> ==29266==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>> ==29266==    by 0x890DA35: MPIDI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x88C406A: PMPI_Get (in /home/balay/soft/mpich-3.1.3/lib/libmpi.so.12.0.4)
>> ==29266==    by 0x50FD0DA: PetscSFBcastBegin_Window (sfwindow.c:495)
>> ==29266==    by 0x51235B5: PetscSFBcastBegin (sf.c:924)
>> ==29266==    by 0x401BD3: main (ex2.c:79)
>> ==29266==
>> Vec Object: 2 MPI processes
>> type: mpi
>> Process [0]
>> 0
>> 1
>> Process [1]
>> 1
>> 0
>> Vec Object: 2 MPI processes
>> type: mpi
>> Process [0]
>> 10
>> 11
>> Process [1]
>> 11
>> 10
>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>> $
>> 
>> On Thu, 9 Apr 2015, Satish Balay wrote:
>> 
>>> here is a better valgrind trace..
>>> 
>>> satish
>>> 
>>> --------
>>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>>> $ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2 -sf_type window
>>> PetscSF Object: 2 MPI processes
>>> type: window
>>>   synchronization=FENCE sort=rank-order
>>> [0] Number of roots=1, leaves=2, remote ranks=2
>>> [0] 0 <- (0,0)
>>> [0] 1 <- (1,0)
>>> [1] Number of roots=1, leaves=2, remote ranks=2
>>> [1] 0 <- (1,0)
>>> [1] 1 <- (0,0)
>>> ==14815== Invalid write of size 2
>>> ==14815==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>> ==14815==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
>>> ==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>> ==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>> ==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>> ==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>> ==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>> ==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>> ==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>> ==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>> ==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>> ==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>> ==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
>>> ==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
>>> ==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
>>> ==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
>>> ==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
>>> ==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
>>> ==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
>>> ==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
>>> ==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
>>> ==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
>>> ==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
>>> ==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
>>> ==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
>>> ==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
>>> ==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
>>> ==14815==    by 0x401DD3: main (ex2.c:81)
>>> ==14815==  Address 0x101c3b98 is 0 bytes after a block of size 72 alloc'd
>>> ==14815==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>> ==14815==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
>>> ==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>> ==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>> ==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>> ==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>> ==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>> ==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>> ==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>> ==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>> ==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>> ==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>> ==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
>>> ==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
>>> ==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
>>> ==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
>>> ==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
>>> ==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
>>> ==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
>>> ==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
>>> ==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
>>> ==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
>>> ==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
>>> ==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
>>> ==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
>>> ==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
>>> ==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
>>> ==14815==    by 0x401DD3: main (ex2.c:81)
>>> ==14815==
>>> ==14816== Invalid write of size 2
>>> ==14816==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>> ==14816==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
>>> ==14816==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>> ==14816==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>> ==14816==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>> ==14816==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>> ==14816==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>> ==14816==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>> ==14816==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>> ==14816==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>> ==14816==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>> ==14816==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>> ==14816==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
>>> ==14816==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
>>> ==14816==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
>>> ==14816==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
>>> ==14816==    by 0xECCA70A: opal_condition_wait (condition.h:78)
>>> ==14816==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
>>> ==14816==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
>>> ==14816==    by 0xFD8D951: ompi_coll_tuned_scatter_intra_basic_linear (coll_tuned_scatter.c:231)
>>> ==14816==    by 0xFD7A66D: ompi_coll_tuned_scatter_intra_dec_fixed (coll_tuned_decision_fixed.c:769)
>>> ==14816==    by 0xF0F3BDB: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:102)
>>> ==14816==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
>>> ==14816==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
>>> ==14816==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
>>> ==14816==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
>>> ==14816==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
>>> ==14816==    by 0x401DD3: main (ex2.c:81)
>>> ==14816==  Address 0x101bb398 is 0 bytes after a block of size 72 alloc'd
>>> ==14816==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>> ==14816==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
>>> ==14816==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>> ==14816==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>> ==14816==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>> ==14816==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>> ==14816==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>> ==14816==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>> ==14816==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>> ==14816==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>> ==14816==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>> ==14816==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>> ==14816==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
>>> ==14816==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
>>> ==14816==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
>>> ==14816==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
>>> ==14816==    by 0xECCA70A: opal_condition_wait (condition.h:78)
>>> ==14816==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
>>> ==14816==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
>>> ==14816==    by 0xFD8D951: ompi_coll_tuned_scatter_intra_basic_linear (coll_tuned_scatter.c:231)
>>> ==14816==    by 0xFD7A66D: ompi_coll_tuned_scatter_intra_dec_fixed (coll_tuned_decision_fixed.c:769)
>>> ==14816==    by 0xF0F3BDB: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:102)
>>> ==14816==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
>>> ==14816==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
>>> ==14816==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
>>> ==14816==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
>>> ==14816==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
>>> ==14816==    by 0x401DD3: main (ex2.c:81)
>>> ==14816==
>>> Vec Object: 2 MPI processes
>>> type: mpi
>>> Process [0]
>>> 0
>>> 1
>>> Process [1]
>>> 1
>>> 0
>>> Vec Object: 2 MPI processes
>>> type: mpi
>>> Process [0]
>>> 10
>>> 11
>>> Process [1]
>>> 11
>>> 10
>>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>>> $
>>> 
>>> 
>>> On Thu, 9 Apr 2015, Barry Smith wrote:
>>> 
>>>> 
>>>> Satish,
>>>> 
>>>> Why are you telling me :-). Tell the person who's been pushing this stuff into PETSc and he can debug it.
>>>> 
>>>> Barry
>>>> 
>>>> This is why "my part" of PETSc only uses MPI 1.1 :-)
>>>> 
>>>> 
>>>> 
>>>>> On Apr 9, 2015, at 5:48 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, 9 Apr 2015, Barry Smith wrote:
>>>>> 
>>>>>> 
>>>>>> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2015/04/08/examples_master_arch-linux-pkgs-opt_crank.log
>>>>> 
>>>>> 
>>>>> The following test is hanging - perhaps --download-openmpi is the trigger.
>>>>> 
>>>>> 
>>>>> petsc    14547  0.0  0.0  12312  1220 ?        S    13:56   0:00 /bin/sh -c /sandbox/petsc/petsc.clone/arch-linux-pkgs-opt/bin/mpiexec -n 2 ./ex2 -sf_type window > ex2.tmp 2>&1; \
>>>>>          /usr/bin/diff -w output/ex2_window.out ex2.tmp || printf "/sandbox/petsc/petsc.clone/src/vec/is/sf/examples/tutorials\nPossible problem with ex2_window, diffs above\n=========================================\n"; \
>>>>>          /bin/rm -f -f ex2.tmp
>>>>> 
>>>>> 
>>>>> 
>>>>> I can reproduce on my laptop [with the following trace].
>>>>> 
>>>>> Satish
>>>>> 
>>>>> ---------
>>>>> 
>>>>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>>>>> $ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 ./ex2 -sf_type window
>>>>> PetscSF Object: 2 MPI processes
>>>>> type: window
>>>>>  synchronization=FENCE sort=rank-order
>>>>> [0] Number of roots=1, leaves=2, remote ranks=2
>>>>> [0] 0 <- (0,0)
>>>>> [0] 1 <- (1,0)
>>>>> [1] Number of roots=1, leaves=2, remote ranks=2
>>>>> [1] 0 <- (1,0)
>>>>> [1] 1 <- (0,0)
>>>>> *** Error in `./ex2': free(): invalid next size (fast): 0x0000000002395ed0 ***
>>>>> [asterix:14290] *** Process received signal ***
>>>>> [asterix:14290] Signal: Aborted (6)
>>>>> [asterix:14290] Signal code:  (-6)
>>>>> ======= Backtrace: =========
>>>>> /lib64/libc.so.6(+0x77d9e)[0x[asterix:14290] [ 0] /lib64/libpthread.so.0(+0x100d0)[0x7f331fac10d0]
>>>>> [asterix:14290] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f331f7288d7]
>>>>> [asterix:14290] [ 2] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(ompi_datatype_release_args/lib64/libc.so.6(abort+0x16a)[0x7f331f72a53a]
>>>>> [asterix:14290] [ 3] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(/lib64/libc.so.6(+0x77da3)[0x7f331f76bda3]
>>>>> [asterix:14290] [ 4] +0x508e3)[0x7f9018c898e3]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x11773/lib64/libc.so.6(cfree+0x5b5)[0x7f331f7779f5]
>>>>> [asterix:14290] [ 5] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x12ece)[0x7f900f73aece]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x862a)[0x7f900ecdb62a]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x8a15/home/balay/petsc/arch-ompi/lib/libmpi.so.1(ompi_datatype_release_args+0x12b)[0x7f331ff33627]
>>>>> [asterix:14290] [ 6] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0xbac7)[0x7f900ecdeac7]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc0de)[0x7f900f7340de]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc4eb)[0x7f900f7344eb]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x2ed)[0x7f900f734f88]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x3876)[0x7f9014009876]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x4d83)[0x7f901400ad83]
>>>>> /home/balay/petsc/arch-ompi/lib/libopen-pal.so.6(opal_progress+0xa2)[0x7f9017cca9f3]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x770b)[0x7f900f72f70b]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x77f5)[0x7f900f72f7f5]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x1c6)[0x7f900f72ff6a]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_scatter_intra_basic_linear+0x76)[0x7f900e689952]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_scatter_intra_dec_fixed+0x112)[0x7f900e67666e]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_basic.so(mca_coll_basic_reduce_scatter_block_intra+0x188)[0x7f900f319bdc]
>>>>> /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_fence+0x125)[0x7f900ecdfc59]
>>>>> /home/balay/petsc/arch-ompi/lib/libmpi.so.1(MPI_Win_fence+0x116)[0x7f9018cd1079]
>>>>> /home/balay/petsc/arch-ompi/lib/libmpi.so.1(+0x508e3)[0x7f331ff348e3]
>>>>> [asterix:14290] [ 7] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x11773)[0x7f3316910773]
>>>>> [asterix:14290] [ 8] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x12ece)[0x7f3316911ece]
>>>>> [asterix:14290] [ 9] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x862a)[0x7f3315eb262a]
>>>>> [asterix:14290] [10] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x8a15)[0x7f3315eb2a15]
>>>>> [asterix:14290] [11] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0xbac7)[0x7f3315eb5ac7]
>>>>> [asterix:14290] [12] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc0de)[0x7f331690b0de]
>>>>> [asterix:14290] [13] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc4eb)[0x7f331690b4eb]
>>>>> [asterix:14290] [14] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x2ed)[0x7f331690bf88]
>>>>> [asterix:14290] [15] /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x3876)[0x7f3316f4a876]
>>>>> [asterix:14290] [16] /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x4d83)[0x7f3316f4bd83]
>>>>> [asterix:14290] [17] /home/balay/petsc/arch-ompi/lib/libopen-pal.so.6(opal_progress+0xa2)[0x7f331ef759f3]
>>>>> [asterix:14290] [18] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x770b)[0x7f331690670b]
>>>>> [asterix:14290] [19] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x77f5)[0x7f33169067f5]
>>>>> [asterix:14290] [20] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x1c6)[0x7f3316906f6a]
>>>>> [asterix:14290] [21] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_basic_linear+0x1cb)[0x7f331585c38e]
>>>>> [asterix:14290] [22] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x1a6)[0x7f331584cc27]
>>>>> [asterix:14290] [23] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_basic.so(mca_coll_basic_reduce_scatter_block_intra+0x13e)[0x7f33164f0b92]
>>>>> [asterix:14290] [24] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_fence+0x125)[0x7f3315eb6c59]
>>>>> [asterix:14290] [25] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(MPI_Win_fence+0x116)[0x7f331ff7c079]
>>>>> [asterix:14290] [26] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(+0x2d1d90)[0x7f3322855d90]
>>>>> [asterix:14290] [27] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(PetscSFBcastEnd_Window+0x218)[0x7f33228582db]
>>>>> [asterix:14290] [28] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(PetscSFBcastEnd+0x4eb)[0x7f332287f3d7]
>>>>> [asterix:14290] [29] ./ex2[0x401dd4]
>>>>> [asterix:14290] *** End of error message ***
>>>>> [asterix:14291] *** Process received signal ***
>>>>> [asterix:14291] Signal: Aborted (6)
>>>>> [asterix:14291] Signal code:  (-6)
>>>>> [asterix:14291] [ 0] /lib64/libpthread.so.0(+0x100d0)[0x7f90188160d0]
>>>>> [asterix:14291] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f901847d8d7]
>>>>> [asterix:14291] [ 2] /lib64/libc.so.6(abort+0x16a)[0x7f901847f53a]
>>>>> [asterix:14291] [ 3] /lib64/libc.so.6(+0x77da3)[0x7f90184c0da3]
>>>>> [asterix:14291] [ 4] /lib64/libc.so.6(cfree+0x5b5)[0x7f90184cc9f5]
>>>>> [asterix:14291] [ 5] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(ompi_datatype_release_args+0x12b)[0x7f9018c88627]
>>>>> [asterix:14291] [ 6] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(+0x508e3)[0x7f9018c898e3]
>>>>> [asterix:14291] [ 7] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x11773)[0x7f900f739773]
>>>>> [asterix:14291] [ 8] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x12ece)[0x7f900f73aece]
>>>>> [asterix:14291] [ 9] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x862a)[0x7f900ecdb62a]
>>>>> [asterix:14291] [10] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0x8a15)[0x7f900ecdba15]
>>>>> [asterix:14291] [11] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(+0xbac7)[0x7f900ecdeac7]
>>>>> [asterix:14291] [12] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc0de)[0x7f900f7340de]
>>>>> [asterix:14291] [13] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0xc4eb)[0x7f900f7344eb]
>>>>> [asterix:14291] [14] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x2ed)[0x7f900f734f88]
>>>>> [asterix:14291] [15] /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x3876)[0x7f9014009876]
>>>>> [asterix:14291] [16] /home/balay/petsc/arch-ompi/lib/openmpi/mca_btl_vader.so(+0x4d83)[0x7f901400ad83]
>>>>> [asterix:14291] [17] /home/balay/petsc/arch-ompi/lib/libopen-pal.so.6(opal_progress+0xa2)[0x7f9017cca9f3]
>>>>> [asterix:14291] [18] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x770b)[0x7f900f72f70b]
>>>>> [asterix:14291] [19] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(+0x77f5)[0x7f900f72f7f5]
>>>>> [asterix:14291] [20] /home/balay/petsc/arch-ompi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x1c6)[0x7f900f72ff6a]
>>>>> [asterix:14291] [21] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_scatter_intra_basic_linear+0x76)[0x7f900e689952]
>>>>> [asterix:14291] [22] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_scatter_intra_dec_fixed+0x112)[0x7f900e67666e]
>>>>> [asterix:14291] [23] /home/balay/petsc/arch-ompi/lib/openmpi/mca_coll_basic.so(mca_coll_basic_reduce_scatter_block_intra+0x188)[0x7f900f319bdc]
>>>>> [asterix:14291] [24] /home/balay/petsc/arch-ompi/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_fence+0x125)[0x7f900ecdfc59]
>>>>> [asterix:14291] [25] /home/balay/petsc/arch-ompi/lib/libmpi.so.1(MPI_Win_fence+0x116)[0x7f9018cd1079]
>>>>> [asterix:14291] [26] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(+0x2d1d90)[0x7f901b5aad90]
>>>>> [asterix:14291] [27] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(PetscSFBcastEnd_Window+0x218)[0x7f901b5ad2db]
>>>>> [asterix:14291] [28] /home/balay/petsc/arch-ompi/lib/libpetsc.so.3.05(PetscSFBcastEnd+0x4eb)[0x7f901b5d43d7]
>>>>> [asterix:14291] [29] ./ex2[0x401dd4]
>>>>> [asterix:14291] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that process rank 0 with PID 14290 on node asterix exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------
>>>>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>>>>> $ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q ./ex2 -sf_type window
>>>>> PetscSF Object: 2 MPI processes
>>>>> type: window
>>>>>  synchronization=FENCE sort=rank-order
>>>>> [0] Number of roots=1, leaves=2, remote ranks=2
>>>>> [0] 0 <- (0,0)
>>>>> [0] 1 <- (1,0)
>>>>> [1] Number of roots=1, leaves=2, remote ranks=2
>>>>> [1] 0 <- (1,0)
>>>>> [1] 1 <- (0,0)
>>>>> ==14349== Invalid write of size 2
>>>>> ==14349==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>> ==14349==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
>>>>> ==14349==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>>>> ==14349==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>>>> ==14349==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>>>> ==14349==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>>>> ==14349==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>>>> ==14349==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>>>> ==14349==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>>>> ==14349==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>>>> ==14349==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>>>> ==14349==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>>>> ==14349==  Address 0x101bf188 is 0 bytes after a block of size 72 alloc'd
>>>>> ==14349==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>> ==14349==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
>>>>> ==14349==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>>>> ==14349==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>>>> ==14349==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>>>> ==14349==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>>>> ==14349==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>>>> ==14349==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>>>> ==14349==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>>>> ==14349==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>>>> ==14349==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>>>> ==14349==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>>>> ==14349==
>>>>> ==14348== Invalid write of size 2
>>>>> ==14348==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>> ==14348==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
>>>>> ==14348==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>>>> ==14348==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>>>> ==14348==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>>>> ==14348==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>>>> ==14348==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>>>> ==14348==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>>>> ==14348==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>>>> ==14348==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>>>> ==14348==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>>>> ==14348==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>>>> ==14348==  Address 0x101c71b8 is 0 bytes after a block of size 72 alloc'd
>>>>> ==14348==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>>> ==14348==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
>>>>> ==14348==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
>>>>> ==14348==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
>>>>> ==14348==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
>>>>> ==14348==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
>>>>> ==14348==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
>>>>> ==14348==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
>>>>> ==14348==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
>>>>> ==14348==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
>>>>> ==14348==    by 0xECCF0DD: ompi_request_complete (request.h:402)
>>>>> ==14348==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
>>>>> ==14348==
>>>>> Vec Object: 2 MPI processes
>>>>> type: mpi
>>>>> Process [0]
>>>>> 0
>>>>> 1
>>>>> Process [1]
>>>>> 1
>>>>> 0
>>>>> Vec Object: 2 MPI processes
>>>>> type: mpi
>>>>> Process [0]
>>>>> 10
>>>>> 11
>>>>> Process [1]
>>>>> 11
>>>>> 10
>>>>> balay at asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
>>>>> $
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
