<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 13, 2021 at 11:49 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>
Fande,<br>
<br>
Look at <a href="https://scm.mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpid/ch3/channels/common/src/detect/arch/mv2_arch_detect.c" rel="noreferrer" target="_blank">https://scm.mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/trunk/src/mpid/ch3/channels/common/src/detect/arch/mv2_arch_detect.c</a><br>
<br>
cpubind_set = hwloc_bitmap_alloc();<br>
<br>
but I don't see a corresponding hwloc_bitmap_free(cpubind_set); in get_socket_bound_info().<br></blockquote><div><br></div><div>Thanks. I added hwloc_bitmap_free(cpubind_set) at the end of get_socket_bound_info(), and these valgrind messages disappeared.</div><div><br></div><div>I will ask the MVAPICH developers to fix this.</div><div><br></div><div>Thanks,</div><div><br></div><div>Fande</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<br>
<br>
Barry<br>
<br>
<br>
> <br>
<br>
> On Jan 13, 2021, at 12:32 PM, Fande Kong <<a href="mailto:fdkong.jd@gmail.com" target="_blank">fdkong.jd@gmail.com</a>> wrote:<br>
> <br>
> Hi All,<br>
> <br>
> I ran valgrind with mvapich-2.3.5 on a MOOSE simulation. We have a few non-deterministic parallel simulations in MOOSE, so I wanted to check whether we have any memory issues. I got some complaints involving PetscAllreduceBarrierCheck:<br>
> <br>
> Thanks,<br>
> <br>
> <br>
> Fande<br>
> <br>
> <br>
> <br>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 31 of 54<br>
> ==98001== at 0x4C29F73: malloc (vg_replace_malloc.c:307)<br>
> ==98001== by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)<br>
> ==98001== by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)<br>
> ==98001== by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)<br>
> ==98001== by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)<br>
> ==98001== by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)<br>
> ==98001== by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)<br>
> ==98001== by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)<br>
> ==98001== by 0x99F70BE: PetscSplitOwnership (psplit.c:84)<br>
> ==98001== by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)<br>
> ==98001== by 0xA08C66B: MatMPIAdjSetPreallocation_MPIAdj (mpiadj.c:630)<br>
> ==98001== by 0xA08EB9A: MatMPIAdjSetPreallocation (mpiadj.c:856)<br>
> ==98001== by 0xA08F6D3: MatCreateMPIAdj (mpiadj.c:904)<br>
> <br>
> <br>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 32 of 54<br>
> ==98001== at 0x4C29F73: malloc (vg_replace_malloc.c:307)<br>
> ==98001== by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)<br>
> ==98001== by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)<br>
> ==98001== by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)<br>
> ==98001== by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)<br>
> ==98001== by 0xD59A9A4: mv2_increment_allgather_coll_counter (ch3_shmem_coll.c:2218)<br>
> ==98001== by 0xD4E4CE4: PMPI_Allreduce (allreduce.c:917)<br>
> ==98001== by 0xCD9D74D: libparmetis__gkMPI_Allreduce (gkmpi.c:103)<br>
> ==98001== by 0xCDBB663: libparmetis__ComputeParallelBalance (stat.c:87)<br>
> ==98001== by 0xCDA4FE0: libparmetis__KWayFM (kwayrefine.c:352)<br>
> ==98001== by 0xCDA21ED: libparmetis__Global_Partition (kmetis.c:222)<br>
> ==98001== by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)<br>
> ==98001== by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)<br>
> ==98001== by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)<br>
> ==98001== by 0xCDA20B2: libparmetis__Global_Partition (kmetis.c:191)<br>
> ==98001== by 0xCDA2748: ParMETIS_V3_PartKway (kmetis.c:94)<br>
> ==98001== by 0xA2D6B39: MatPartitioningApply_Parmetis_Private (pmetis.c:145)<br>
> ==98001== by 0xA2D77D9: MatPartitioningApply_Parmetis (pmetis.c:219)<br>
> ==98001== by 0xA2CD46A: MatPartitioningApply (partition.c:332)<br>
> <br>
> <br>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 54<br>
> ==98001== at 0x4C29F73: malloc (vg_replace_malloc.c:307)<br>
> ==98001== by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)<br>
> ==98001== by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)<br>
> ==98001== by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)<br>
> ==98001== by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)<br>
> ==98001== by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)<br>
> ==98001== by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)<br>
> ==98001== by 0x99F1766: PetscAllreduceBarrierCheck (pbarrier.c:26)<br>
> ==98001== by 0x99F733E: PetscSplitOwnership (psplit.c:91)<br>
> ==98001== by 0x9C5C26B: PetscLayoutSetUp (pmap.c:262)<br>
> ==98001== by 0x9C5DB0D: PetscLayoutCreateFromSizes (pmap.c:112)<br>
> ==98001== by 0x9D9A018: ISGeneralSetIndices_General (general.c:568)<br>
> ==98001== by 0x9D9AB44: ISGeneralSetIndices (general.c:554)<br>
> ==98001== by 0x9D9ADC4: ISCreateGeneral (general.c:529)<br>
> ==98001== by 0x9B431E6: VecCreateGhostWithArray (pbvec.c:692)<br>
> ==98001== by 0x9B43A33: VecCreateGhost (pbvec.c:748)<br>
> <br>
> <br>
> ==98001== 88 (24 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 34 of 54<br>
> ==98001== at 0x4C29F73: malloc (vg_replace_malloc.c:307)<br>
> ==98001== by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)<br>
> ==98001== by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)<br>
> ==98001== by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)<br>
> ==98001== by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)<br>
> ==98001== by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)<br>
> ==98001== by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)<br>
> ==98001== by 0x9B0B5F3: VecSetSizes (vector.c:1318)<br>
> ==98001== by 0x9B42DDC: VecCreateMPIWithArray (pbvec.c:625)<br>
> ==98001== by 0xA7CF280: PCSetUp_Redundant (redundant.c:125)<br>
> ==98001== by 0xA7BB0CE: PCSetUp (precon.c:1009)<br>
> ==98001== by 0xAA2A9B9: KSPSetUp (itfunc.c:406)<br>
> ==98001== by 0xA92C490: PCSetUp_MG (mg.c:907)<br>
> ==98001== by 0xA93CAE9: PCSetUp_HMG (hmg.c:220)<br>
> ==98001== by 0xA7BB0CE: PCSetUp (precon.c:1009)<br>
> ==98001== by 0xAA2A9B9: KSPSetUp (itfunc.c:406)<br>
> ==98001== by 0xAA2B2E9: KSPSolve_Private (itfunc.c:658)<br>
> ==98001== by 0xAA2E8B7: KSPSolve (itfunc.c:889)<br>
> ==98001== by 0xAC33950: SNESSolve_NEWTONLS (ls.c:225)<br>
> ==98001== by 0xABD95AA: SNESSolve (snes.c:4569)<br>
> <br>
> <br>
> ==98001== 176 (48 direct, 128 indirect) bytes in 2 blocks are definitely lost in loss record 39 of 54<br>
> ==98001== at 0x4C29F73: malloc (vg_replace_malloc.c:307)<br>
> ==98001== by 0xDAE1D5E: hwloc_bitmap_alloc (bitmap.c:74)<br>
> ==98001== by 0xDA7523F: get_socket_bound_info (mv2_arch_detect.c:898)<br>
> ==98001== by 0xD93C87A: create_intra_sock_comm (create_2level_comm.c:593)<br>
> ==98001== by 0xD93BEBA: create_2level_comm (create_2level_comm.c:1762)<br>
> ==98001== by 0xD59A894: mv2_increment_shmem_coll_counter (ch3_shmem_coll.c:2183)<br>
> ==98001== by 0xD4E4CBB: PMPI_Allreduce (allreduce.c:912)<br>
> ==98001== by 0xBAB3848: hypre_MPI_Allreduce (mpistubs.c:1180)<br>
> ==98001== by 0xB96093E: hypre_ParCSRMatrixSetNumNonzeros_core (par_csr_matrix.c:383)<br>
> ==98001== by 0xB960A11: hypre_ParCSRMatrixSetDNumNonzeros (par_csr_matrix.c:413)<br>
> ==98001== by 0xB8B32DF: hypre_BoomerAMGSetup (par_amg_setup.c:2784)<br>
> ==98001== by 0xB8A0D26: HYPRE_BoomerAMGSetup (HYPRE_parcsr_amg.c:47)<br>
> ==98001== by 0xA9EDB17: PCSetUp_HYPRE (hypre.c:408)<br>
> ==98001== by 0xA7BB0CE: PCSetUp (precon.c:1009)<br>
> ==98001== by 0xA93B2DB: PCSetUp_HMG (hmg.c:161)<br>
> ==98001== by 0xA7BB0CE: PCSetUp (precon.c:1009)<br>
> ==98001== by 0xAA2A9B9: KSPSetUp (itfunc.c:406)<br>
> ==98001== by 0xAA2B2E9: KSPSolve_Private (itfunc.c:658)<br>
> ==98001== by 0xAA2E8B7: KSPSolve (itfunc.c:889)<br>
> ==98001== by 0xAC33950: SNESSolve_NEWTONLS (ls.c:225)<br>
> <br>
> <br>
> <br>
> <br>
<br>
</blockquote></div></div></div></div>
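<div>For readers following the thread, the leak pattern discussed above can be sketched in a few lines of C. This is only an illustration under stated assumptions: demo_bitmap, demo_bitmap_alloc, demo_bitmap_free, socket_info_leaky and socket_info_fixed are hypothetical stand-ins, not the hwloc or MVAPICH2 code. The stand-in allocates a small struct plus separate storage, which is one plausible reason valgrind reports both "direct" and "indirect" bytes for each lost block.</div>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an hwloc bitmap: a small struct that owns
 * separately allocated storage. This is NOT the real hwloc type. */
typedef struct {
    unsigned long *bits;  /* storage owned by the bitmap */
    size_t nwords;        /* number of words in bits */
} demo_bitmap;

/* Number of bitmaps allocated but not yet freed. */
static int live_bitmaps = 0;

static demo_bitmap *demo_bitmap_alloc(void)
{
    demo_bitmap *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->nwords = 8;
    b->bits = calloc(b->nwords, sizeof *b->bits);
    if (!b->bits) { free(b); return NULL; }
    live_bitmaps++;
    return b;
}

static void demo_bitmap_free(demo_bitmap *b)
{
    if (!b) return;
    assert(live_bitmaps > 0);
    free(b->bits);
    free(b);
    live_bitmaps--;
}

/* Leaky variant: allocates a bitmap, uses it, and returns without
 * releasing it, so every call abandons one struct plus its storage. */
static int socket_info_leaky(void)
{
    demo_bitmap *cpubind_set = demo_bitmap_alloc();
    if (!cpubind_set) return -1;
    cpubind_set->bits[0] = 1UL;               /* pretend CPU 0 is bound */
    return (int)(cpubind_set->bits[0] & 1UL); /* missing free here */
}

/* Fixed variant: same logic, but the bitmap is released before return. */
static int socket_info_fixed(void)
{
    demo_bitmap *cpubind_set = demo_bitmap_alloc();
    int bound;
    if (!cpubind_set) return -1;
    cpubind_set->bits[0] = 1UL;
    bound = (int)(cpubind_set->bits[0] & 1UL);
    demo_bitmap_free(cpubind_set);            /* pairs with the alloc above */
    return bound;
}
```

<div>Calling socket_info_leaky() once leaves live_bitmaps at 1, while socket_info_fixed() leaves the count unchanged. In this sketch the fix is the single free call before the return, which corresponds to the change described in the reply above.</div>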