[petsc-users] DMPlexDistribute segmentation fault

Matthew Knepley knepley at gmail.com
Sat Sep 29 10:59:49 CDT 2018


On Fri, Sep 28, 2018 at 9:57 PM Josh L <ysjosh.lo at gmail.com> wrote:

> Hi,
>
> I am implementing DMPlex to handle my mesh in parallel, but I am getting a
> segmentation fault during DMPlexDistribute.
> I am testing with 2 processors on a toy mesh that has only 4 quad elements.
>
> The code is as follows:
>       CALL DMPlexCreateFromCellList
>       (check mesh topology: face and skeleton symmetry)
>       CALL DMPlexDistribute(dmserial, 0, PETSC_NULL_SF, dmmpi, ierr)
> (segmentation fault occurs on rank #1 here)
>

Crud, I was missing a check for the case where the overlap argument differs
across processes. This is now fixed.

Second, your Fortran integers might not be the same size as PetscInt. Try
declaring it

  PetscInt overlap
  overlap = 0

and then passing 'overlap' instead of the literal 0.
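
A minimal sketch of the call site with that change (assuming the PETSc 3.9
Fortran module style and the variable names from your code; adjust as needed):

#include <petsc/finclude/petscdmplex.h>
      use petscdmplex
      DM             dmserial, dmmpi
      PetscInt       overlap
      PetscErrorCode ierr

      overlap = 0
      ! Pass a PetscInt variable rather than an integer literal so the
      ! argument width matches what the PETSc Fortran interface expects.
      call DMPlexDistribute(dmserial, overlap, PETSC_NULL_SF, dmmpi, ierr)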

  Thanks,

     Matt


> I traced it back to the external library "Chaco", which PETSc calls to
> partition the mesh.
>
> The following is the stack trace.
>
> For rank #1:
>
> #10 main () at /work/03691/yslo/stampede2/linear/rewrite/src/main.F90:18
> (at 0x000000000040569d)
> #9 createmesh (rank=1, nsize=2, dmmpi=...) at
> /work/03691/yslo/stampede2/linear/rewrite/src/createmesh.F90:107 (at
> 0x000000000040523b)
> #8 dmplexdistribute_ (dm=0x320, overlap=0x1, sf=0x7fffffff4480,
> dmParallel=0x2aaab78a585d, ierr=0x1) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/ftn-custom/zplexdistribute.c:15
> (at 0x00002aaaac08334e)
> #7 DMPlexDistribute (dm=0x320, overlap=1, sf=0x7fffffff4480,
> dmParallel=0x2aaab78a585d) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexdistribute.c:1664
> (at 0x00002aaaabd12d08)
> #6 PetscPartitionerPartition (part=0x320, dm=0x1,
> partSection=0x7fffffff4480, partition=0x2aaab78a585d) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexpartition.c:675
> (at 0x00002aaaabcfe92c)
> #5 PetscPartitionerPartition_Chaco (part=0x320, dm=0x1, nparts=-48000,
> numVertices=-1215670179, start=0x1, adjacency=0x22, partSection=0x0,
> partition=0x772120) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexpartition.c:1279
> (at 0x00002aaaabd03dfd)
> #4 interface.Z (nvtxs=800, start=0x1, adjacency=0x7fffffff4480,
> vwgts=0x2aaab78a585d, ewgts=0x1, x=0x22, y=0x0, z=0x0, outassignname=0x0,
> outfilename=0x0, assignment=0x76ff90, architecture=1, ndims_tot=0,
> mesh_dims=0x7fffffff7038, goal=0x0, global_method=1, local_method=1,
> rqi_flag=0, vmax=200, ndims=1, eigtol=2.1137067449068142e-314,
> seed=123636512) at
> /tmp/petsc-build/externalpackages/skylake/skylake/Chaco-2.2-p2/code/main/interface.c:206
> (at 0x00002aaaad7aaf70)
> #3 submain.Z (graph=0x320, nvtxs=1, nedges=-48000,
> using_vwgts=-1215670179, using_ewgts=1, igeom=34, coords=0x0,
> outassignname=0x0, outfilename=0x0, assignment=0x76ff8e, goal=0x7739a0,
> architecture=1, ndims_tot=0, mesh_dims=0x7fffffff7038, global_method=1,
> local_method=1, rqi_flag=0, vmax=200, ndims=1,
> eigtol=2.1137067449068142e-314, seed=123636512) at
> /tmp/petsc-build/externalpackages/skylake/skylake/Chaco-2.2-p2/code/submain/submain.c:151
> (at 0x00002aaaad7ae52e)
> #2 check_input.Z (graph=0x320, nvtxs=1, nedges=-48000, igeom=-1215670179,
> coords=0x1, graphname=0x22 <error: Cannot access memory at address 0x22>,
> assignment=0x76ff8e, goal=0x7739a0, architecture=1, ndims_tot=0,
> mesh_dims=0x7fffffff7038, global_method=1, local_method=1, rqi_flag=0,
> vmax=0x7fffffff47a8, ndims=1, eigtol=2.1137067449068142e-314) at
> /tmp/petsc-build/externalpackages/skylake/skylake/Chaco-2.2-p2/code/input/check_input.c:56
> (at 0x00002aaaad7c6ed1)
> #1 check_graph (graph=0x320, nvtxs=1, nedges=-48000) at
> /tmp/petsc-build/externalpackages/skylake/skylake/Chaco-2.2-p2/code/graph/check_graph.c:90
> (at 0x00002aaaad8204d3)
> #0 is_an_edge (vertex=0x320, v2=1, weight2=0x7fffffff4480) at
> /tmp/petsc-build/externalpackages/skylake/skylake/Chaco-2.2-p2/code/graph/check_graph.c:134
> (at 0x00002aaaad8206d3)
>
> For rank #0:
>
> #18 main () at /work/03691/yslo/stampede2/linear/rewrite/src/main.F90:18
> (at 0x000000000040569d)
> #17 createmesh (rank=0, nsize=2, dmmpi=...) at
> /work/03691/yslo/stampede2/linear/rewrite/src/createmesh.F90:107 (at
> 0x000000000040523b)
> #16 dmplexdistribute_ (dm=0x65f300, overlap=0x0, sf=0x2aaab4c8698c,
> dmParallel=0xffffffffffffffff, ierr=0x0) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/ftn-custom/zplexdistribute.c:15
> (at 0x00002aaaac08334e)
> #15 DMPlexDistribute (dm=0x65f300, overlap=0, sf=0x2aaab4c8698c,
> dmParallel=0xffffffffffffffff) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexdistribute.c:1664
> (at 0x00002aaaabd12d08)
> #14 PetscPartitionerPartition (part=0x65f300, dm=0x0,
> partSection=0x2aaab4c8698c, partition=0xffffffffffffffff) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexpartition.c:675
> (at 0x00002aaaabcfe92c)
> #13 PetscPartitionerPartition_Chaco (part=0x65f300, dm=0x0,
> nparts=-1261934196, numVertices=-1, start=0x0, adjacency=0x0,
> partSection=0x7726c0, partition=0x7fffffff71a8) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/dm/impls/plex/plexpartition.c:1314
> (at 0x00002aaaabd04029)
> #12 ISCreateGeneral (comm=6681344, n=0, idx=0x2aaab4c8698c,
> mode=(PETSC_OWN_POINTER | PETSC_USE_POINTER | unknown: 4294967292), is=0x0)
> at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/vec/is/is/impls/general/general.c:671
> (at 0x00002aaaab0ea94e)
> #11 ISGeneralSetIndices (is=0x65f300, n=0, idx=0x2aaab4c8698c,
> mode=(PETSC_OWN_POINTER | PETSC_USE_POINTER | unknown: 4294967292)) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/vec/is/is/impls/general/general.c:698
> (at 0x00002aaaab0eaa77)
> #10 ISGeneralSetIndices_General.Z (is=0x65f300, n=0, idx=0x2aaab4c8698c,
> mode=(PETSC_OWN_POINTER | PETSC_USE_POINTER | unknown: 4294967292)) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/vec/is/is/impls/general/general.c:712
> (at 0x00002aaaab0ef567)
> #9 PetscLayoutSetUp (map=0x65f300) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/vec/is/utils/pmap.c:137 (at
> 0x00002aaaab0c6d52)
> #8 PetscSplitOwnership (comm=6681344, n=0x0, N=0x2aaab4c8698c) at
> /home1/apps/intel17/impi17_0/petsc/3.9/src/sys/utils/psplit.c:80 (at
> 0x00002aaaaaeb7b00)
> #7 PMPI_Allreduce (sendbuf=0x65f300, recvbuf=0x0, count=-1261934196,
> datatype=-1, op=0, comm=0) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/mpi/coll/allreduce.c:1395
> (at 0x00002aaab40966e6)
> #6 MPIR_Allreduce_intra (sendbuf=0x65f300, recvbuf=0x0, count=-1261934196,
> datatype=-1, op=0, comm_ptr=0x0, errflag=0x7fffffff4798) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/mpi/coll/allreduce.c:339
> (at 0x00002aaab409307b)
> #5 MPIR_Allreduce_shm_generic (sendbuf=<optimized out>, recvbuf=<optimized
> out>, count=<optimized out>, datatype=<optimized out>, op=<optimized out>,
> comm_ptr=<optimized out>, errflag=<optimized out>, kind=1476395011) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/mpi/coll/allreduce.c:1137
> (at 0x00002aaab409307b)
> #4 I_MPI_COLL_SHM_KNARY_REDUCE (node_comm_ptr=<optimized out>,
> root=<optimized out>, localbuf=<optimized out>, sendbuf=<optimized out>,
> recvbuf=<optimized out>, count=<optimized out>, datatype=<optimized out>,
> op=<optimized out>, errflag=<optimized out>, knomial_factor=<optimized
> out>) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/I_MPI/include/shm_coll_templating.h:1090
> (at 0x00002aaab409307b)
> #3 I_MPI_COLL_SHM_GENERIC_GATHER_REDUCE..1 (node_comm_ptr=0x65f300,
> root=0, is_reduce=-1261934196, localbuf=0xffffffffffffffff, sendbuf=0x0,
> recvbuf=0x0, count=1, datatype=1275069445, op=1476395011,
> errflag=0x7fffffff4798, knomial_factor=4, algo_type=2) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/I_MPI/include/shm_coll_templating.h:558
> (at 0x00002aaab408f3a9)
> #2 I_MPI_memcpy (destination=<optimized out>, source=0x0, size=<optimized
> out>) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/I_MPI/include/shm_coll_templating.h:749
> (at 0x00002aaab408f3a9)
> #1 PMPIDI_CH3I_Progress (progress_state=0x65f300, is_blocking=0) at
> /tmp/mpi.xtmpdir.7b663e0dc22b2304e487307e376dc132.14974_32e/mpi.32e.ww14.20170405/dev/x86_64/release_mt/../../src/mpid/ch3/channels/nemesis/src/ch3_progress.c:981
> (at 0x00002aaab40e85a6)
> #0 sched_yield () from /lib64/libc.so.6 (at 0x00002aaab7898e47)
>
>
> Any idea why this is happening? The overlap is set to 0, but it shows up as
> 1 on rank #1.
>
>
> Is there any way to know the number of cells and vertices distributed to
> each processor?
> My old code partitions the mesh with Metis, and I always output cell data
> showing the rank number on each cell, so I can visualize how the mesh is
> partitioned.
> It is not necessary, but is there any way to get this in DMPlex?
> DMPlexGetCellNumbering might work, but it fails to link. In fact, many
> functions in the developer category fail to link.
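
A minimal sketch of one way to query the local counts on the distributed DM
(assuming the DMPlex convention that cells form the height-0 stratum and
vertices the depth-0 stratum; variable names are illustrative):

      PetscInt       height, depth, cStart, cEnd, vStart, vEnd
      PetscErrorCode ierr

      height = 0
      depth  = 0
      ! Ranges of cell points and vertex points in the local (distributed) mesh
      call DMPlexGetHeightStratum(dmmpi, height, cStart, cEnd, ierr)
      call DMPlexGetDepthStratum(dmmpi, depth, vStart, vEnd, ierr)
      write(*,*) 'rank', rank, ':', cEnd - cStart, 'cells,', vEnd - vStart, 'vertices'

The per-cell rank value could then be written out as cell data for
visualization, much like the Metis-based workflow described above.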
>
>
> Thanks,
> Yu-Sheng
>
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

