[petsc-dev] Parmetis bug

Mark Adams mfadams at lbl.gov
Sun Nov 10 15:24:01 CST 2019


Fande, It looks to me like this branch in ParMetis must be taken to trigger
this error. First *Match_SHEM* and then CreateCoarseGraphNoMask.

   /* determine which matching scheme you will use */
    switch (ctrl->ctype) {
      case METIS_CTYPE_RM:
        Match_RM(ctrl, graph);
        break;
      case METIS_CTYPE_SHEM:
        if (eqewgts || graph->nedges == 0)
          Match_RM(ctrl, graph);
        else

*          Match_SHEM(ctrl, graph);*        break;
      default:
        gk_errexit(SIGERR, "Unknown ctype: %d\n", ctrl->ctype);
    }

-----------------------

  /* Check if the mask-version of the code is a good choice */
  mask = HTLENGTH;
  if (cnvtxs < 2*mask || graph->nedges/graph->nvtxs > mask/20) {
    CreateCoarseGraphNoMask(ctrl, graph, cnvtxs, match);
    return;
  }

------------

The actual error is in CreateCoarseGraphNoMask, graph->cmap is too small
and this gets garbage. parmetis coarsen.c:856:

    istart = xadj[v];
    iend   = xadj[v+1];
    for (j=istart; j<iend; j++) {

*      k = cmap[adjncy[j]];*      if ((m = htable[k]) == -1) {
        cadjncy[nedges] = k;

 Anyway, this is what I have so far,
Mark


On Sun, Nov 10, 2019 at 10:34 AM Mark Adams <mfadams at lbl.gov> wrote:

> Fande, the problem is k below seems to index beyond the end of htable,
> resulting in a crazy m and a segv on the last line below.
>
> I don't have a clean valgrind machine now, that is what is needed if no
> one has seen anything like this. I could add a test in a MR and get the
> pipeline to do it.
>
> void CreateCoarseGraphNoMask(ctrl_t *ctrl, graph_t *graph, idx_t cnvtxs,
>          idx_t *match)
> {
>   idx_t j, k, m, istart, iend, nvtxs, nedges, ncon, cnedges, v, u, dovsize;
>   idx_t *xadj, *vwgt, *vsize, *adjncy, *adjwgt;
>   idx_t *cmap, *htable;
>   idx_t *cxadj, *cvwgt, *cvsize, *cadjncy, *cadjwgt;
>   graph_t *cgraph;
> ine
>   WCOREPUSH;
>
>   dovsize = (ctrl->objtype == METIS_OBJTYPE_VOL ? 1 : 0);
>
>   IFSET(ctrl->dbglvl, METIS_DBG_TIME, gk_startcputimer(ctrl->ContractTmr));
>
>   nvtxs   = graph->nvtxs;
>   ncon    = graph->ncon;
>   xadj    = graph->xadj;
>   vwgt    = graph->vwgt;
>   vsize   = graph->vsize;
>   adjncy  = graph->adjncy;
>   adjwgt  = graph->adjwgt;
>   cmap    = graph->cmap;
>
>
>   /* Initialize the coarser graph */
>   cgraph = SetupCoarseGraph(graph, cnvtxs, dovsize);
>   cxadj    = cgraph->xadj;
>   cvwgt    = cgraph->vwgt;
>   cvsize   = cgraph->vsize;
>   cadjncy  = cgraph->adjncy;
>   cadjwgt  = cgraph->adjwgt;
>
>   htable = iset(cnvtxs, -1, iwspacemalloc(ctrl, cnvtxs));
>
>   cxadj[0] = cnvtxs = cnedges = 0;
>   for (v=0; v<nvtxs; v++) {
>     if ((u = match[v]) < v)
>       continue;
>
>     ASSERT(cmap[v] == cnvtxs);
>     ASSERT(cmap[match[v]] == cnvtxs);
>
>     if (ncon == 1)
>       cvwgt[cnvtxs] = vwgt[v];
>     else
>       icopy(ncon, vwgt+v*ncon, cvwgt+cnvtxs*ncon);
>
>     if (dovsize)
>       cvsize[cnvtxs] = vsize[v];
>
>     nedges = 0;
>
>     istart = xadj[v];
>     iend   = xadj[v+1];
>     for (j=istart; j<iend; j++) {
>       k = cmap[adjncy[j]];
>       if ((m = htable[k]) == -1) {
>         cadjncy[nedges] = k;
>         cadjwgt[nedges] = adjwgt[j];
>         htable[k] = nedges++;
>       }
>       else {
>         cadjwgt[m] += adjwgt[j];
>
> On Sun, Nov 10, 2019 at 1:35 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>>
>>
>> On Sat, Nov 9, 2019 at 10:51 PM Fande Kong <fdkong.jd at gmail.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Thanks for reporting this bug. I was surprised because we have
>>> sufficient heavy tests in moose using partition weights and do not have any
>>> issue so far.
>>>
>>>
>> I have been pounding on this code with elasticity and have not seen this
>> issue. I am now looking at Lapacianas and I only see it with pretty large
>> problems. The example below is pretty minimal (eg, it works with 16 cores
>> and it works with -dm_refine 4). I have reproduced this on Cori, SUMMIT and
>> my laptop.
>>
>>
>>> I will take a shot on this.
>>>
>>
>> Thanks, I'll try to take a look at it also. I have seen it in DDT, but
>> did not dig further. It looked like a typical segv in ParMetis.
>>
>>
>>>
>>> Fande,
>>>
>>> On Sat, Nov 9, 2019 at 3:08 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> snes/ex13 is getting a ParMetis segv with GAMG and coarse grid
>>>> repartitioning. Below shows the branch and how to run it.
>>>>
>>>> I've tried valgrind on Cori but it gives a lot of false positives. I've
>>>> seen this error in DDT but I have not had a chance to dig and try to fix
>>>> it. At least I know it has something to do with weights.
>>>>
>>>> If anyone wants to take a shot at it feel free. This bug rarely happens.
>>>>
>>>> The changes use weights and are just a few lines of code (from 1.5
>>>> years ago):
>>>>
>>>> 12:08 (0455fb9fec...)|BISECTING ~/Codes/petsc$ git bisect bad
>>>> 0455fb9fecf69cf5cf35948c84d3837e5a427e2e is the first bad commit
>>>> commit 0455fb9fecf69cf5cf35948c84d3837e5a427e2e
>>>> Author: Fande Kong <fdkong.jd at gmail.com>
>>>> Date:   Thu Jun 21 18:21:19 2018 -0600
>>>>
>>>>     Let parmetis and ptsotch take edge weights and vertex weights
>>>>
>>>>  src/mat/partition/impls/pmetis/pmetis.c | 7 +++++++
>>>>  src/mat/partition/impls/scotch/scotch.c | 6 +++---
>>>>  2 files changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> > mpiexec -n 32 ./ex13 -cells 2,4,4, -dm_refine 5 -simplex 0 -dim 3
>>>> -potential_petscspace_degree 1 -potential_petscspace_order 1 -pc_type gamg
>>>> -petscpartitioner_type simple -pc_gamg_repartition
>>>> true -check_pointer_intensity 0
>>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20191110/a61cb827/attachment.html>


More information about the petsc-dev mailing list