[petsc-dev] Imbalance in MG

Mark Adams mfadams at lbl.gov
Tue Jun 26 19:37:49 CDT 2018


On Fri, Jun 22, 2018 at 3:26 PM Junchao Zhang <jczhang at mcs.anl.gov> wrote:

> I instrumented PCMGMCycle_Private() to pull out some info about the
> matrices and VecScatters used in MatSOR_MPIAIJ, MatMultTranspose_MPIAIJ etc
> at each multigrid level to see how imbalanced they are. In my test, I have
> a 6 x 6 x 6 = 216 processor grid. Each processor has 30 x 30 x 30 grid
> points.  The code uses a 7-point stencil.  Except for some boundary points,
> it looks like the problem is perfectly balanced.  From the output, I can see
> processors communicate with more and more neighbors as they enter coarser
> grids. For example, non-boundary processors first have 6 face-neighbors,
> then 18 edge-neighbors, then 26 vertex-neighbors, and then even more
> neighbors. At some level, the grid is only on the first few processors and
> others are idle.
>

That is all expected. The 'stencils' get larger on coarser grids and the
code reduces the number of active processors on coarse grids when there is
not enough parallelism available.
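The reduction of active processors on coarse grids can be sketched roughly as follows (a minimal illustration only, not GAMG's actual repartitioning algorithm; `active_ranks` is a hypothetical helper, and the equations-per-rank limit corresponds to the `-pc_gamg_process_eq_limit` option discussed below):

```python
def active_ranks(global_eqs, comm_size, eq_limit):
    """Illustrative sketch (not GAMG's exact algorithm): keep only enough
    ranks that each active rank owns at least eq_limit equations, since
    matrices are partitioned by whole rows."""
    return max(1, min(comm_size, global_eqs // eq_limit))

# Fine grid from the post: 216 ranks x 30^3 points = 5,832,000 equations,
# so all 216 ranks stay active even with a generous limit.
print(active_ranks(216 * 30**3, 216, 200))

# On a coarse grid with ~1000 equations and a limit of 50 equations per
# rank, most ranks go idle.
print(active_ranks(1000, 216, 50))
```

This is why the first few ranks end up doing all the work on the coarsest levels while the rest sit idle.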


> The communication pattern is also imbalanced. For example, at level 3, I
> have
>
>  2172 Entering MG level 3
>  ...
>  2605 doing MatRestrict
>  2606 MatMultTranspose_MPIAIJ: on rank 0 mat has 59 rows, 284 nonzeros,
> send 188 to 33 nbrs,recv 10 from 1 nbrs
>  2607 MatMultTranspose_MPIAIJ: on rank 1 mat has 61 rows, 459 nonzeros,
> send 237 to 38 nbrs,recv 25 from 2 nbrs
>  2608 MatMultTranspose_MPIAIJ: on rank 2 mat has 62 rows, 519 nonzeros,
> send 245 to 28 nbrs,recv 28 from 2 nbrs
>  2609 MatMultTranspose_MPIAIJ: on rank 3 mat has 62 rows, 521 nonzeros,
> send 316 to 47 nbrs,recv 15 from 1 nbrs
>  2610 MatMultTranspose_MPIAIJ: on rank 4 mat has 62 rows, 525 nonzeros,
> send 411 to 62 nbrs,recv 28 from 2 nbrs
>  2611 MatMultTranspose_MPIAIJ: on rank 5 mat has 70 rows, 526 nonzeros,
> send 424 to 49 nbrs,recv 26 from 2 nbrs
>  2612 MatMultTranspose_MPIAIJ: on rank 6 mat has 63 rows, 503 nonzeros,
> send 259 to 41 nbrs,recv 28 from 4 nbrs
>  2613 MatMultTranspose_MPIAIJ: on rank 7 mat has 64 rows, 374 nonzeros,
> send 349 to 62 nbrs,recv 32 from 4 nbrs
>  2614 MatMultTranspose_MPIAIJ: on rank 8 mat has 67 rows, 461 nonzeros,
> send 354 to 51 nbrs,recv 29 from 4 nbrs
>  2615 MatMultTranspose_MPIAIJ: on rank 9 mat has 67 rows, 462 nonzeros,
> send 274 to 42 nbrs,recv 31 from 4 nbrs
>  2616 MatMultTranspose_MPIAIJ: on rank 10 mat has 67 rows, 458 nonzeros,
> send 359 to 62 nbrs,recv 30 from 4 nbrs
>  2617 MatMultTranspose_MPIAIJ: on rank 11 mat has 70 rows, 482 nonzeros,
> send 364 to 51 nbrs,recv 25 from 4 nbrs
>  2618 MatMultTranspose_MPIAIJ: on rank 12 mat has 61 rows, 469 nonzeros,
> send 274 to 42 nbrs,recv 29 from 3 nbrs
>  2619 MatMultTranspose_MPIAIJ: on rank 13 mat has 64 rows, 454 nonzeros,
> send 359 to 62 nbrs,recv 32 from 3 nbrs
>  2620 MatMultTranspose_MPIAIJ: on rank 14 mat has 64 rows, 556 nonzeros,
> send 365 to 51 nbrs,recv 34 from 3 nbrs
>  2621 MatMultTranspose_MPIAIJ: on rank 15 mat has 64 rows, 542 nonzeros,
> send 322 to 31 nbrs,recv 36 from 3 nbrs
>  2622 MatMultTranspose_MPIAIJ: on rank 16 mat has 64 rows, 531 nonzeros,
> send 411 to 44 nbrs,recv 34 from 3 nbrs
>  2623 MatMultTranspose_MPIAIJ: on rank 17 mat has 70 rows, 497 nonzeros,
> send 476 to 36 nbrs,recv 28 from 4 nbrs
>  2624 MatMultTranspose_MPIAIJ: on rank 18 mat has 61 rows, 426 nonzeros,
> send 0 to 0 nbrs,recv 30 from 4 nbrs
>  2625 MatMultTranspose_MPIAIJ: on rank 19 mat has 64 rows, 521 nonzeros,
> send 0 to 0 nbrs,recv 31 from 4 nbrs
>  ...
>
> The machine has 36 cores per node.  It acts as if the first 18 processors
> on the first node are sending small messages to the remaining processors.
> Obviously, there is no way for this to be balanced. Does someone have a good
> explanation for that, and know of options to get rid of this imbalance? For
> example, no idle processors, spread-out communication, etc.
> Thanks.
>

There are processors that are deactivated on coarse grids. If you want to
minimize this process reduction then use "-pc_gamg_process_eq_limit 1".
This will reduce the number of active processors only when there are more
processors than equations, in which case we have no choice because we
partition matrices by (whole) rows. The default is 50, which is pretty
low; I usually run with about 200. But this is very architecture-,
problem-, and metric-specific, so you can do a parameter sweep and measure
where your problem/machine performs best.
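Such a sweep could be scripted along these lines (a sketch only: `./ex45` is a placeholder executable name; `-pc_type gamg`, `-pc_gamg_process_eq_limit`, and `-log_view` are the PETSc options discussed above):

```python
# Generate command lines for a parameter sweep over GAMG's
# process-reduction limit; compare the -log_view timings of each run.
limits = [1, 50, 100, 200, 400]
cmds = [
    "mpiexec -n 216 ./ex45 -pc_type gamg "
    f"-pc_gamg_process_eq_limit {limit} -log_view"
    for limit in limits
]
for cmd in cmds:
    print(cmd)
```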

Mark


>
> --Junchao Zhang
>