[petsc-dev] DMPlexInterpolate after DMPlexDistribute

Matthew Knepley knepley at gmail.com
Tue Apr 28 06:05:01 CDT 2020


On Tue, Apr 28, 2020 at 7:03 AM Stefano Zampini <stefano.zampini at gmail.com>
wrote:

> I think the slowdown in the second test (interpolate 0) is coming from the
> fact that Plex has to compute the dual graph on the fly, see here
> https://gitlab.com/petsc/petsc/-/blob/master/src/dm/impls/plex/plexpartition.c#L469
> .
>

You are right.

   Matt


> Are you seeing a performance regression with DMPLEX in this second case,
> or is it just the first time you have noticed it?
>
> On Tue, Apr 28, 2020 at 13:57 Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Tue, Apr 28, 2020 at 5:19 AM Pierre Jolivet <
>> pierre.jolivet at enseeiht.fr> wrote:
>>
>>> On 14 Apr 2020, at 2:36 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>
>>> On Tue, Apr 14, 2020 at 6:36 AM Pierre Jolivet <
>>> pierre.jolivet at enseeiht.fr> wrote:
>>>
>>>> Hello,
>>>> I’d like to call DMPlexInterpolate after DMPlexDistribute and not the
>>>> other way around for performance reasons (please stop me here if this is
>>>> equivalent).
>>>> When there is no overlap in DMPlexDistribute, it goes through fine.
>>>> If there is overlap, I run into an error.
>>>> Is this the expected behavior?
>>>
>>>
>>> Sorry for taking so long to get back to this.
>>> I thought I had everything set up, but looking at the performance,
>>> I'm quite surprised.
>>> I rewound and looked at src/dm/impls/plex/tutorials/ex2.c by just adding
>>> the DMPlexDistribute step. I guess this is equivalent to your steps #1 and
>>> #2.
>>> diff --git a/src/dm/impls/plex/tutorials/ex2.c b/src/dm/impls/plex/tutorials/ex2.c
>>> index a069d922b2..5e5c4fb584 100644
>>> --- a/src/dm/impls/plex/tutorials/ex2.c
>>> +++ b/src/dm/impls/plex/tutorials/ex2.c
>>> @@ -65,2 +65,5 @@ static PetscErrorCode CreateMesh(MPI_Comm comm, AppCtx *user, DM *dm)
>>>      ierr = DMPlexCreateFromFile(comm, user->filename, user->interpolate, dm);CHKERRQ(ierr);
>>> +    DM dmParallel;
>>> +    ierr = DMPlexDistribute(*dm, 0, NULL, &dmParallel);CHKERRQ(ierr);
>>> +    ierr = DMDestroy(&dmParallel);CHKERRQ(ierr);
>>>    }
>>>
>>> With a cube of 741,000 nodes and 4,574,068 elements.
>>> $ cat untitled.geo && ~/gmsh-4.5.4-Linux64/bin/gmsh untitled.geo -bin -3
>>> //+
>>> SetFactory("OpenCASCADE");
>>> Box(1) = {-0.5, -0.5, 0, 1, 1, 1};
>>> Characteristic Length {:} = 0.01;
>>>
>>> I get the following timings (optimized build, scalar-type=real,
>>> 64-bit-indices=false, SKL cluster).
>>> $ mpirun -n 120 ./ex2 -filename untitled.msh -log_view -interpolate true
>>> DMPlexInterp           1 1.0 1.0444e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 73  0  0  0  5  73  0  0  0  6     0
>>> DMPlexDistribute       1 1.0 2.8793e+01 1.0 0.00e+00 0.0 3.6e+03 6.2e+05 3.3e+01 21  0100100 60  21  0100100 69     0
>>> $ mpirun -n 120 ./ex2 -filename untitled.msh -log_view -interpolate false
>>> DMPlexDistribute       1 1.0 7.0265e+02 1.0 0.00e+00 0.0 3.1e+03 2.0e+05 2.6e+01 99  0100100 58  99  0100100 68     0
>>>
>>> Do you think I messed up something else?
>>> Our legacy implementation, on the same mesh, takes around 20 seconds
>>> (accounting for the ParMETIS call) to partition and distribute the mesh and
>>> generate the underlying structures for our FEM kernels (kd tree,
>>> distributed numbering, so on and so forth).
>>> Of course, DMPlex is much more versatile and generic, so I’d understand
>>> if it’s a little bit slower, but that’s quite a lot slower right now.
>>> Are there any options I could play with to speed things up?
>>>
>>
>> Yes, something is messed up. Vaclav's latest paper has almost the exact
>> setup for the cube benchmark (https://arxiv.org/pdf/2004.08729.pdf), and
>> you can see that the interpolation time is an order of magnitude smaller
>> for 128 procs and also scales linearly. What you show above in ex2 looks
>> serial, in that the mesh is loaded on 1 proc, and then interpolation is
>> also done on 1 proc, which corresponds to the time he shows as "Serial
>> Startup". I cannot explain the poor performance of your second run, however
>> the interpolation time in the first can be greatly reduced by interpolating
>> in parallel. Thus, the right answer is to load in parallel, interpolation,
>> and redistribute. That is what we do in that paper.
>>
>> Short answer: If you want scalable load and interpolation, I think you
>> should be using the parallel stuff that Vaclav just put in. I think that
>> means first converting your mesh to the HDF5 format using CreateFromFile
>> and then DMView to hdf5. Then you can load it in parallel. Vaclav, has
>> everything been pushed for this?
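
A minimal sketch of that convert-then-load-in-parallel flow, assuming the HDF5 route described above (the file name "mesh.h5" and the default viewer format are placeholders; the exact options depend on the PETSc version and on the parallel-load support being merged; error checking abbreviated):

  /* One-time conversion: read the Gmsh file and write the mesh in PETSc's HDF5 layout */
  DM          dm;
  PetscViewer viewer;
  ierr = DMPlexCreateFromFile(comm, "untitled.msh", PETSC_FALSE, &dm);CHKERRQ(ierr);
  ierr = PetscViewerHDF5Open(comm, "mesh.h5", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
  ierr = DMView(dm, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  ierr = DMDestroy(&dm);CHKERRQ(ierr);

  /* Production runs: load in parallel, interpolate in parallel, then redistribute */
  DM dmInt, dmDist;
  ierr = DMCreate(comm, &dm);CHKERRQ(ierr);
  ierr = DMSetType(dm, DMPLEX);CHKERRQ(ierr);
  ierr = PetscViewerHDF5Open(comm, "mesh.h5", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = DMLoad(dm, viewer);CHKERRQ(ierr);                       /* each rank gets a naive chunk of the mesh */
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
  ierr = DMPlexInterpolate(dm, &dmInt);CHKERRQ(ierr);            /* faces/edges built in parallel */
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  dm   = dmInt;
  ierr = DMPlexDistribute(dm, 0, NULL, &dmDist);CHKERRQ(ierr);   /* rebalance with the partitioner */
  if (dmDist) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist;}

The final redistribution corresponds to step 4 in the workflow quoted further down; overlap can then be added with DMPlexDistributeOverlap as described there.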
>>
>>   Thanks,
>>
>>       Matt
>>
>>> Thanks in advance for your help,
>>> Pierre
>>>
>>> Yes. What we want you to do is:
>>>
>>> 1) Load/generate mesh
>>>
>>> 2) Distribute (this can be done at the same time as load with parallel
>>> load)
>>>
>>> 3) Interpolate (this is also an option from parallel load)
>>>
>>> 4) If necessary, redistribute for load balance
>>>
>>> 5) Construct overlap
>>>
>>>     When you pass '1' below to DMPlexDistribute(), it distributes as normal
>>> and then calls
>>>
>>>
>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/DMPLEX/DMPlexDistributeOverlap.html
>>>
>>>     at the end. So you just postpone calling that until you have
>>> interpolated.
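
A minimal sketch of that ordering (the overlap width of 1 is only an example; error checking abbreviated):

  DM dmDist, dmInt, dmOverlap;
  ierr = DMPlexDistribute(dm, 0, NULL, &dmDist);CHKERRQ(ierr);           /* distribute with no overlap */
  if (dmDist) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist;}
  ierr = DMPlexInterpolate(dm, &dmInt);CHKERRQ(ierr);                    /* interpolate the distributed mesh */
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  dm   = dmInt;
  ierr = DMPlexDistributeOverlap(dm, 1, NULL, &dmOverlap);CHKERRQ(ierr); /* only now add the overlap layer */
  if (dmOverlap) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmOverlap;}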
>>>
>>>   Thanks,
>>>
>>>     Matt
>>>
>>>
>>>> Here is a MWE.
>>>> $ patch -p1 < patch.txt
>>>> $ cd src/dm/impls/plex/tests/
>>>> $ make ex18
>>>> $ mpirun -n 2 ./ex18 -distribute -interpolate after_distribute
>>>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>> [0]PETSC ERROR: Petsc has generated inconsistent data
>>>> [0]PETSC ERROR: Point SF contains 1 which is a cell
>>>>
>>>> Thanks,
>>>> Pierre
>>>>
>>>> diff --git a/src/dm/impls/plex/tests/ex18.c b/src/dm/impls/plex/tests/ex18.c
>>>> index 07421b3522..dd62be58e5 100644
>>>> --- a/src/dm/impls/plex/tests/ex18.c
>>>> +++ b/src/dm/impls/plex/tests/ex18.c
>>>> @@ -806 +806 @@ static PetscErrorCode CreateMesh(MPI_Comm comm, AppCtx *user, DM *dm)
>>>> -    ierr = DMPlexDistribute(*dm, 0, NULL, &pdm);CHKERRQ(ierr);
>>>> +    ierr = DMPlexDistribute(*dm, 1, NULL, &pdm);CHKERRQ(ierr);
>>>
>>>
>>>
>>
>
>
> --
> Stefano
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/