[petsc-users] Poor weak scaling when solving successive linear systems

Mark Adams mfadams at lbl.gov
Thu May 24 09:58:19 CDT 2018


On Thu, May 24, 2018 at 10:49 AM, Michael Becker <
Michael.Becker at physik.uni-giessen.de> wrote:

> Yes, the increase in time is the problem. Not because of this 8% in
> particular, but because it gets worse with more processes.
>

It will never be perfect. 8% is not bad but there is clearly some bad stuff
going on here.
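
For reference, the arithmetic behind that 8% can be sketched as follows,
assuming the KSPSolve times quoted further down (128 s on 125 processes,
138 s on 1000). The helper names are hypothetical and not part of PETSc.

```c
/* Hypothetical helpers for weak-scaling arithmetic; not PETSc API. */

/* Weak-scaling efficiency: with fixed work per process, the ideal run time
 * stays constant, so efficiency is t_base / t_scaled (1.0 is perfect). */
double weak_scaling_efficiency(double t_base, double t_scaled)
{
    return t_base / t_scaled;
}

/* Relative increase in run time when going from the base run to the
 * larger run, e.g. 0.08 means 8% slower. */
double relative_increase(double t_base, double t_scaled)
{
    return (t_scaled - t_base) / t_base;
}
```

With t_base = 128 and t_scaled = 138 this gives an increase of about 7.8%
and an efficiency of about 0.93.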

Note that the VecScatter time could be catching load imbalance from
elsewhere in the code, so this could all be load imbalance.
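
The max/min time ratio that Lawrence quotes from -log_view is just the
slowest rank's time divided by the fastest rank's time for an event. A
minimal sketch of that ratio, assuming you have gathered per-rank timings
into an array yourself (the function name is hypothetical, not PETSc API):

```c
#include <stddef.h>

/* Hypothetical helper, not PETSc API: ratio of the largest to the smallest
 * per-rank time for one event. A value of 1.0 means every rank spent the
 * same time. Returns 0.0 if the array is empty or the minimum time is not
 * positive, since the ratio is undefined in those cases. */
double max_min_ratio(const double *times, size_t n)
{
    if (n == 0) {
        return 0.0;
    }
    double tmin = times[0];
    double tmax = times[0];
    for (size_t i = 1; i < n; ++i) {
        if (times[i] < tmin) tmin = times[i];
        if (times[i] > tmax) tmax = times[i];
    }
    if (tmin <= 0.0) {
        return 0.0;
    }
    return tmax / tmin;
}
```

A rank that arrives late at a communication step makes its neighbours wait,
and that waiting is counted in their VecScatter time, which is why a large
ratio there does not by itself point at the network.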


> Performance does improve drastically on one processor (which I previously
> never tested);
>
You also get all the memory bandwidth with one process, assuming you have a
multi-core processor. This is also a big win. So a one-process test mixes
different effects. If you can fill one socket, then that would isolate
communication, including things like packing send buffers.


> I attached the log_view file. If communication speed is the problem, then
> I assume fewer processes per node would improve performance, and I could
> investigate that (in principle). I assume there's no way to reduce the data
> volume.
>
> But thanks either way, this helped a lot.
>
> Michael
>
>
>
> On 24.05.2018 at 15:22, Mark Adams wrote:
>
> The KSPSolve time goes from 128 to 138 seconds in going from 125 to 1000
> processes. Is this the problem?
>
> And as Lawrence pointed out there is a lot of "load" imbalance. (This
> could come from a poor network.) VecAXPY has no communication and has
> significant imbalance. You seem to have perfect actual load balance,
> but the measured imbalance can come from cache effects....
>
> And you are spending almost half the solve time in VecScatter. If you
> really have this nice regular partitioning of the problem, then your
> communication is slow, even on 125 processors. (So it is not a scaling
> issue here, but if you do a one processor test you should see it).
>
> Note, AMG coarse grids get bigger as the problem gets bigger, so it is not
> perfectly scalable pre-asymptotically. Nothing is really, because you don't
> saturate communication until you have at least a 3^D process grid and
> various random things will cause some non-perfect weak speedup.
>
> Mark
>
>
> On Thu, May 24, 2018 at 5:10 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
>
>> CG/GCR: I accidentally kept gcr in the batch file. That's still from when
>> I was experimenting with the different methods. The performance is quite
>> similar though.
>>
>> I use the following setup for the ksp object and the vectors:
>>
>> ierr=PetscInitialize(&argc, &argv, (char*)0, (char*)0);CHKERRQ(ierr);
>>
>> ierr=KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>>
>> ierr=DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED,
>>              DMDA_STENCIL_STAR, g_Nx, g_Ny, g_Nz, dims[0], dims[1], dims[2],
>>              1, 1, l_Nx, l_Ny, l_Nz, &da);CHKERRQ(ierr);
>> ierr=DMSetFromOptions(da);CHKERRQ(ierr);
>> ierr=DMSetUp(da);CHKERRQ(ierr);
>> ierr=KSPSetDM(ksp, da);CHKERRQ(ierr);
>>
>> ierr=DMCreateGlobalVector(da, &b);CHKERRQ(ierr);
>>
>> ierr=VecDuplicate(b, &x);CHKERRQ(ierr);
>>
>> ierr=DMCreateLocalVector(da, &l_x);CHKERRQ(ierr);
>> ierr=VecSet(x,0);CHKERRQ(ierr);
>> ierr=VecSet(b,0);CHKERRQ(ierr);
>>
>> For the 125 case the arrays l_Nx, l_Ny, l_Nz have dimension 5 and every
>> element has value 30. VecGetLocalSize() returns 27000 for every rank. Is
>> there something I didn't consider?
>>
>> Michael
>>
>>
>>
>> On 24.05.2018 at 09:39, Lawrence Mitchell wrote:
>>
>> On 24 May 2018, at 06:24, Michael Becker <Michael.Becker at physik.uni-giessen.de> wrote:
>>
>> Could you have a look at the attached log_view files and tell me if something is particularly odd? The system size per processor is 30^3 and the simulation ran over 1000 timesteps, which means KSPSolve() was called equally often. I introduced two new logging stages: one for the first solve and the final setup, and one for the remaining solves.
>>
>> The two attached logs use CG for the 125 proc run, but gcr for the 1000 proc run.  Is this deliberate?
>>
>> 125 proc:
>>
>> -gamg_est_ksp_type cg
>> -ksp_norm_type unpreconditioned
>> -ksp_type cg
>> -log_view
>> -mg_levels_esteig_ksp_max_it 10
>> -mg_levels_esteig_ksp_type cg
>> -mg_levels_ksp_max_it 1
>> -mg_levels_ksp_norm_type none
>> -mg_levels_ksp_type richardson
>> -mg_levels_pc_sor_its 1
>> -mg_levels_pc_type sor
>> -pc_gamg_type classical
>> -pc_type gamg
>>
>> 1000 proc:
>>
>> -gamg_est_ksp_type cg
>> -ksp_norm_type unpreconditioned
>> -ksp_type gcr
>> -log_view
>> -mg_levels_esteig_ksp_max_it 10
>> -mg_levels_esteig_ksp_type cg
>> -mg_levels_ksp_max_it 1
>> -mg_levels_ksp_norm_type none
>> -mg_levels_ksp_type richardson
>> -mg_levels_pc_sor_its 1
>> -mg_levels_pc_type sor
>> -pc_gamg_type classical
>> -pc_type gamg
>>
>>
>> That aside, it looks like you have quite a bit of load imbalance.  e.g. in the smoother, where you're doing MatSOR, you have:
>>
>> 125 proc:
>>
>>                    Calls  Ratio  Time (s)    Max/Min time
>> MatSOR             47808  1.0    6.8888e+01  1.7
>>
>> 1000 proc:
>>
>> MatSOR             41400  1.0    6.3412e+01  1.6
>>
>> VecScatters show similar behaviour.
>>
>> How is your problem distributed across the processes?
>>
>> Cheers,
>>
>> Lawrence
>>
>>
>>
>>
>
>