[petsc-users] Poor weak scaling when solving successive linear systems

Michael Becker Michael.Becker at physik.uni-giessen.de
Thu May 24 09:49:29 CDT 2018


Yes, the time increment is the problem. Not because of this 8% in 
particular, but because it gets worse with more processes.

Performance does improve drastically on one processor (which I had 
never tested before); I attached the log_view file. If communication 
speed is the problem, then fewer processes per node should improve 
performance, and I could investigate that (in principle). I assume 
there's no way to reduce the volume of data that has to be communicated.
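For instance, something like

    mpiexec -n 125 -ppn 5 ./solver -log_view

(with an MPICH-style mpiexec; the executable name is just a placeholder 
for our driver) would keep the total process count fixed while spreading 
the ranks over more nodes.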

But thanks either way, this helped a lot.

Michael



On 24.05.2018 at 15:22, Mark Adams wrote:
> The KSPSolve time goes from 128 to 138 seconds in going from 125 to 
> 1000 processes. Is this the problem?
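> (That is (138 - 128)/128 ≈ 8% more solve time for an 8x increase in 
> process count.)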
>
> And as Lawrence pointed out, there is a lot of "load" imbalance (this 
> could come from a poor network). VecAXPY has no communication, yet shows 
> significant imbalance. But your actual load appears to be perfectly 
> balanced, so this could come from cache effects....
>
> And you are spending almost half the solve time in VecScatter. If you 
> really have this nice regular partitioning of the problem, then your 
> communication is slow, even on 125 processors. (So it is not a scaling 
> issue here, but if you do a one-processor test you should see it.)
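> A concrete way to make that comparison (the executable name is just a 
> placeholder for your driver) is to run
>
>     mpiexec -n 1 ./your_solver -log_view
>
> and compare the per-call VecScatter and MatSOR times against the 
> 125-process numbers.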
>
> Note that AMG coarse grids get bigger as the problem gets bigger, so it 
> is not perfectly scalable pre-asymptotically. Nothing really is, because 
> you don't saturate communication until you have at least a 3^D process 
> grid, and various random things will cause some non-perfect weak speedup.
>
> Mark
>
>
> On Thu, May 24, 2018 at 5:10 AM, Michael Becker 
> <Michael.Becker at physik.uni-giessen.de> wrote:
>
>     CG/GCR: I accidentally kept gcr in the batch file. That's still
>     from when I was experimenting with the different methods. The
>     performance is quite similar though.
>
>     I use the following setup for the ksp object and the vectors:
>
>         /* PETSc setup: structured 3D grid with a KSP attached to the DMDA */
>         ierr = PetscInitialize(&argc, &argv, (char*)0, (char*)0);CHKERRQ(ierr);
>
>         ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>
>         /* global grid g_Nx x g_Ny x g_Nz, process grid dims[0] x dims[1] x dims[2],
>            1 dof, stencil width 1, explicit ownership ranges l_Nx/l_Ny/l_Nz */
>         ierr = DMDACreate3d(PETSC_COMM_WORLD,
>                             DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED,
>                             DMDA_STENCIL_STAR, g_Nx, g_Ny, g_Nz,
>                             dims[0], dims[1], dims[2], 1, 1,
>                             l_Nx, l_Ny, l_Nz, &da);CHKERRQ(ierr);
>         ierr = DMSetFromOptions(da);CHKERRQ(ierr);
>         ierr = DMSetUp(da);CHKERRQ(ierr);
>         ierr = KSPSetDM(ksp, da);CHKERRQ(ierr);
>
>         /* global RHS/solution vectors plus a local (ghosted) work vector */
>         ierr = DMCreateGlobalVector(da, &b);CHKERRQ(ierr);
>         ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
>         ierr = DMCreateLocalVector(da, &l_x);CHKERRQ(ierr);
>         ierr = VecSet(x, 0);CHKERRQ(ierr);
>         ierr = VecSet(b, 0);CHKERRQ(ierr);
>
>     For the 125-process case the arrays l_Nx, l_Ny, l_Nz each have length
>     5 and every element has the value 30, so VecGetLocalSize() returns
>     27000 (= 30^3) for every rank. Is there something I didn't consider?
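>
>     For reference, a minimal sketch of how the ownership arrays are filled
>     and how the per-timestep solves are driven (ComputeMatrix and ComputeRHS
>     are placeholders for my actual operator and right-hand-side assembly,
>     and this continues the snippet above, so ierr, ksp, x and b are the
>     same variables):
>
>         PetscInt l_Nx[5], l_Ny[5], l_Nz[5], n_local, t;
>
>         /* uniform 150^3 grid split into a 5x5x5 process grid of 30^3 blocks;
>            filled before the DMDACreate3d() call above */
>         for (int i = 0; i < 5; i++) { l_Nx[i] = l_Ny[i] = l_Nz[i] = 30; }
>
>         /* sanity check once the vectors exist: 30^3 = 27000 rows per rank */
>         ierr = VecGetLocalSize(x, &n_local);CHKERRQ(ierr);
>         if (n_local != 27000) SETERRQ(PETSC_COMM_SELF, PETSC_ERR_PLIB, "unexpected local size");
>
>         /* operator and RHS callbacks (placeholder names), then one solve per timestep */
>         ierr = KSPSetComputeOperators(ksp, ComputeMatrix, NULL);CHKERRQ(ierr);
>         ierr = KSPSetComputeRHS(ksp, ComputeRHS, NULL);CHKERRQ(ierr);
>         ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>         for (t = 0; t < 1000; t++) {
>             ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* 1000 solves, as in the runs above */
>         }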
>
>     Michael
>
>
>
>     On 24.05.2018 at 09:39, Lawrence Mitchell wrote:
>>>     On 24 May 2018, at 06:24, Michael Becker <Michael.Becker at physik.uni-giessen.de> wrote:
>>>
>>>     Could you have a look at the attached log_view files and tell me if something is particularly odd? The system size per processor is 30^3 and the simulation ran over 1000 timesteps, which means KSPSolve() was called equally often. I introduced two new logging stages: one for the first solve and the final setup, and one for the remaining solves.
>>     The two attached logs use CG for the 125 proc run, but gcr for the 1000 proc run.  Is this deliberate?
>>
>>     125 proc:
>>
>>     -gamg_est_ksp_type cg
>>     -ksp_norm_type unpreconditioned
>>     -ksp_type cg
>>     -log_view
>>     -mg_levels_esteig_ksp_max_it 10
>>     -mg_levels_esteig_ksp_type cg
>>     -mg_levels_ksp_max_it 1
>>     -mg_levels_ksp_norm_type none
>>     -mg_levels_ksp_type richardson
>>     -mg_levels_pc_sor_its 1
>>     -mg_levels_pc_type sor
>>     -pc_gamg_type classical
>>     -pc_type gamg
>>
>>     1000 proc:
>>
>>     -gamg_est_ksp_type cg
>>     -ksp_norm_type unpreconditioned
>>     -ksp_type gcr
>>     -log_view
>>     -mg_levels_esteig_ksp_max_it 10
>>     -mg_levels_esteig_ksp_type cg
>>     -mg_levels_ksp_max_it 1
>>     -mg_levels_ksp_norm_type none
>>     -mg_levels_ksp_type richardson
>>     -mg_levels_pc_sor_its 1
>>     -mg_levels_pc_type sor
>>     -pc_gamg_type classical
>>     -pc_type gamg
>>
>>
>>     That aside, it looks like you have quite a bit of load imbalance. For example, in the smoother, where you're doing MatSOR, you have:
>>
>>     125 proc:
>>                        Calls  (ratio)   Max time (s)  (ratio)
>>     MatSOR             47808    1.0      6.8888e+01     1.7
>>
>>     1000 proc:
>>
>>     MatSOR             41400    1.0      6.3412e+01     1.6
>>
>>     VecScatters show similar behaviour.
>>
>>     How is your problem distributed across the processes?
>>
>>     Cheers,
>>
>>     Lawrence
>>
>
>
