[petsc-users] approaches to reduce computing time

Matthew Knepley knepley at gmail.com
Tue Nov 12 14:59:30 CST 2013


On Tue, Nov 12, 2013 at 2:48 PM, Roc Wang <pengxwang at hotmail.com> wrote:

>
>
> ------------------------------
> Date: Tue, 12 Nov 2013 14:22:35 -0600
> Subject: Re: [petsc-users] approaches to reduce computing time
> From: knepley at gmail.com
> To: pengxwang at hotmail.com
> CC: jedbrown at mcs.anl.gov; petsc-users at mcs.anl.gov
>
> On Tue, Nov 12, 2013 at 2:14 PM, Roc Wang <pengxwang at hotmail.com> wrote:
>
> Thanks Jed,
>
> I have questions about load balance and PC type below.
>
> > From: jedbrown at mcs.anl.gov
> > To: pengxwang at hotmail.com; petsc-users at mcs.anl.gov
> > Subject: Re: [petsc-users] approaches to reduce computing time
> > Date: Sun, 10 Nov 2013 12:20:18 -0700
> >
> > Roc Wang <pengxwang at hotmail.com> writes:
> >
> > > Hi all,
> > >
> > > I am trying to minimize the computing time to solve a large sparse
> linear system. The matrix corresponds to a grid with m=321, n=321, and p=321.
> I am attacking the computing time from two directions: 1) finding a
> preconditioner that reduces the number of iterations, and 2) requesting
> more cores.
> > >
> > > ----For the first method, I tried several methods:
> > > 1 default KSP and PC,
> > > 2 -ksp_type fgmres -ksp_gmres_restart 30 -pc_type ksp -ksp_pc_type
> jacobi,
> > > 3 -ksp_type lgmres -ksp_gmres_restart 40 -ksp_lgmres_augment 10,
> > > 4 -ksp_type lgmres -ksp_gmres_restart 50 -ksp_lgmres_augment 10,
> > > 5 -ksp_type lgmres -ksp_gmres_restart 40 -ksp_lgmres_augment 10
> -pc_type asm (PCASM)
> > >
> > > The iteration counts and timings with 128 cores requested are:
> > > case#    iter    timing (s)
> > > 1        1436      816
> > > 2           3    12658
> > > 3        1069      669.64
> > > 4         872      768.12
> > > 5         927      513.14
> > >
> > > It can be seen that changing -ksp_gmres_restart and -ksp_lgmres_augment
> helps to reduce the iterations but not the timing (compare cases 3 and
> 4). Second, PCASM helps a lot. Although the second option reduces the
> iterations dramatically, the timing increases enormously. Is that because more
> operations are needed in the PC?
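(Most likely yes: -pc_type ksp applies a full inner KSP solve as the
preconditioner, so each of the 3 outer iterations hides a large number of
inner MatMult and vector operations; the outer iteration count alone says
little about the total work.)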
> > >
> > > My questions here are: 1. Which direction should I take when selecting
> > > -ksp_gmres_restart and -ksp_lgmres_augment? For example, is a larger
> > > restart with a larger augment better, or a larger restart with a
> > > smaller augment?
> >
> > Look at the -log_summary. By increasing the restart, the work in
> > KSPGMRESOrthog will increase linearly, but the number of iterations
> > might decrease enough to compensate. There is no general rule here
> > since it depends on the relative expense of operations for your problem
> > on your machine.
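As a rough illustration of that trade-off: with restart length m, one
restart cycle performs on the order of m(m+1)/2 inner products and vector
updates in KSPGMRESOrthog, so going from restart 30 to restart 60 roughly
quadruples the orthogonalization work per cycle, while the MatMult and
PCApply cost per iteration is unchanged. The larger restart only pays off
if the total iteration count drops enough to cover that extra vector work.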
> >
> > > ----For the second method, I tried -ksp_type lgmres
> -ksp_gmres_restart 40 -ksp_lgmres_augment 10 -pc_type asm with different
> numbers of cores. I found the speedup ratio increases slowly once more than
> 32 to 64 cores are requested. I searched the mailing list archives and
> found that I am very likely running into the memory bandwidth bottleneck.
> http://www.mail-archive.com/petsc-users@mcs.anl.gov/msg19152.html:
> > >
> > > # of cores    iter    timing (s)
> > >    1          923     19541.83
> > >    4          929      5897.06
> > >    8          932      4854.72
> > >   16          924      1494.33
> > >   32          924      1480.88
> > >   64          928       686.89
> > >  128          927       627.33
> > >  256          926       552.93
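(Working out the ratios from that table: the speedup at 256 cores relative
to 1 core is about 19541.83 / 552.93 = 35, far from the ideal 256, and going
from 128 to 256 cores only gains about 627.33 / 552.93 = 1.13x; the 16- and
32-core runs take essentially the same time.)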
> >
> > The bandwidth issue has more to do with using multiple cores within a
> > node rather than between nodes. Likely the above is a load balancing
> > problem or bad communication.
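(One way to separate the two effects, if the batch system allows it, is to
run the same total number of MPI ranks spread over more nodes, i.e. fewer
ranks per node; if the time drops noticeably, memory bandwidth within the
node is the limiting factor rather than inter-node communication.)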
>
> I use a DM to manage the distributed data.  The DM was created by calling
> DMDACreate3d() and letting PETSc decide the local number of nodes in each
> direction. To my understanding, the load on each core is determined at this
> stage.   Is the load balancing done when DMDACreate3d() is called with the
> PETSC_DECIDE option? Or how should I balance the load after the DM is
> created?
>
>
> We do not have a way to do fine-grained load balancing for the DMDA since
> it is intended for very simple topologies. You can check whether
> the imbalance comes from the division by running a cube that is
> evenly divisible by a cube number of processes.
>
>    Matt
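For reference, a minimal sketch of that check, assuming PETSc 3.4-era names
(DMDA_BOUNDARY_NONE, the pre-3.5 PetscSynchronizedFlush signature) and a
7-point star stencil with one degree of freedom; adjust to your actual
setup. With PETSC_DECIDE, PETSc splits the 321 points in each direction as
evenly as it can, but since 321 = 3 * 107 most process grids leave some
ranks with one extra plane of points:

  #include <petscdmda.h>

  int main(int argc, char **argv)
  {
    DM             da;
    PetscInt       xs, ys, zs, xm, ym, zm;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); CHKERRQ(ierr);
    /* Same global grid as in the thread; PETSC_DECIDE lets PETSc pick both
       the process grid and the local sizes. */
    ierr = DMDACreate3d(PETSC_COMM_WORLD,
                        DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
                        DMDA_STENCIL_STAR,
                        321, 321, 321,                            /* global grid */
                        PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, /* process grid */
                        1, 1, NULL, NULL, NULL, &da); CHKERRQ(ierr);
    /* Each rank prints the block it owns; the spread in sizes is the
       imbalance built in by the decomposition. */
    ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm); CHKERRQ(ierr);
    ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "local block %D x %D x %D\n",
                                   xm, ym, zm); CHKERRQ(ierr);
    ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr);
    ierr = DMDestroy(&da); CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

Comparing those local block sizes (and against an evenly divisible grid,
e.g. 320^3 on a cube number of processes) shows whether the one-extra-plane
imbalance is what you are seeing in -log_summary.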
>
> So, is there nothing I can do to balance the load if I use DMDA?  Would
> you please take a look at the attached log summary files and give me some
> suggestions on how to improve the speedup ratio? Thanks.
>

Please try what I suggested above. Also, it looks like there is a little load
imbalance:

VecAXPY              234 1.0 1.0124e+00 3.4 1.26e+08 1.1 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 15290

VecAXPY              234 1.0 4.2862e-01 3.6 6.37e+07 1.1 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 36115


although it is not limiting the speedup. The time imbalance is really
strange. I am guessing other jobs are running on this machine.
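(To read those VecAXPY lines: the 3.4 and 3.6 after the time are the
max/min time ratios over all processes, while the 1.1 after the flop count
is the max/min flop ratio. So the work itself is balanced to about 10%, but
the time to do it varies by more than 3x between processes, which is why
shared nodes or system noise is the likely culprit.)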

   Matt


> >
> > > My question here is: is there any other PC that can help with both reducing
> iterations and increasing scalability? Thanks.
> >
> > Always send -log_summary with questions like this, but algebraic
> multigrid is a good place to start.
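(For a first algebraic multigrid try from the command line, something like
-pc_type gamg -pc_gamg_agg_nsmooths 1 uses PETSc's built-in
smoothed-aggregation AMG, or -pc_type hypre -pc_hypre_type boomeramg if
your PETSc was configured with hypre. For elliptic/Poisson-like operators
these usually cut the iteration count sharply, at the price of a more
expensive setup and per-iteration cost.)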
>
> Please take a look at the attached log files; they are for 128 cores and
> 256 cores, respectively.  Based on the log files, what should be done to
> increase the scalability? Thanks.
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener