[petsc-users] Scaling with number of cores

TAY wee-beng zonexo at gmail.com
Tue Nov 3 06:58:23 CST 2015


On 3/11/2015 8:52 PM, Matthew Knepley wrote:
> On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <zonexo at gmail.com 
> <mailto:zonexo at gmail.com>> wrote:
>
>     Hi,
>
>     I tried and have attached the log.
>
>     Ya, my Poisson eqn has Neumann boundary conditions. Do I need to
>     specify some null space stuff? Like KSPSetNullSpace or
>     MatNullSpaceCreate?
>
>
> Yes, you need to attach the constant null space to the matrix.
>
>   Thanks,
>
>      Matt
OK, so can you point me to a suitable example so that I know which one to 
use specifically?

Thanks.
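
For reference, a minimal sketch in C of attaching the constant null space to the matrix. This is only an illustration, not code from the thread; it assumes the assembled Poisson matrix is called A, and error checking is omitted:

    MatNullSpace nullsp;
    /* PETSC_TRUE: the null space consists of the constant vector */
    MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
    /* the KSP removes this null space from the residual/solution during the solve */
    MatSetNullSpace(A, nullsp);
    MatNullSpaceDestroy(&nullsp);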
>
>
>     Thank you
>
>     Yours sincerely,
>
>     TAY wee-beng
>
>     On 3/11/2015 12:45 PM, Barry Smith wrote:
>
>             On Nov 2, 2015, at 10:37 PM, TAY wee-beng<zonexo at gmail.com
>             <mailto:zonexo at gmail.com>> wrote:
>
>             Hi,
>
>             I tried :
>
>             1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>
>             2. -poisson_pc_type gamg
>
>             Run with -poisson_ksp_monitor_true_residual
>         -poisson_ksp_monitor_converged_reason
>         Does your poisson have Neumann boundary conditions? Do you
>         have any zeros on the diagonal for the matrix (you shouldn't).
>
>            There may be something wrong with your poisson
>         discretization that was also messing up hypre
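
For reference, a full invocation with these options might look like the following; this assumes the executable is ./a.out, a 48-rank run, and that the Poisson solve uses the poisson_ options prefix:

    mpiexec -n 48 ./a.out -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1 \
        -poisson_ksp_monitor_true_residual -poisson_ksp_converged_reason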
>
>
>
>             Both options give:
>
>                 1      0.00150000      0.00000000 0.00000000 1.00000000             NaN  NaN             NaN
>             M Diverged but why?, time =            2
>             reason =           -9
>
>             How can I check what's wrong?
>
>             Thank you
>
>             Yours sincerely,
>
>             TAY wee-beng
>
>             On 3/11/2015 3:18 AM, Barry Smith wrote:
>
>                     hypre is just not scaling well here. I do not know
>                 why. Since hypre is a black box for us, there is no way
>                 to determine why the scaling is poor.
>
>                     If you make the same two runs with -pc_type gamg,
>                 there will be a lot more information in the log summary
>                 about which routines are scaling well or poorly.
>
>                    Barry
>
>
>
>                     On Nov 2, 2015, at 3:17 AM, TAY
>                     wee-beng<zonexo at gmail.com
>                     <mailto:zonexo at gmail.com>> wrote:
>
>                     Hi,
>
>                     I have attached the 2 files.
>
>                     Thank you
>
>                     Yours sincerely,
>
>                     TAY wee-beng
>
>                     On 2/11/2015 2:55 PM, Barry Smith wrote:
>
>                            Run (158/2)x(266/2)x(150/2) grid on 8
>                         processes  and then (158)x(266)x(150) on 64
>                         processors  and send the two -log_summary results
>
>                            Barry
>
>
>                             On Nov 2, 2015, at 12:19 AM, TAY
>                             wee-beng<zonexo at gmail.com
>                             <mailto:zonexo at gmail.com>> wrote:
>
>                             Hi,
>
>                             I have attached the new results.
>
>                             Thank you
>
>                             Yours sincerely,
>
>                             TAY wee-beng
>
>                             On 2/11/2015 12:27 PM, Barry Smith wrote:
>
>                                    Run without the -momentum_ksp_view
>                                 -poisson_ksp_view and send the new results
>
>
>                                    You can see from the log summary
>                                 that the PCSetUp is taking a much
>                                 smaller percentage of the time meaning
>                                 that it is reusing the preconditioner
>                                 and not rebuilding it each time.
>
>                                 Barry
>
>                                    Something makes no sense with the
>                                 output: it gives
>
>                                 KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>
>                                 90% of the time is in the solve, but there is no
>                                 significant amount of time in other events of the
>                                 code, which is just not possible. I hope it is due
>                                 to your I/O.
>
>
>
>                                     On Nov 1, 2015, at 10:02 PM, TAY
>                                     wee-beng<zonexo at gmail.com
>                                     <mailto:zonexo at gmail.com>> wrote:
>
>                                     Hi,
>
>                                     I have attached the new run with
>                                     100 time steps for 48 and 96 cores.
>
>                                     Only the Poisson eqn's RHS changes; the LHS
>                                     doesn't. So if I want to reuse the
>                                     preconditioner, what must I do? Or what must I
>                                     not do?
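
For reference, a minimal sketch in C of keeping one preconditioner across time steps. This is only an illustration, not code from the thread; it assumes the Poisson solver object is called ksp, the unchanging matrix is A, the vectors are b and x, and error checking is omitted:

    /* before the time loop: set the operator once */
    KSPSetOperators(ksp, A, A);
    /* ask the KSP to keep the existing preconditioner (PETSc 3.5 and later) */
    KSPSetReusePreconditioner(ksp, PETSC_TRUE);

    /* inside the time loop: only the right-hand side b changes */
    KSPSolve(ksp, b, x);

If the matrix A is never modified and KSPSetOperators is not called again, the preconditioner should in any case only be built on the first solve.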
>
>                                     Why does the number of processes
>                                     increase so much? Is there
>                                     something wrong with my coding?
>                                     Seems to be so too for my new run.
>
>                                     Thank you
>
>                                     Yours sincerely,
>
>                                     TAY wee-beng
>
>                                     On 2/11/2015 9:49 AM, Barry Smith
>                                     wrote:
>
>                                            If you are doing many time steps with
>                                         the same linear solver then you MUST do your
>                                         weak scaling studies with MANY time steps,
>                                         since the setup time of AMG only takes place
>                                         in the first timestep. So run both 48 and 96
>                                         processes with the same large number of time
>                                         steps.
>
>                                            Barry
>
>
>
>                                             On Nov 1, 2015, at 7:35 PM, TAY wee-beng <zonexo at gmail.com <mailto:zonexo at gmail.com>> wrote:
>
>                                             Hi,
>
>                                             Sorry, I forgot and used the old a.out. I
>                                             have attached the new log for 48 cores
>                                             (log48), together with the 96 cores log
>                                             (log96).
>
>                                             Why does the number of
>                                             processes increase so
>                                             much? Is there something
>                                             wrong with my coding?
>
>                                             Only the Poisson eqn's RHS changes; the
>                                             LHS doesn't. So if I want to reuse the
>                                             preconditioner, what must I do? Or what
>                                             must I not do?
>
>                                             Lastly, I only simulated 2
>                                             time steps previously. Now
>                                             I run for 10 timesteps
>                                             (log48_10). Is it building
>                                             the preconditioner at
>                                             every timestep?
>
>                                             Also, what about momentum
>                                             eqn? Is it working well?
>
>                                             I will try the gamg later too.
>
>                                             Thank you
>
>                                             Yours sincerely,
>
>                                             TAY wee-beng
>
>                                             On 2/11/2015 12:30 AM, Barry Smith wrote:
>
>                                                    You used gmres with 48 processes but richardson with 96.
>                                                 You need to be careful and make sure you don't change the
>                                                 solvers when you change the number of processors, since you
>                                                 can get very different, inconsistent results.
>
>                                                     Anyway, all the time is being spent in the BoomerAMG
>                                                 algebraic multigrid setup and it is scaling badly. When you
>                                                 double the problem size and number of processes it went from
>                                                 3.2445e+01 to 4.3599e+02 seconds.
>
>                                                 PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0     5    11
>
>                                                 PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0     6     2
>
>                                                    Now, is the Poisson problem changing at each timestep, or
>                                                 can you use the same preconditioner built with BoomerAMG for
>                                                 all the time steps? Algebraic multigrid has a large setup
>                                                 time that often doesn't matter if you have many time steps,
>                                                 but if you have to rebuild it each timestep it may be too
>                                                 large.
>
>                                                    You might also try -pc_type gamg and see how PETSc's
>                                                 algebraic multigrid scales for your problem/machine.
>
>                                                    Barry
>
>
>
>                                                     On Nov 1, 2015, at 7:30 AM, TAY wee-beng <zonexo at gmail.com <mailto:zonexo at gmail.com>> wrote:
>
>                                                     On 1/11/2015 10:00 AM, Barry Smith wrote:
>
>                                                             On Oct 31, 2015, at 8:43 PM, TAY wee-beng <zonexo at gmail.com <mailto:zonexo at gmail.com>> wrote:
>
>                                                             On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>
>                                                                 On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <zonexo at gmail.com <mailto:zonexo at gmail.com>> wrote:
>                                                                 Hi,
>
>                                                                 I understand that, as mentioned in the FAQ, due to
>                                                                 the limitations in memory, the scaling is not
>                                                                 linear. So I am trying to write a proposal to use
>                                                                 a supercomputer. Its specs are:
>
>                                                                 Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB
>                                                                 of memory per node)
>                                                                 8 cores / processor
>                                                                 Interconnect: Tofu (6-dimensional mesh/torus)
>                                                                 interconnect
>                                                                 Each cabinet contains 96 computing nodes.
>
>                                                                 One of the requirements is to give the performance
>                                                                 of my current code with my current set of data,
>                                                                 and there is a formula to calculate the estimated
>                                                                 parallel efficiency when using the new large set
>                                                                 of data.
>
>                                                                 There are 2 ways to give performance:
>                                                                 1. Strong scaling, which is defined as how the
>                                                                 elapsed time varies with the number of processors
>                                                                 for a fixed problem.
>                                                                 2. Weak scaling, which is defined as how the
>                                                                 elapsed time varies with the number of processors
>                                                                 for a fixed problem size per processor.
>
>                                                                 I ran my cases with 48 and 96 cores with my
>                                                                 current cluster, giving 140 and 90 mins
>                                                                 respectively. This is classified as strong
>                                                                 scaling.
>
>                                                                 Cluster specs:
>                                                                 CPU: AMD 6234 2.4GHz
>                                                                 8 cores / processor (CPU)
>                                                                 6 CPU / node
>                                                                 So 48 cores / node
>                                                                 Not sure about the memory / node
>
>                                                                 The parallel efficiency ‘En’ for a given degree of
>                                                                 parallelism ‘n’ indicates how efficiently the
>                                                                 program is accelerated by parallel processing.
>                                                                 ‘En’ is given by the following formulae. Although
>                                                                 their derivation processes differ for strong and
>                                                                 weak scaling, the derived formulae are the same.
>
>                                                                 From the estimated time, my parallel efficiency
>                                                                 using Amdahl's law on the current old cluster was
>                                                                 52.7%. So are my results acceptable?
>
>                                                                 For the large data set, if using 2205 nodes
>                                                                 (2205 x 8 cores), my expected parallel efficiency
>                                                                 is only 0.5%. The proposal recommends a value of
>                                                                 > 50%.
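
For reference, the proposal's own formula is not reproduced in the quoted text above. The usual textbook definitions, which are an assumption here and may differ from the proposal's formula, are, in LaTeX notation:

    % strong scaling: fixed total problem size; weak scaling: fixed size per process
    E_{\mathrm{strong}}(n) = \frac{T_1}{n\,T_n}, \qquad
    E_{\mathrm{weak}}(n)   = \frac{T_1}{T_n}

Measured between two strong-scaling runs on p and q cores, the relative efficiency is (p * T_p) / (q * T_q); for the 140 min on 48 cores versus 90 min on 96 cores quoted above, this gives (48 * 140) / (96 * 90) ≈ 0.78, i.e. about 78% efficiency for that doubling (a speedup of 140/90 ≈ 1.56 instead of the ideal 2).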
>                                                                 The problem with this analysis is that the
>                                                                 estimated serial fraction from Amdahl's Law
>                                                                 changes as a function of problem size, so you
>                                                                 cannot take the strong scaling from one problem
>                                                                 and apply it to another without a model of this
>                                                                 dependence.
>
>                                                                 Weak scaling does model changes with problem size,
>                                                                 so I would measure weak scaling on your current
>                                                                 cluster, and extrapolate to the big machine. I
>                                                                 realize that this does not make sense for many
>                                                                 scientific applications, but neither does
>                                                                 requiring a certain parallel efficiency.
>
>                                                             OK, I checked the results for my weak scaling and the
>                                                             expected parallel efficiency is even worse. From the
>                                                             formula used, it's obvious it's doing some sort of
>                                                             exponentially decreasing extrapolation. So unless I
>                                                             can achieve a nearly > 90% speed-up when I double the
>                                                             cores and problem size for my current 48/96 cores
>                                                             setup, extrapolating from about 96 nodes to 10,000
>                                                             nodes will give a much lower expected parallel
>                                                             efficiency for the new case.
>
>                                                             However, it's mentioned in the FAQ that due to memory
>                                                             requirements it's impossible to get > 90% speed-up
>                                                             when I double the cores and problem size (i.e. a
>                                                             linear increase in performance), which means that I
>                                                             can't get > 90% speed-up when I double the cores and
>                                                             problem size for my current 48/96 cores setup. Is
>                                                             that so?
>
>                                                            What is the output of -ksp_view -log_summary on the
>                                                         problem and then on the problem doubled in size and
>                                                         number of processors?
>
>                                                            Barry
>
>                                                     Hi,
>
>                                                     I have attached the output.
>
>                                                     48 cores: log48
>                                                     96 cores: log96
>
>                                                     There are 2 solvers - The momentum linear eqn uses bcgs,
>                                                     while the Poisson eqn uses hypre BoomerAMG.
>
>                                                     Problem size doubled from 158x266x150 to 158x266x300.
>
>                                                             So is it fair to say that the main problem does not
>                                                             lie in my programming skills, but rather in the way
>                                                             the linear equations are solved?
>
>                                                             Thanks.
>
>                                                                    Thanks,
>
>                                                                       Matt
>
>                                                                 Is it possible for this type of scaling in PETSc
>                                                                 (> 50%) when using 17640 (2205 x 8) cores? Btw, I
>                                                                 do not have access to the system.
>
>                                                                 Sent using CloudMagic Email
>
>                                                                 --
>                                                                 What most experimenters take for granted before
>                                                                 they begin their experiments is infinitely more
>                                                                 interesting than any results to which their
>                                                                 experiments lead.
>                                                                 -- Norbert Wiener
>
>                                                     <log48.txt><log96.txt>
>
>                                             <log48_10.txt><log48.txt><log96.txt>
>
>                                     <log96_100.txt><log48_100.txt>
>
>                             <log96_100_2.txt><log48_100_2.txt>
>
>                     <log64_100.txt><log8_100.txt>
>
>
>
>
>
> -- 
> What most experimenters take for granted before they begin their 
> experiments is infinitely more interesting than any results to which 
> their experiments lead.
> -- Norbert Wiener
