[petsc-users] Scaling with number of cores
TAY wee-beng
zonexo at gmail.com
Tue Nov 3 09:04:56 CST 2015
On 3/11/2015 9:01 PM, Matthew Knepley wrote:
> On Tue, Nov 3, 2015 at 6:58 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>
> On 3/11/2015 8:52 PM, Matthew Knepley wrote:
>> On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I tried and have attached the log.
>>
>> Ya, my Poisson eqn has Neumann boundary condition. Do I need
>> to specify some null space stuff? Like KSPSetNullSpace or
>> MatNullSpaceCreate?
>>
>>
>> Yes, you need to attach the constant null space to the matrix.
>>
>> Thanks,
>>
>> Matt
> Ok so can you point me to a suitable example so that I know which
> one to use specifically?
>
>
> https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
>
> Matt
Hi,
Actually, I realised that for my Poisson eqn I have Neumann and Dirichlet BCs. The Dirichlet BC is at the outlet grids, where pressure = 0 is specified. So do I still need the null space?
My Poisson eqn's LHS is fixed, but the RHS changes at every timestep.
If I do need the null space, how do I know whether it contains the constant vector and what the number of vectors is? I followed the example given and added:
call MatNullSpaceCreate(MPI_COMM_WORLD,PETSC_TRUE,0,NULL,nullsp,ierr)
call MatSetNullSpace(A,nullsp,ierr)
call MatNullSpaceDestroy(nullsp,ierr)
Is that all?
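For reference, here is a minimal self-contained sketch of the pattern Matt is describing (attaching only the constant null space to the matrix). The declarations and the PETSC_NULL_OBJECT placeholder are assumptions based on the 2015-era PETSc Fortran interface (newer releases use PETSC_NULL_VEC), and the usual PETSc Fortran include files / modules are assumed to be in scope; none of this is taken verbatim from the code in this thread.

      Mat            A        ! the assembled Poisson matrix
      MatNullSpace   nullsp
      PetscErrorCode ierr

      ! has_cnst = PETSC_TRUE: the null space contains the constant vector;
      ! 0 vectors / PETSC_NULL_OBJECT: no additional user-supplied basis vectors
      call MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_TRUE,0,PETSC_NULL_OBJECT,nullsp,ierr)
      ! let the KSP/PC see the null space through the matrix
      call MatSetNullSpace(A,nullsp,ierr)
      ! the matrix keeps its own reference, so the local handle can be freed
      call MatNullSpaceDestroy(nullsp,ierr)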
Before this, I was using HYPRE's geometric multigrid solver directly, and the matrix / vector in that subroutine were written based on HYPRE. It worked pretty well and fast.
However, it's a black box and it's hard to diagnose problems.
I have always had a PETSc subroutine to solve my Poisson eqn as well, but I used KSPBCGS or KSPGMRES with HYPRE's BoomerAMG as the PC. It worked but was slow.
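As context for the comparison above, both preconditioner setups can be selected purely from the command line. A sketch, assuming the -poisson_ options prefix used elsewhere in this thread for the pressure solve:

      # HYPRE BoomerAMG as the preconditioner
      -poisson_ksp_type bcgs -poisson_pc_type hypre -poisson_pc_hypre_type boomeramg

      # PETSc's built-in algebraic multigrid (GAMG) instead
      -poisson_ksp_type bcgs -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1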
Matt: Thanks, I will see how it goes using the null space and may try "-mg_coarse_pc_type svd" later.
>
> Thanks.
>>
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>
>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I tried :
>>
>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>
>> 2. -poisson_pc_type gamg
>>
>> Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
>>
>> Does your poisson have Neumann boundary conditions? Do you have any zeros on the diagonal for the matrix (you shouldn't).
>>
>> There may be something wrong with your poisson discretization that was also messing up hypre.
>>
>>
>> Both options give:
>>
>> 1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
>> M Diverged but why?, time = 2
>> reason = -9
>>
>> How can I check what's wrong?
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>
>> hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine the reason for the poor scaling.
>>
>> If you make the same two runs with -pc_type gamg there will be a lot more information in the log summary about in which routines it is scaling well or poorly.
>>
>> Barry
>>
>>
>>
>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I have attached the 2 files.
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>
>> Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processes and send the two -log_summary results.
>>
>> Barry
>>
>>
>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I have attached the new results.
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>
>> Run without the
>> -momentum_ksp_view
>> -poisson_ksp_view and send the
>> new results
>>
>>
>> You can see from the log
>> summary that the PCSetUp is
>> taking a much smaller percentage
>> of the time meaning that it is
>> reusing the preconditioner and
>> not rebuilding it each time.
>>
>> Barry
>>
>> Something makes no sense with
>> the output: it gives
>>
>> KSPSolve 199 1.0
>> 2.3298e+03 1.0 5.20e+09 1.8
>> 3.8e+04 9.9e+05 5.0e+02 90100
>> 66100 24 90100 66100 24 165
>>
>> 90% of the time is in the solve
>> but there is no significant
>> amount of time in other events of
>> the code which is just not
>> possible. I hope it is due to
>> your IO.
>>
>>
>>
>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>
>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>
>> Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
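Since the preconditioner-reuse question recurs throughout this thread, here is a minimal sketch of the intended usage pattern when the matrix is fixed and only the RHS changes. It assumes a PETSc 3.5+ style interface and that ksp, the matrix A and the vectors b, x are created and assembled elsewhere; the variable names are illustrative, not taken from the actual code.

      PetscErrorCode ierr
      PetscInt       step, nsteps

      ! one-time setup: the operator never changes during the run
      call KSPSetOperators(ksp,A,A,ierr)
      ! optional: insist the preconditioner is built once and then reused
      call KSPSetReusePreconditioner(ksp,PETSC_TRUE,ierr)

      nsteps = 100
      do step = 1, nsteps
         ! refill only the right-hand side b for this timestep here ...
         ! the (BoomerAMG/GAMG) preconditioner is set up on the first solve only
         call KSPSolve(ksp,b,x,ierr)
      end do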
>>
>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>
>> If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
>>
>> Barry
>>
>>
>>
>> On Nov 1, 2015, at
>> 7:35 PM, TAY
>> wee-beng<zonexo at gmail.com
>> <mailto:zonexo at gmail.com>>
>> wrote:
>>
>> Hi,
>>
>> Sorry I forgot and
>> use the old a.out. I
>> have attached the new
>> log for 48cores
>> (log48), together
>> with the 96cores log
>> (log96).
>>
>> Why does the number
>> of processes increase
>> so much? Is there
>> something wrong with
>> my coding?
>>
>> Only the Poisson eqn
>> 's RHS changes, the
>> LHS doesn't. So if I
>> want to reuse the
>> preconditioner, what
>> must I do? Or what
>> must I not do?
>>
>> Lastly, I only
>> simulated 2 time
>> steps previously. Now
>> I run for 10
>> timesteps (log48_10).
>> Is it building the
>> preconditioner at
>> every timestep?
>>
>> Also, what about
>> momentum eqn? Is it
>> working well?
>>
>> I will try the gamg
>> later too.
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>
>> You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
>>
>> Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup and it is scaling badly. When you doubled the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
>>
>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>
>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>
>> Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but it may be too large if you have to rebuild it at each timestep.
>>
>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
>>
>> Barry
>>
>>
>>
>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>
>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>
>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:
>>
>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16 GB of memory per node), 8 cores / processor
>> Interconnect: Tofu (6-dimensional mesh/torus) interconnect. Each cabinet contains 96 computing nodes.
>>
>> One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.
>>
>> There are 2 ways to give performance:
>> 1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem.
>> 2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
>>
>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
>>
>> Cluster specs:
>> CPU: AMD 6234 2.4 GHz
>> 8 cores / processor (CPU)
>> 6 CPU / node
>> So 48 cores / node
>> Not sure abt the memory / node
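To make the two definitions above concrete with the numbers just quoted (140 min on 48 cores, 90 min on 96 cores), here is a minimal worked sketch using the common textbook conventions; these are not necessarily the proposal's exact formulae.

    E_{strong}(n) = \frac{n_0 \, T_{n_0}}{n \, T_n}
                  = \frac{48 \cdot 140}{96 \cdot 90} \approx 0.78 \quad \text{(relative to the 48-core run)}

    E_{weak}(n) = \frac{T_{n_0}}{T_n} \quad \text{(problem size per core held fixed)}

Note that the 0.78 is only relative to the 48-core baseline; an absolute efficiency in the proposal's sense would also need a serial (or small-core) reference time.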
>>
>> The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how much the program is efficiently accelerated by parallel processing. ‘En’ is given by the following formulae. Although their derivation processes differ between strong and weak scaling, the derived formulae are the same.
>>
>> From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
>>
>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
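The formulae themselves are not quoted above, so purely as a rough sketch under the assumption of the standard Amdahl model T(n) = T(1) [ s + (1 - s)/n ] with serial fraction s, the two measured times (140 min on 48 cores, 90 min on 96 cores) give:

    \frac{T(48)}{T(96)} = \frac{s + (1-s)/48}{s + (1-s)/96} = \frac{140}{90}
        \;\Rightarrow\; s \approx 0.008

    E_n = \frac{T(1)}{n \, T(n)} = \frac{1}{s\,n + (1-s)}
        \;\Rightarrow\; E_{96} \approx 0.56, \qquad E_{17640} \approx 0.007

So even a serial fraction below 1% pushes the projected efficiency at 17640 cores under 1%, which is at least of the same order as the 52.7% and 0.5% figures above; the exact values depend on the proposal's actual formula.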
>> The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.
>>
>> Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
>>
>> Ok, I checked the results for my weak scaling and it is even worse for the expected parallel efficiency. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speed up when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
>>
>> However, it's mentioned in the FAQ that due to the memory requirement it's impossible to get > 90% speed up when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get > 90% speed up when I double the cores and problem size for my current 48/96-core setup. Is that so?
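As a brief aside on why the extrapolation collapses so quickly: going from 96 to 17640 cores is about log_2(17640/96) \approx 7.5 doublings of both cores and problem size. If each doubling retains a fraction r of the efficiency (an assumed geometric model, not the proposal's formula), the compounded efficiency is roughly r^{7.5}:

    r = 0.9: \; 0.9^{7.5} \approx 0.45, \qquad r = 0.7: \; 0.7^{7.5} \approx 0.07

so anything much below roughly 90% retained efficiency per doubling at the 48/96-core scale already lands under the recommended 50% at full machine scale.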
>>
>> What is the output of -ksp_view -log_summary on the problem, and then on the problem doubled in size and number of processors?
>>
>> Barry
>>
>> Hi,
>>
>> I have attached the output:
>>
>> 48 cores: log48
>> 96 cores: log96
>>
>> There are 2 solvers - the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
>>
>> Problem size doubled from 158x266x150 to 158x266x300.
>>
>> So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
>>
>> Thanks.
>>
>>
>> Thanks,
>>
>>
>>
>>
>> Matt
>> Is it possible for this type of scaling in PETSc (> 50%) when using 17640 (2205 x 8) cores?
>>
>> Btw, I do not have access to the system.
>>
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>>
>> <log48.txt><log96.txt>
>>
>> <log48_10.txt><log48.txt><log96.txt>
>>
>> <log96_100.txt><log48_100.txt>
>>
>> <log96_100_2.txt><log48_100_2.txt>
>>
>> <log64_100.txt><log8_100.txt>
>>
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to
>> which their experiments lead.
>> -- Norbert Wiener
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener