[petsc-users] Scaling with number of cores
Matthew Knepley
knepley at gmail.com
Tue Nov 3 06:52:30 CST 2015
On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <zonexo at gmail.com> wrote:
> Hi,
>
> I tried and have attached the log.
>
> Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify
> a null space, e.g. with KSPSetNullSpace or MatNullSpaceCreate?
Yes, you need to attach the constant null space to the matrix.
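A minimal sketch of what that looks like (A_poisson below stands for whatever
your Poisson matrix is called in your code):

  MatNullSpace nullsp;
  /* PETSC_TRUE: the null space contains the constant vector */
  MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
  MatSetNullSpace(A_poisson, nullsp);   /* attach it to the Poisson matrix */
  MatNullSpaceDestroy(&nullsp);

Also check that the right hand side is consistent (for a symmetric
discretization this means it is orthogonal to the constant vector, i.e. it
sums to zero).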
Thanks,
Matt
>
> Thank you
>
> Yours sincerely,
>
> TAY wee-beng
>
> On 3/11/2015 12:45 PM, Barry Smith wrote:
>
>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng<zonexo at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I tried :
>>>
>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>
>>> 2. -poisson_pc_type gamg
>>>
>> Run with -poisson_ksp_monitor_true_residual
>> -poisson_ksp_monitor_converged_reason
>> Does your Poisson problem have Neumann boundary conditions? Do you have any
>> zeros on the diagonal of the matrix (you shouldn't).
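One quick way to check the diagonal (a sketch; A stands for the assembled
Poisson matrix):

  Vec       d;
  PetscReal dmin;
  MatCreateVecs(A, &d, NULL);   /* vector with the same layout as A */
  MatGetDiagonal(A, d);
  VecAbs(d);                    /* look at |a_ii| so the sign convention does not matter */
  VecMin(d, NULL, &dmin);
  PetscPrintf(PETSC_COMM_WORLD, "smallest |diagonal entry| = %g\n", (double)dmin);
  VecDestroy(&d);

dmin should be well away from zero.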
>>
>> There may be something wrong with your Poisson discretization that was
>> also messing up hypre.
>>
>>
>>
>>> Both options give:
>>>
>>> 1 0.00150000 0.00000000 0.00000000 1.00000000
>>> NaN NaN NaN
>>> M Diverged but why?, time = 2
>>> reason = -9
>>>
>>> How can I check what's wrong?
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>
>>>> hypre is just not scaling well here. I do not know why. Since hypre
>>>> is a black box for us, there is no way to determine why the scaling is poor.
>>>>
>>>> If you make the same two runs with -pc_type gamg there will be a
>>>> lot more information in the log summary about which routines are
>>>> scaling well or poorly.
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng<zonexo at gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have attached the 2 files.
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>
>>>>>> Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then the
>>>>>> 158x266x150 grid on 64 processes, and send the two -log_summary results.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<zonexo at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have attached the new results.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>
>>>>>>>> Run without the -momentum_ksp_view -poisson_ksp_view and send
>>>>>>>> the new results
>>>>>>>>
>>>>>>>>
>>>>>>>> You can see from the log summary that the PCSetUp is taking a
>>>>>>>> much smaller percentage of the time meaning that it is reusing the
>>>>>>>> preconditioner and not rebuilding it each time.
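In code the reuse amounts to something like the sketch below (ksp_poisson,
A_poisson, rhs, phi and nsteps are placeholder names for whatever the program
actually uses):

  PetscInt step;
  /* Set the operator once, outside the time loop; BoomerAMG is built here */
  KSPSetOperators(ksp_poisson, A_poisson, A_poisson);
  KSPSetUp(ksp_poisson);
  for (step = 0; step < nsteps; step++) {
    /* ... update only the right hand side for this time step ... */
    KSPSolve(ksp_poisson, rhs, phi);  /* preconditioner is reused, not rebuilt */
  }

If the matrix is reassembled every step with identical values, calling
KSPSetReusePreconditioner(ksp_poisson, PETSC_TRUE) is another way to keep the
preconditioner fixed.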
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> Something makes no sense with the output: it gives
>>>>>>>>
>>>>>>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04
>>>>>>>> 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
>>>>>>>>
>>>>>>>> 90% of the time is in the solve, but there is no significant amount
>>>>>>>> of time in other events of the code, which is just not possible. I hope it
>>>>>>>> is due to your I/O.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng<zonexo at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96
>>>>>>>>> cores.
>>>>>>>>>
>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want
>>>>>>>>> to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>
>>>>>>>>> Why does the number of processes increase so much? Is there
>>>>>>>>> something wrong with my coding? Seems to be so too for my new run.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>
>>>>>>>>> Yours sincerely,
>>>>>>>>>
>>>>>>>>> TAY wee-beng
>>>>>>>>>
>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>
>>>>>>>>>> If you are doing many time steps with the same linear solver
>>>>>>>>>> then you MUST do your weak scaling studies with MANY time steps since the
>>>>>>>>>> setup time of AMG only takes place in the first timestep. So run both 48
>>>>>>>>>> and 96 processes with the same large number of time steps.
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng<zonexo at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new
>>>>>>>>>>> log for 48 cores (log48), together with the 96 cores log (log96).
>>>>>>>>>>>
>>>>>>>>>>> Why does the number of processes increase so much? Is there
>>>>>>>>>>> something wrong with my coding?
>>>>>>>>>>>
>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I
>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>
>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for
>>>>>>>>>>> 10 timesteps (log48_10). Is it building the preconditioner at every
>>>>>>>>>>> timestep?
>>>>>>>>>>>
>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>
>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>
>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You used gmres with 48 processes but richardson with 96. You
>>>>>>>>>>>> need to be careful and make sure you don't change the solvers when you
>>>>>>>>>>>> change the number of processes, since you can get very different,
>>>>>>>>>>>> inconsistent results.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyways, all the time is being spent in the BoomerAMG
>>>>>>>>>>>> algebraic multigrid setup and it is scaling badly. When you double the
>>>>>>>>>>>> problem size and number of processes, it went from 3.2445e+01 to 4.3599e+02
>>>>>>>>>>>> seconds.
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0
>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0
>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>>>>>>>>>>
>>>>>>>>>>>> Now, is the Poisson problem changing at each timestep, or can
>>>>>>>>>>>> you use the same preconditioner built with BoomerAMG for all the time
>>>>>>>>>>>> steps? Algebraic multigrid has a large setup time that often doesn't
>>>>>>>>>>>> matter if you have many time steps, but if you have to rebuild it each
>>>>>>>>>>>> timestep it may be too large.
>>>>>>>>>>>>
>>>>>>>>>>>> You might also try -pc_type gamg and see how PETSc's
>>>>>>>>>>>> algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>
>>>>>>>>>>>> Barry
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng<zonexo at gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng<zonexo at gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng<
>>>>>>>>>>>>>>>> zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to
>>>>>>>>>>>>>>>> memory limitations the scaling is not linear. So I am trying to write
>>>>>>>>>>>>>>>> a proposal to use a supercomputer.
>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory
>>>>>>>>>>>>>>>> per node)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my
>>>>>>>>>>>>>>>> current code with my current set of data, and there is a formula to
>>>>>>>>>>>>>>>> calculate the estimated parallel efficiency when using the new, larger set
>>>>>>>>>>>>>>>> of data.
>>>>>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time
>>>>>>>>>>>>>>>> varies with the number of processors for a fixed
>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time
>>>>>>>>>>>>>>>> varies with the number of processors for a
>>>>>>>>>>>>>>>> fixed problem size per processor.
>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores with my current
>>>>>>>>>>>>>>>> cluster, giving 140 and 90 mins respectively. This is classified as strong
>>>>>>>>>>>>>>>> scaling.
>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of
>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how efficiently the program is
>>>>>>>>>>>>>>>> accelerated by parallel processing. ‘En’ is given by the following
>>>>>>>>>>>>>>>> formulae. Although their derivations differ for strong and weak scaling,
>>>>>>>>>>>>>>>> the resulting formulae are the same.
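For reference, the usual Amdahl-type strong-scaling expressions (assuming the
proposal uses the standard form; s is the serial fraction and T_n the elapsed
time on n cores) are:

  S_n = \frac{T_1}{T_n} = \frac{1}{s + (1 - s)/n},
  \qquad
  E_n = \frac{S_n}{n} = \frac{1}{n\,s + (1 - s)}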
>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using
>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%.
>>>>>>>>>>>>>>>> So are my results acceptable?
>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores),
>>>>>>>>>>>>>>>> my expected parallel efficiency is only 0.5%. The proposal recommends a
>>>>>>>>>>>>>>>> value of > 50%.
>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated serial
>>>>>>>>>>>>>>>> fraction from Amdahl's Law changes as a function
>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling from
>>>>>>>>>>>>>>>> one problem and apply it to another without a
>>>>>>>>>>>>>>>> model of this dependence.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I
>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize that
>>>>>>>>>>>>>>>> this does not make sense for many scientific
>>>>>>>>>>>>>>>> applications, but neither does requiring a certain parallel
>>>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OK, I checked the results for my weak scaling, and the expected
>>>>>>>>>>>>>>> parallel efficiency is even worse. From the formula used, it's obvious
>>>>>>>>>>>>>>> it's doing some sort of exponentially decreasing extrapolation. So unless
>>>>>>>>>>>>>>> I can achieve nearly > 90% speedup when I double the cores and problem
>>>>>>>>>>>>>>> size for my current 48/96 cores setup, extrapolating from about 96 nodes
>>>>>>>>>>>>>>> to 10,000 nodes will give a much lower expected parallel efficiency for
>>>>>>>>>>>>>>> the new case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory
>>>>>>>>>>>>>>> requirements, it's impossible to get >90% speedup when I double the cores
>>>>>>>>>>>>>>> and problem size (i.e. a linear increase in performance), which means that
>>>>>>>>>>>>>>> I can't get >90% speedup when I double the cores and problem size for my
>>>>>>>>>>>>>>> current 48/96 cores setup. Is that so?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What is the output of -ksp_view -log_summary on the
>>>>>>>>>>>>>> problem and then on the problem doubled in size and number of processors?
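That is, something along these lines (launcher and executable name are only
examples of how the runs might be invoked):

  # original problem on 48 processes
  mpiexec -n 48 ./a.out -poisson_ksp_view -momentum_ksp_view -log_summary > log48.txt
  # problem doubled in size on 96 processes
  mpiexec -n 96 ./a.out -poisson_ksp_view -momentum_ksp_view -log_summary > log96.txt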
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>
>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are 2 solvers - The momentum linear eqn uses bcgs, while
>>>>>>>>>>>>> the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in my
>>>>>>>>>>>>>>> programming skills, but rather the way the linear equations are solved?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>>>> Is this type of scaling (> 50% parallel efficiency) possible in
>>>>>>>>>>>>>>>> PETSc when using 17640 (2205 x 8) cores?
>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin
>>>>>>>>>>>>>>>> their experiments is infinitely more interesting than any results to which
>>>>>>>>>>>>>>>> their experiments lead.
>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>>
>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>>
>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>>
>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>>
>>>>>> <log64_100.txt><log8_100.txt>
>>>>>
>>>>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener