[petsc-users] Scaling with number of cores

Matthew Knepley knepley at gmail.com
Tue Nov 3 07:01:18 CST 2015


On Tue, Nov 3, 2015 at 6:58 AM, TAY wee-beng <zonexo at gmail.com> wrote:

>
> On 3/11/2015 8:52 PM, Matthew Knepley wrote:
>
> On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>> Hi,
>>
>> I tried and have attached the log.
>>
>> Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify
>> some null space stuff?  Like KSPSetNullSpace or MatNullSpaceCreate?
>
>
> Yes, you need to attach the constant null space to the matrix.
>
>   Thanks,
>
>      Matt
>
> Ok so can you point me to a suitable example so that I know which one to
> use specifically?
>

https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761

  Matt
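A minimal sketch of "attach the constant null space to the matrix", along the
lines of the ex12.c code linked above; the matrix name A and the helper routine
are assumptions made purely for illustration, with the usual CHKERRQ error
handling:

    #include <petscmat.h>

    /* Tell PETSc that the constant vector lies in the null space of the
       singular (pure Neumann) Poisson operator A. */
    PetscErrorCode AttachConstantNullSpace(Mat A)
    {
      MatNullSpace   nullsp;
      PetscErrorCode ierr;

      /* PETSC_TRUE: the null space contains the constant vector;
         0, NULL: no additional basis vectors are supplied. */
      ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE,
                                0, NULL, &nullsp);CHKERRQ(ierr);
      ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);
      ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
      return 0;
    }

With the null space attached, the Krylov solver works in the space orthogonal
to the constant mode, which is what the singular pure-Neumann Poisson operator
needs.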


> Thanks.
>
>
>
>>
>> Thank you
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>
>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I tried :
>>>>
>>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>>
>>>> 2. -poisson_pc_type gamg
>>>>
>>>     Run with -poisson_ksp_monitor_true_residual
>>> -poisson_ksp_monitor_converged_reason
>>> Does your Poisson problem have Neumann boundary conditions? Do you have
>>> any zeros on the diagonal of the matrix (you shouldn't)?
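One hedged way to answer the zero-diagonal check above from code is to pull out
the diagonal and report its smallest magnitude entry; the routine below is only
an illustrative sketch (A is assumed to be the assembled Poisson matrix):

    #include <petscmat.h>

    /* Print the smallest |diagonal entry| of A; a value at or near zero
       flags rows with a zero diagonal, which breaks most point smoothers
       and preconditioners. */
    PetscErrorCode CheckDiagonal(Mat A)
    {
      Vec            d;
      PetscReal      minval;
      PetscInt       row;
      PetscErrorCode ierr;

      ierr = MatCreateVecs(A, &d, NULL);CHKERRQ(ierr);
      ierr = MatGetDiagonal(A, d);CHKERRQ(ierr);
      ierr = VecAbs(d);CHKERRQ(ierr);
      ierr = VecMin(d, &row, &minval);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "min |diag| = %g at row %D\n",
                         (double)minval, row);CHKERRQ(ierr);
      ierr = VecDestroy(&d);CHKERRQ(ierr);
      return 0;
    }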
>>>
>>>    There may be something wrong with your Poisson discretization that
>>> was also messing up hypre.
>>>
>>>
>>>
>>>> Both options give:
>>>>
>>>>     1      0.00150000      0.00000000      0.00000000 1.00000000
>>>>      NaN             NaN             NaN
>>>> M Diverged but why?, time =            2
>>>> reason =           -9
>>>>
>>>> How can I check what's wrong?
>>>>
>>>> Thank you
>>>>
>>>> Yours sincerely,
>>>>
>>>> TAY wee-beng
>>>>
>>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>>
>>>>>     hypre is just not scaling well here. I do not know why. Since
>>>>> hypre is a black box for us, there is no way to determine why the
>>>>> scaling is poor.
>>>>>
>>>>>     If you make the same two runs with -pc_type gamg there will be a
>>>>> lot more information in the log summary about which routines are
>>>>> scaling well or poorly.
>>>>>
>>>>>    Barry
>>>>>
>>>>>
>>>>>
>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have attached the 2 files.
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Yours sincerely,
>>>>>>
>>>>>> TAY wee-beng
>>>>>>
>>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>>
>>>>>>>    Run (158/2)x(266/2)x(150/2) grid on 8 processes  and then
>>>>>>> (158)x(266)x(150) on 64 processors  and send the two -log_summary results
>>>>>>>
>>>>>>>    Barry
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<zonexo at gmail.com>  wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have attached the new results.
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>>
>>>>>>>> Yours sincerely,
>>>>>>>>
>>>>>>>> TAY wee-beng
>>>>>>>>
>>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>>
>>>>>>>>>    Run without the -momentum_ksp_view -poisson_ksp_view and send
>>>>>>>>> the new results
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    You can see from the log summary that the PCSetUp is taking a
>>>>>>>>> much smaller percentage of the time meaning that it is reusing the
>>>>>>>>> preconditioner and not rebuilding it each time.
>>>>>>>>>
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>>    Something makes no sense with the output: it gives
>>>>>>>>>
>>>>>>>>> KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04
>>>>>>>>> 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>>>>>>>>>
>>>>>>>>> 90% of the time is in the solve but there is no significant amount
>>>>>>>>> of time in other events of the code which is just not possible. I hope it
>>>>>>>>> is due to your IO.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96
>>>>>>>>>> cores.
>>>>>>>>>>
>>>>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I
>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I not do?
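Since only the RHS changes, one hedged sketch of keeping the Poisson
preconditioner frozen is shown below; the variable names are assumptions, and
it relies on KSPSetReusePreconditioner, which is available in the PETSc
versions current at the time (3.5 and later):

    #include <petscksp.h>

    /* Solve the Poisson equation each time step while keeping the
       preconditioner that was built on the first solve. */
    PetscErrorCode SolvePoisson(KSP ksp_poisson, Mat A, Vec rhs, Vec phi)
    {
      PetscErrorCode ierr;

      /* The operator is the same unchanged Mat every step, so PETSc will
         not rebuild the preconditioner on its own ... */
      ierr = KSPSetOperators(ksp_poisson, A, A);CHKERRQ(ierr);

      /* ... and this makes the reuse explicit even if the matrix is ever
         flagged as changed (also settable on the command line with
         -poisson_ksp_reuse_preconditioner when the "poisson_" prefix is set). */
      ierr = KSPSetReusePreconditioner(ksp_poisson, PETSC_TRUE);CHKERRQ(ierr);

      ierr = KSPSolve(ksp_poisson, rhs, phi);CHKERRQ(ierr);
      return 0;
    }

With this, a -log_summary over many time steps should show PCSetUp taking a
small fraction of the run rather than being repeated every step.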
>>>>>>>>>>
>>>>>>>>>> Why does the number of processes increase so much? Is there
>>>>>>>>>> something wrong with my coding? Seems to be so too for my new run.
>>>>>>>>>>
>>>>>>>>>> Thank you
>>>>>>>>>>
>>>>>>>>>> Yours sincerely,
>>>>>>>>>>
>>>>>>>>>> TAY wee-beng
>>>>>>>>>>
>>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>>
>>>>>>>>>>>    If you are doing many time steps with the same linear solver
>>>>>>>>>>> then you MUST do your weak scaling studies with MANY time steps since the
>>>>>>>>>>> setup time of AMG only takes place in the first time step. So run both 48
>>>>>>>>>>> and 96 processes with the same large number of time steps.
>>>>>>>>>>>
>>>>>>>>>>>    Barry
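A rough cost model, not from the thread but making the point explicit: if the
AMG setup cost T_setup is paid once and each step costs T_step, then

    T_{\mathrm{total}}(N) \approx T_{\mathrm{setup}} + N \, T_{\mathrm{step}},
    \qquad
    \frac{T_{\mathrm{total}}(N)}{N} \to T_{\mathrm{step}} \quad (N \to \infty)

so a 2-step run is dominated by the one-time setup, while a 100-step run
measures the per-step cost that actually determines the scaling of interest.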
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new
>>>>>>>>>>>> log for 48 cores (log48), together with the 96 cores log (log96).
>>>>>>>>>>>>
>>>>>>>>>>>> Why does the number of processes increase so much? Is there
>>>>>>>>>>>> something wrong with my coding?
>>>>>>>>>>>>
>>>>>>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I
>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>>
>>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for
>>>>>>>>>>>> 10 timesteps (log48_10). Is it building the preconditioner at every
>>>>>>>>>>>> timestep?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>>
>>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you
>>>>>>>>>>>>
>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>
>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>    You used gmres with 48 processes but richardson with 96.
>>>>>>>>>>>>> You need to be careful and make sure you don't change the solvers when you
>>>>>>>>>>>>> change the number of processors, since you can get very different,
>>>>>>>>>>>>> inconsistent results.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Anyway, all the time is being spent in the BoomerAMG
>>>>>>>>>>>>> algebraic multigrid setup and it is scaling badly. When you double the
>>>>>>>>>>>>> problem size and number of processes it went from 3.2445e+01 to 4.3599e+02
>>>>>>>>>>>>> seconds.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0
>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11
>>>>>>>>>>>>>
>>>>>>>>>>>>> PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0
>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Now is the Poisson problem changing at each timestep, or can
>>>>>>>>>>>>> you use the same preconditioner built with BoomerAMG for all the time
>>>>>>>>>>>>> steps? Algebraic multigrid has a large setup time that often doesn't
>>>>>>>>>>>>> matter if you have many time steps, but if you have to rebuild it each
>>>>>>>>>>>>> timestep it may be too large.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    You might also try -pc_type gamg and see how PETSc's
>>>>>>>>>>>>> algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the
>>>>>>>>>>>>>>>>> limitations in memory, the scaling is not linear. So, I am trying to write
>>>>>>>>>>>>>>>>> a proposal to use a supercomputer.
>>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of
>>>>>>>>>>>>>>>>> memory per node)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my
>>>>>>>>>>>>>>>>> current code with my current set of data, and there is a formula to
>>>>>>>>>>>>>>>>> calculate the estimated parallel efficiency when using the new large set of
>>>>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed
>>>>>>>>>>>>>>>>> time varies with the number of processors for a fixed
>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time
>>>>>>>>>>>>>>>>> varies with the number of processors for a
>>>>>>>>>>>>>>>>> fixed problem size per processor.
>>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores with my current
>>>>>>>>>>>>>>>>> cluster, giving 140 and 90 mins respectively. This is classified as strong
>>>>>>>>>>>>>>>>> scaling.
>>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of
>>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how efficiently the program is
>>>>>>>>>>>>>>>>> accelerated by parallel processing. ‘En’ is given by the
>>>>>>>>>>>>>>>>> following formulae. Although their derivations differ for
>>>>>>>>>>>>>>>>> strong and weak scaling, the resulting formulae are the
>>>>>>>>>>>>>>>>> same.
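The formulae referred to above were not reproduced in the message; the textbook
definitions, which the proposal's formula presumably reduces to, are (with T_n
the elapsed time on n cores):

    \text{strong scaling: } S_n = \frac{T_1}{T_n}, \qquad
    E_n = \frac{S_n}{n} = \frac{T_1}{n\,T_n}

    \text{weak scaling: } E_n = \frac{T_1}{T_n}

    \text{Amdahl's law: } S_n = \frac{1}{(1-p) + p/n}
    \quad \text{with parallel fraction } p

In practice p is back-solved from two measured runs (here the 48- and 96-core
timings) and E_n is then extrapolated to the target machine; as Matt cautions
below, that extrapolation is unreliable because the serial fraction itself
changes with problem size.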
>>>>>>>>>>>>>>>>>  From the estimated time, my parallel efficiency using
>>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%.
>>>>>>>>>>>>>>>>> So are my results acceptable?
>>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores),
>>>>>>>>>>>>>>>>> my expected parallel efficiency is only 0.5%. The proposal recommends a
>>>>>>>>>>>>>>>>> value of > 50%.
>>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated
>>>>>>>>>>>>>>>>> serial fraction from Amdahl's Law  changes as a function
>>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling
>>>>>>>>>>>>>>>>> from one problem and apply it to another without a
>>>>>>>>>>>>>>>>> model of this dependence.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I
>>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize
>>>>>>>>>>>>>>>>> that this does not make sense for many scientific
>>>>>>>>>>>>>>>>> applications, but neither does requiring a certain
>>>>>>>>>>>>>>>>> parallel efficiency.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling, and the expected parallel
>>>>>>>>>>>>>>>> efficiency is even worse. From the formula used, it's obvious it's doing
>>>>>>>>>>>>>>>> some sort of exponentially decreasing extrapolation. So unless I can
>>>>>>>>>>>>>>>> achieve nearly a >90% speed up when I double the cores and problem size for
>>>>>>>>>>>>>>>> my current 48/96 cores setup, extrapolating from about 96 nodes to
>>>>>>>>>>>>>>>> 10,000 nodes will give a much lower expected parallel efficiency for the
>>>>>>>>>>>>>>>> new case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory
>>>>>>>>>>>>>>>> requirements, it's impossible to get a >90% speed up when I double the
>>>>>>>>>>>>>>>> cores and problem size (i.e. a linear increase in performance), which means
>>>>>>>>>>>>>>>> that I can't get a >90% speed up when I double the cores and problem size
>>>>>>>>>>>>>>>> for my current 48/96 cores setup. Is that so?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    What is the output of -ksp_view -log_summary on the
>>>>>>>>>>>>>>> problem and then on the problem doubled in size and number of processors?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    Barry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are 2 solvers - The momentum linear eqn uses bcgs,
>>>>>>>>>>>>>> while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in
>>>>>>>>>>>>>>>> my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>       Matt
>>>>>>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (>50%),
>>>>>>>>>>>>>>>>> when using 17640 (2205X8) cores?
>>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin
>>>>>>>>>>>>>>>>> their experiments is infinitely more interesting than any results to which
>>>>>>>>>>>>>>>>> their experiments lead.
>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>>>
>>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>>>
>>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>>>
>>>>>>> <log64_100.txt><log8_100.txt>
>>>>>>
>>>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener