[petsc-users] Scaling with number of cores

Mark Adams mfadams at lbl.gov
Tue Nov 3 08:00:20 CST 2015


If you clean the null space out of the RHS then you can probably get by
with just "-mg_coarse_pc_type svd", but you do need this, or the LU coarse
grid solver will have problems (the coarse matrix is singular). If your
solution starts drifting then you also need to set the null space on the
matrix, but this is often not needed.
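
A rough sketch of what that looks like with the usual PETSc calls (here A
and b stand for your Poisson matrix and RHS vector, error checking is
omitted, and MatNullSpaceRemove() has a slightly different signature in
older PETSc versions):

  MatNullSpace nullsp;
  /* PETSC_TRUE: the null space contains the constant vector */
  MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
  MatSetNullSpace(A, nullsp);     /* tell the solver A has this null space */
  MatNullSpaceRemove(nullsp, b);  /* clean the RHS: remove its constant component */
  MatNullSpaceDestroy(&nullsp);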

Also, after you get this working you want to check these PCSetUp times.
The setup will not scale perfectly, but this behavior indicates that
something is wrong. Hypre's default parameters are tuned for 2D problems,
and you have a 3D problem, I assume. GAMG should be fine. As a rule of
thumb, PCSetUp should not cost much more than a solve. An easy 3D Poisson
problem might require relatively more setup and a hard 2D problem
relatively less.
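
For example, a run combining the options already mentioned in this thread
(assuming "poisson_" is the options prefix of your Poisson KSP):

  -poisson_pc_type gamg -poisson_pc_gamg_agg_nsmooths 1
  -poisson_ksp_monitor_true_residual -poisson_ksp_converged_reason
  -log_summary

and then compare the PCSetUp and KSPSolve lines in the two -log_summary
outputs.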

On Tue, Nov 3, 2015 at 8:01 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Tue, Nov 3, 2015 at 6:58 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>>
>> On 3/11/2015 8:52 PM, Matthew Knepley wrote:
>>
>> On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I tried and have attached the log.
>>>
>>> Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify
>>> some null space stuff, like KSPSetNullSpace or MatNullSpaceCreate?
>>
>>
>> Yes, you need to attach the constant null space to the matrix.
>>
>>   Thanks,
>>
>>      Matt
>>
>> Ok so can you point me to a suitable example so that I know which one to
>> use specifically?
>>
>
>
> https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
>
>   Matt
>
>
>> Thanks.
>>
>>
>>
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>>
>>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I tried :
>>>>>
>>>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>>>
>>>>> 2. -poisson_pc_type gamg
>>>>>
>>>>     Run with -poisson_ksp_monitor_true_residual
>>>> -poisson_ksp_converged_reason.
>>>> Does your Poisson problem have Neumann boundary conditions? Do you have
>>>> any zeros on the diagonal of the matrix? (You shouldn't.)
>>>>
>>>>    There may be something wrong with your Poisson discretization that
>>>> was also messing up hypre.
>>>>
>>>>
>>>>
>>>>> Both options give:
>>>>>
>>>>>     1      0.00150000      0.00000000      0.00000000 1.00000000
>>>>>        NaN             NaN             NaN
>>>>> M Diverged but why?, time =            2
>>>>> reason =           -9
>>>>>
>>>>> How can I check what's wrong?
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>>>
>>>>>>     hypre is just not scaling well here. I do not know why. Since
>>>>>> hypre is a black box for us, there is no way to determine why the
>>>>>> scaling is poor.
>>>>>>
>>>>>>     If you make the same two runs with -pc_type gamg there will be a
>>>>>> lot more information in the log summary about which routines are
>>>>>> scaling well or poorly.
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have attached the 2 files.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>>>
>>>>>>>>    Run a (158/2)x(266/2)x(150/2) grid on 8 processes and then a
>>>>>>>> (158)x(266)x(150) grid on 64 processes, and send the two -log_summary results.
>>>>>>>>
>>>>>>>>    Barry
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<zonexo at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the new results.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>
>>>>>>>>> Yours sincerely,
>>>>>>>>>
>>>>>>>>> TAY wee-beng
>>>>>>>>>
>>>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>>>
>>>>>>>>>>    Run without the -momentum_ksp_view -poisson_ksp_view and send
>>>>>>>>>> the new results
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    You can see from the log summary that the PCSetUp is taking a
>>>>>>>>>> much smaller percentage of the time meaning that it is reusing the
>>>>>>>>>> preconditioner and not rebuilding it each time.
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>    Something makes no sense with the output: it gives
>>>>>>>>>>
>>>>>>>>>> KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04
>>>>>>>>>> 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>>>>>>>>>>
>>>>>>>>>> 90% of the time is in the solve but there is no significant
>>>>>>>>>> amount of time in other events of the code which is just not possible. I
>>>>>>>>>> hope it is due to your IO.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng<zonexo at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96
>>>>>>>>>>> cores.
>>>>>>>>>>>
>>>>>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I
>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>
>>>>>>>>>>> Why does the time increase so much with the number of processes? Is there
>>>>>>>>>>> something wrong with my coding? It seems to be so for my new run too.
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>
>>>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>>>
>>>>>>>>>>>>    If you are doing many time steps with the same linear solver
>>>>>>>>>>>> then you MUST do your weak scaling studies with MANY time steps since the
>>>>>>>>>>>> setup time of AMG only takes place in the first timestep. So run both 48
>>>>>>>>>>>> and 96 processes with the same large number of time steps.
>>>>>>>>>>>>
>>>>>>>>>>>>    Barry
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new
>>>>>>>>>>>>> log for 48cores (log48), together with the 96cores log (log96).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Why does the time increase so much with the number of processes? Is there
>>>>>>>>>>>>> something wrong with my coding?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I
>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run
>>>>>>>>>>>>> for 10 timesteps (log48_10). Is it building the preconditioner at every
>>>>>>>>>>>>> timestep?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>>
>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>    You used gmres with 48 processes but richardson with 96.
>>>>>>>>>>>>>> You need to be careful and make sure you don't change the solvers when you
>>>>>>>>>>>>>> change the number of processors, since you can get very different,
>>>>>>>>>>>>>> inconsistent results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     Anyway, all the time is being spent in the BoomerAMG
>>>>>>>>>>>>>> algebraic multigrid setup, and it is scaling badly. When you double the
>>>>>>>>>>>>>> problem size and number of processes it went from 3.2445e+01 to 4.3599e+02
>>>>>>>>>>>>>> seconds.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0
>>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0
>>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Now, is the Poisson problem changing at each timestep, or
>>>>>>>>>>>>>> can you use the same preconditioner built with BoomerAMG for all the time
>>>>>>>>>>>>>> steps? Algebraic multigrid has a large setup time that often doesn't
>>>>>>>>>>>>>> matter if you have many time steps, but if you have to rebuild it each
>>>>>>>>>>>>>> timestep it is too large.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    You might also try -pc_type gamg and see how PETSc's
>>>>>>>>>>>>>> algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Barry
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the
>>>>>>>>>>>>>>>>>> limitations in memory the scaling is not linear. So I am trying to write
>>>>>>>>>>>>>>>>>> a proposal to use a supercomputer.
>>>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of
>>>>>>>>>>>>>>>>>> memory per node)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my
>>>>>>>>>>>>>>>>>> current code with my current set of data, and there is a formula to
>>>>>>>>>>>>>>>>>> calculate the estimated parallel efficiency when using the new, larger
>>>>>>>>>>>>>>>>>> set of data.
>>>>>>>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed
>>>>>>>>>>>>>>>>>> time varies with the number of processors for a fixed
>>>>>>>>>>>>>>>>>> total problem size.
>>>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time
>>>>>>>>>>>>>>>>>> varies with the number of processors for a fixed problem
>>>>>>>>>>>>>>>>>> size per processor.
>>>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores with my current
>>>>>>>>>>>>>>>>>> cluster, giving 140 and 90 mins respectively. This is classified as strong
>>>>>>>>>>>>>>>>>> scaling.
>>>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of
>>>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how efficiently the program is
>>>>>>>>>>>>>>>>>> accelerated by parallel processing. ‘En’ is given by the
>>>>>>>>>>>>>>>>>> following formulae. Although their derivations differ
>>>>>>>>>>>>>>>>>> between strong and weak scaling, the resulting formulae
>>>>>>>>>>>>>>>>>> are the same.
>>>>>>>>>>>>>>>>>>  From the estimated time, my parallel efficiency using
>>>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%.
>>>>>>>>>>>>>>>>>> So are my results acceptable?
>>>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes
>>>>>>>>>>>>>>>>>> (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal
>>>>>>>>>>>>>>>>>> recommends a value of > 50%.
>>>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated
>>>>>>>>>>>>>>>>>> serial fraction from Amdahl's Law  changes as a function
>>>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling
>>>>>>>>>>>>>>>>>> from one problem and apply it to another without a
>>>>>>>>>>>>>>>>>> model of this dependence.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I
>>>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize
>>>>>>>>>>>>>>>>>> that this does not make sense for many scientific
>>>>>>>>>>>>>>>>>> applications, but neither does requiring a certain
>>>>>>>>>>>>>>>>>> parallel efficiency.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling and the expected
>>>>>>>>>>>>>>>>> parallel efficiency is even worse. From the formula used, it's
>>>>>>>>>>>>>>>>> obvious it's doing some sort of exponentially decreasing extrapolation. So
>>>>>>>>>>>>>>>>> unless I can achieve a near >90% speed-up when I double the cores and
>>>>>>>>>>>>>>>>> problem size for my current 48/96 cores setup, extrapolating from about
>>>>>>>>>>>>>>>>> 96 nodes to 10,000 nodes will give a much lower expected parallel
>>>>>>>>>>>>>>>>> efficiency for the new case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that due to memory
>>>>>>>>>>>>>>>>> requirements, it's impossible to get a >90% speed-up when I double the cores
>>>>>>>>>>>>>>>>> and problem size (i.e. a linear increase in performance), which means that I
>>>>>>>>>>>>>>>>> can't get a >90% speed-up when I double the cores and problem size for my
>>>>>>>>>>>>>>>>> current 48/96 cores setup. Is that so?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    What is the output of -ksp_view -log_summary on the
>>>>>>>>>>>>>>>> problem and then on the problem doubled in size and number of processors?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Barry
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are 2 solvers - The momentum linear eqn uses bcgs,
>>>>>>>>>>>>>>> while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in
>>>>>>>>>>>>>>>>> my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>       Matt
>>>>>>>>>>>>>>>>>> Is it possible to get this type of scaling in PETSc (>50%)
>>>>>>>>>>>>>>>>>> when using 17640 (2205 x 8) cores?
>>>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they
>>>>>>>>>>>>>>>>>> begin their experiments is infinitely more interesting than any results to
>>>>>>>>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>>>>
>>>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>>>>
>>>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>>>>
>>>>>>>> <log64_100.txt><log8_100.txt>
>>>>>>>
>>>>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>