[petsc-users] Configuring petsc with MPI on ubuntu quad-core

Wed Feb 2 23:18:46 CST 2011

On Wed, Feb 2, 2011 at 11:13 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:

> Barry,
>
> I understand what you are saying, but which example/options then is the
> best one to measure scalability on a multi-core machine? I chose
> the nonlinear diffusion problem specifically because its inherent
> stiffness could probably provide noticeable scalability on a
> multi-core system. From your experience, do you think there is another
> example program that will demonstrate this more rigorously or
> clearly? Btw, I don't get good speedup even for 2 processes with
> ex20.c, and that was the original motivation for this thread.
>

Very simply, Barry said your coarse grid is way too big. Make it smaller
and you will see speedup.
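
For a rough feel of why shrinking the coarse grid matters, here is a minimal Python sketch of the arithmetic. It is an illustration only, not PETSc output. It assumes the 74 percent figure from Barry's two-process log (quoted below) is the share of the solve that every process repeats, and that everything else scales perfectly, which a memory bandwidth limited code will not do. The function names are made up for this example.

```python
def best_case_speedup(f_redundant, nprocs):
    """Best-case T_1 / T_p when a fraction f_redundant of the *p-process*
    run time is work that every process repeats (i.e. not parallel)."""
    # In units of T_p: the redundant part costs the same on one process,
    # while the parallel part (1 - f) would take nprocs times longer serially.
    return f_redundant + nprocs * (1.0 - f_redundant)


def serial_fraction(f_redundant, nprocs):
    """Convert that fraction into the serial fraction of the 1-process run."""
    return f_redundant / best_case_speedup(f_redundant, nprocs)


def amdahl_speedup(s, nprocs):
    """Amdahl's law: speedup on nprocs when a fraction s of the
    single-process run cannot be parallelized."""
    return 1.0 / (s + (1.0 - s) / nprocs)


f2 = 0.74                    # assumed: redundant share of the 2-process solve
s = serial_fraction(f2, 2)   # share of the 1-process solve it corresponds to

for p in (2, 4, 8):
    print(f"{p} processes: at most {amdahl_speedup(s, p):.2f}x")
print(f"limit for many processes: {1.0 / s:.2f}x")
```

Under these assumptions the bound is about 1.26x on two processes and roughly 1.7x no matter how many cores are added, so the redundant coarse LU has to shrink before anything else can show scaling.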

   Matt

> Satish, I configured with --download-mpich now, without the
> mpich-device. The results are given above. I will try with the options
> you provided, although I don't entirely understand what they mean, which
> kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>
> Vijay
>
> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Ok, everything makes sense. Looks like you are using two level
> > multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
> > -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
> > problem redundantly on each process (each process is solving the entire
> > coarse grid solve using LU factorization). The time for the factorization is
> > (in the two process case)
> >
> > MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
> > MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
> >
> > which is 74 percent of the total solve time (and 82 percent of the
> > flops). When three quarters of the entire run is not parallel at all you cannot
> > expect much speedup. If you run with -snes_view it will display exactly the
> > solver being used. You cannot expect to understand the performance if you
> > don't understand what the solver is actually doing. Using a 20 by 20 by 20
> > coarse grid is generally a bad idea since the code spends most of the time
> > there; stick with something like 5 by 5 by 5.
> >
> >  I suggest running with the default grid and -dmmg_nlevels 5; then the
> > time spent in the coarse solve will be a trivial fraction of the run time.
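
To see how small the coarse problem becomes relative to the fine one, here is a short Python sketch of the grid hierarchy. It assumes each refinement takes n points per direction to 2(n-1)+1 (the usual rule for a non-periodic structured grid refined by a factor of two), one unknown per grid point, and a 5 by 5 by 5 coarse grid as suggested above. None of these is confirmed for ex20 in this thread, and `grid_sizes` is a hypothetical helper, not a PETSc call.

```python
def grid_sizes(coarse_points, nlevels):
    """Points per direction on each level, coarsest first (assumed 2(n-1)+1 rule)."""
    sizes = [coarse_points]
    for _ in range(nlevels - 1):
        sizes.append(2 * (sizes[-1] - 1) + 1)
    return sizes


def summarize(coarse_points, nlevels, nprocs):
    """Print coarse/fine unknown counts for a cubic grid hierarchy."""
    per_dim = grid_sizes(coarse_points, nlevels)
    coarse = per_dim[0] ** 3   # unknowns solved redundantly on every process
    fine = per_dim[-1] ** 3    # unknowns on the finest level
    print(f"{nlevels} levels from {coarse_points}^3: per-direction sizes {per_dim}")
    print(f"  coarse unknowns: {coarse}, fine unknowns: {fine}")
    print(f"  coarse/fine ratio: {coarse / fine:.4f}")
    print(f"  fine unknowns per process on {nprocs} processes: {fine // nprocs}")


summarize(20, 2, 2)   # the two-level run with a 20 by 20 by 20 coarse grid
summarize(5, 5, 2)    # assumed 5 by 5 by 5 coarse grid with -dmmg_nlevels 5
```

Under those assumptions the coarse grid drops from 8000 to 125 points while the finest grid grows, so the redundant LU becomes a negligible part of the work.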
> >
> >  You should get pretty good speedup for 2 processes but not much better
> > speedup for four processes because, as Matt noted, the computation is memory
> > bandwidth limited;
> > http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
> > Note also that this is running multigrid, which is a fast solver but doesn't
> > scale in parallel as well as many slow algorithms. For example, if you run
> > -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
> > processors but crummy overall speed.
> >
> >  Barry
> >
> >
> >
> > On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
> >
> >> Barry,
> >>
> >> Please find attached the patch for the minor change to control the
> >> number of elements from the command line for snes/ex20.c. I know that this
> >> can be achieved with -grid_x etc. from the command line, but thought this
> >> just made the typing for the refinement process a little easier. I
> >> apologize if there was any confusion.
> >>
> >> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
> >>
> >> Vijay
> >>
> >> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>>
> >>>  We need all the information from -log_summary to see what is going on.
> >>>
> >>>  Not sure what -grid 20 means, but don't expect good parallel
> >>> performance without at least 10,000 unknowns per process.
> >>>
> >>>   Barry
> >>>
> >>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
> >>>
> >>>> Here's the performance statistic on 1 and 2 processor runs.
> >>>>
> >>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
> >>>>
> >>>>                         Max       Max/Min        Avg      Total
> >>>> Time (sec):           8.452e+00      1.00000   8.452e+00
> >>>> Objects:              1.470e+02      1.00000   1.470e+02
> >>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
> >>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
> >>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> >>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> >>>> MPI Reductions:       4.440e+02      1.00000
> >>>>
> >>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
> >>>>
> >>>>                         Max       Max/Min        Avg      Total
> >>>> Time (sec):           7.851e+00      1.00000   7.851e+00
> >>>> Objects:              2.000e+02      1.00000   2.000e+02
> >>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
> >>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
> >>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
> >>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
> >>>> MPI Reductions:       1.046e+03      1.00000
> >>>>
> >>>> I am not entirely sure if I can make sense out of that statistic but
> >>>> if there is something more you need, please feel free to let me know.
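
One way to read the two summaries above is to turn them into a speedup, a parallel efficiency, and a ratio of total flops. The Python sketch below does only that arithmetic. The numbers are copied from the Time and Flops rows quoted above, and the variable names are just for illustration.

```python
# Wall-clock time (sec), "Max" column of the Time row in each run.
time_1proc = 8.452
time_2proc = 7.851

# Total flops, "Total" column of the Flops row in each run.
flops_1proc = 5.045e9
flops_2proc = 9.313e9

nprocs = 2
speedup = time_1proc / time_2proc          # > 1 means the 2-process run was faster
efficiency = speedup / nprocs              # 1.0 would be perfect scaling
flop_growth = flops_2proc / flops_1proc    # 1.0 would mean no extra work was done

print(f"speedup on {nprocs} processes: {speedup:.3f}")
print(f"parallel efficiency:        {efficiency:.3f}")
print(f"total flops, 2 vs 1 procs:  {flop_growth:.3f}x")
```

The total flop count nearly doubles going from one to two processes, which is consistent with a large share of the work being repeated on every process rather than divided between them.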
> >>>>
> >>>> Vijay
> >>>>
> >>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>
> >>>>>> Matt,
> >>>>>>
> >>>>>> The --with-debugging=1 option is certainly not meant for performance
> >>>>>> studies, but I didn't expect it to yield the same CPU time as a single
> >>>>>> processor for snes/ex20, i.e., my runs with 1 and 2 processors take
> >>>>>> approximately the same amount of time to compute the solution. But
> >>>>>> I am currently configuring without debugging symbols and shall let you
> >>>>>> know what that yields.
> >>>>>>
> >>>>>> On a similar note, is there something extra that needs to be done to
> >>>>>> make use of multi-core machines while using MPI? I am not sure if
> >>>>>> this is even related to PETSc; it could be an MPI configuration option
> >>>>>> that either I or the configure process is missing. All ideas are
> >>>>>> much appreciated.
> >>>>>
> >>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On most
> >>>>> cheap multicore machines, there is a single memory bus, and thus using more
> >>>>> cores gains you very little extra performance. I still suspect you are not
> >>>>> actually running in parallel, because you usually see a small speedup. That is
> >>>>> why I suggested looking at -log_summary, since it tells you how many
> >>>>> processes were run and breaks down the time.
> >>>>>    Matt
> >>>>>
> >>>>>>
> >>>>>> Vijay
> >>>>>>
> >>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am trying to configure my petsc install with an MPI installation to
> >>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But
> >>>>>>>> even though the configure/make process went through without problems,
> >>>>>>>> the scalability of the programs doesn't seem to reflect what I expected.
> >>>>>>>> My configure options are
> >>>>>>>>
> >>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
> >>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
> >>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
> >>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
> >>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
> >>>>>>>> --with-debugging=1 --with-errorchecking=yes
> >>>>>>>
> >>>>>>> 1) For performance studies, make a build using --with-debugging=0
> >>>>>>> 2) Look at -log_summary for a breakdown of performance
> >>>>>>>    Matt
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Is there something else that needs to be done as part of the configure
> >>>>>>>> process to enable decent scaling? I am only comparing programs with
> >>>>>>>> mpiexec (-n 1) and (-n 2), but they seem to be taking approximately the
> >>>>>>>> same time, as noted from -log_summary. If it helps, I've been testing
> >>>>>>>> with snes/examples/tutorials/ex20.c for all purposes, with a custom
> >>>>>>>> -grid parameter from the command line to control the number of unknowns.
> >>>>>>>>
> >>>>>>>> If there is something you've witnessed before in this configuration or
> >>>>>>>> if you need anything else to analyze the problem, do let me know.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Vijay
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> What most experimenters take for granted before they begin their experiments
> >>>>>>> is infinitely more interesting than any results to which their experiments
> >>>>>>> lead.
> >>>>>>> -- Norbert Wiener
> >>>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >> <ex20.patch><ex20_np1.out><ex20_np2.out>
> >
> >
>

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener