[petsc-users] Configuring petsc with MPI on ubuntu quad-core

Matthew Knepley knepley at gmail.com
Thu Feb 3 11:42:57 CST 2011


On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:

> Barry,
>
> Sorry about the delay in the reply. I did not have access to the
> system to test out what you said, until now.
>
> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>
> processes    time (sec)
> 1            114.2
> 2            89.45
> 4            81.01
>

1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from
this data.

2) Do you know the memory bandwidth characteristics of this machine? That is
crucial, and you cannot begin to understand speedup on it until you do. Please
look this up.

3) Worrying about specifics of the MPI implementation makes no sense until
the basics are nailed down.

   Matt
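
For reference, the three timings quoted above translate into speedup and parallel efficiency as follows. This is a minimal Python sketch that uses only the numbers reported in this message; the function and variable names are illustrative.

```python
# Wall-clock times (seconds) reported in the thread for ./ex20 with
# -pc_type jacobi -dmmg_nlevels 5, keyed by number of MPI processes.
times = {1: 114.2, 2: 89.45, 4: 81.01}


def speedup_and_efficiency(times):
    """Return {p: (speedup, efficiency)} relative to the 1-process run."""
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}


results = speedup_and_efficiency(times)
for p, (s, e) in results.items():
    print(f"{p} process(es): speedup {s:.2f}, efficiency {e:.0%}")
```

On these numbers the 2-process run is about 1.28 times faster than the serial run and the 4-process run about 1.41 times faster, which says nothing yet about where the time goes; only the full -log_summary can show that.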


> The scaleup doesn't seem to be optimal, even with two processors. I am
> wondering if the fault is in the MPI configuration itself. Are these
> results as you would expect ? I can also send you the log_summary for
> all cases if that will help.
>
> Vijay
>
> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
> >
> >> Barry,
> >>
> >> I understand what you are saying, but which example/options then is the
> >> best one to compute the scalability on a multi-core machine? I chose
> >> the nonlinear diffusion problem specifically because its inherent
> >> stiffness could probably provide noticeable scalability on a
> >> multi-core system. From your experience, do you think there is another
> >> example program that will demonstrate this much more rigorously or
> >> clearly? Btw, I don't get good speedup even for 2 processes with
> >> ex20.c, and that was the original motivation for this thread.
> >
> >   Did you follow my instructions?
> >
> >   Barry
> >
> >>
> >> Satish, I configured with --download-mpich now without the
> >> mpich-device. The results are given above. I will try with the options
> >> you provided, although I don't entirely understand what they mean, which
> >> kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
> >>
> >> Vijay
> >>
> >> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>>
> >>>   Ok, everything makes sense. Looks like you are using two level
> >>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
> >>> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
> >>> problem redundantly on each process (each process is performing the entire
> >>> coarse grid solve using LU factorization). The time for the factorization
> >>> is (in the two process case)
> >>>
> >>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
> >>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
> >>>
> >>> which is 74 percent of the total solve time (and 82 percent of the
> >>> flops). When three quarters of the entire run is not parallel at all, you
> >>> cannot expect much speedup. If you run with -snes_view it will display
> >>> exactly the solver being used. You cannot expect to understand the
> >>> performance if you don't understand what the solver is actually doing.
> >>> Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code
> >>> spends most of the time there; stick with something like 5 by 5 by 5.
> >>>
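The effect of a non-parallel portion can be estimated with a back-of-the-envelope bound. The sketch below assumes that a fraction f of the p-process run time is work that does not shrink as processes are added, and that the remainder scales perfectly; both are idealizations, and the function name is illustrative.

```python
def max_speedup(f_nonparallel, p):
    """Upper bound on T1/Tp when a fraction f_nonparallel of the p-process
    run time does not shrink with p and the rest scales ideally.

    T1 = f*Tp + p*(1 - f)*Tp, so T1/Tp = f + p*(1 - f).
    """
    if not 0.0 <= f_nonparallel <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return f_nonparallel + p * (1.0 - f_nonparallel)


# 74 percent of the two-process solve is the redundant LU factorization.
bound = max_speedup(0.74, 2)
print(f"best possible speedup on 2 processes: {bound:.2f}")
```

Under these assumptions the solve stage could not run more than about 1.26 times faster on two processes, however well MPI is configured.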
> >>>  I suggest running with the default grid and -dmmg_nlevels 5; the coarse
> >>> solve will then be a trivial percentage of the run time.
> >>>
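To see why more levels shrink the coarse problem's share, it helps to list the grid size on each level. This sketch assumes a 5 by 5 by 5 coarse grid and that each refinement maps n points per direction to 2n - 1 (refinement by two on a non-periodic grid); neither assumption is confirmed for ex20 in this thread.

```python
def level_sizes(coarse_n, nlevels):
    """Points per direction on each level, coarsest first, assuming
    each refinement maps n points to 2*n - 1."""
    sizes = [coarse_n]
    for _ in range(nlevels - 1):
        sizes.append(2 * sizes[-1] - 1)
    return sizes


sizes = level_sizes(5, 5)
for level, n in enumerate(sizes):
    print(f"level {level}: {n}^3 = {n**3} grid points")
```

Under these assumptions the coarse grid holds 125 points against 274,625 on the finest level, so a redundant LU solve on the coarse grid costs very little compared with the rest of the work.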
> >>>  You should get pretty good speedup for 2 processes but not much
> >>> better speedup for four processes because, as Matt noted, the computation
> >>> is memory bandwidth limited; see
> >>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
> >>> Note also that this is running multigrid, which is a fast solver but
> >>> doesn't scale in parallel as well as many slow algorithms. For example, if
> >>> you run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
> >>> processors but crummy overall speed.
> >>>
> >>>  Barry
> >>>
> >>>
> >>>
> >>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
> >>>
> >>>> Barry,
> >>>>
> >>>> Please find attached the patch for the minor change to control the
> >>>> number of elements from command line for snes/ex20.c. I know that this
> >>>> can be achieved with -grid_x etc from command_line but thought this
> >>>> just made the typing for the refinement process a little easier. I
> >>>> apologize if there was any confusion.
> >>>>
> >>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
> >>>>
> >>>> Vijay
> >>>>
> >>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>>>>
> >>>>>  We need all the information from -log_summary to see what is going on.
> >>>>>
> >>>>>  Not sure what -grid 20 means, but don't expect any good parallel
> >>>>> performance unless you have at least 10,000 unknowns per process.
> >>>>>
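A quick check of that rule of thumb, assuming that -grid 20 means a 20 by 20 by 20 grid, that there is one unknown per grid point (the number of degrees of freedom per point in ex20 is not stated here), and that the grid is split evenly across processes:

```python
THRESHOLD = 10_000  # rule-of-thumb minimum number of unknowns per process


def unknowns_per_process(n, nprocs, dof=1):
    """Unknowns each process owns for an n x n x n grid split evenly.

    dof (unknowns per grid point) is an assumed parameter.
    """
    return n**3 * dof / nprocs


for nprocs in (1, 2, 4):
    per_proc = unknowns_per_process(20, nprocs)
    verdict = "ok" if per_proc >= THRESHOLD else "too small"
    print(f"{nprocs} process(es): {per_proc:.0f} unknowns each ({verdict})")
```

If -grid 20 only sets the coarse grid of a multigrid hierarchy, the fine grid is larger and the counts change accordingly.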
> >>>>>   Barry
> >>>>>
> >>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
> >>>>>
> >>>>>> Here's the performance statistic on 1 and 2 processor runs.
> >>>>>>
> >>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
> >>>>>>
> >>>>>>                         Max       Max/Min        Avg      Total
> >>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
> >>>>>> Objects:              1.470e+02      1.00000   1.470e+02
> >>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
> >>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
> >>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> >>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> >>>>>> MPI Reductions:       4.440e+02      1.00000
> >>>>>>
> >>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
> >>>>>>
> >>>>>>                         Max       Max/Min        Avg      Total
> >>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
> >>>>>> Objects:              2.000e+02      1.00000   2.000e+02
> >>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
> >>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
> >>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
> >>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
> >>>>>> MPI Reductions:       1.046e+03      1.00000
> >>>>>>
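One way to read the two summaries side by side is to compare ratios. The sketch below uses only the totals printed above; the dictionary keys are illustrative names.

```python
# Totals copied from the two -log_summary headers above.
run1 = {"time": 8.452, "flops": 5.045e9, "flops_per_sec": 5.969e8}
run2 = {"time": 7.851, "flops": 9.313e9, "flops_per_sec": 1.186e9}

ratios = {
    "speedup": run1["time"] / run2["time"],
    "total_flops": run2["flops"] / run1["flops"],
    "flop_rate": run2["flops_per_sec"] / run1["flops_per_sec"],
}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

The aggregate flop rate nearly doubles (about 1.99 times), but the total work also grows (about 1.85 times), so the speedup is only about 1.08. Growth in total flops of this size suggests that much of the work is being repeated on both processes.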
> >>>>>> I am not entirely sure if I can make sense of those statistics, but
> >>>>>> if there is something more you need, please feel free to let me know.
> >>>>>>
> >>>>>> Vijay
> >>>>>>
> >>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Matt,
> >>>>>>>>
> >>>>>>>> The --with-debugging=1 option is certainly not meant for performance
> >>>>>>>> studies, but I didn't expect it to yield the same CPU time as a single
> >>>>>>>> processor for snes/ex20, i.e., my runs with 1 and 2 processors take
> >>>>>>>> approximately the same amount of time to compute the solution. But
> >>>>>>>> I am currently configuring without debugging symbols and shall let you
> >>>>>>>> know what that yields.
> >>>>>>>>
> >>>>>>>> On a similar note, is there something extra that needs to be done to
> >>>>>>>> make use of multi-core machines while using MPI? I am not sure if
> >>>>>>>> this is even related to PETSc; it could be an MPI configuration option
> >>>>>>>> that either I or the configure process is missing. All ideas are
> >>>>>>>> much appreciated.
> >>>>>>>
> >>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On most
> >>>>>>> cheap multicore machines, there is a single memory bus, and thus using
> >>>>>>> more cores gains you very little extra performance. I still suspect you
> >>>>>>> are not actually running in parallel, because you usually see a small
> >>>>>>> speedup. That is why I suggested looking at -log_summary, since it tells
> >>>>>>> you how many processes were run and breaks down the time.
> >>>>>>>    Matt
> >>>>>>>
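A rough way to quantify this: in a compressed sparse row matrix-vector product, each stored nonzero costs 2 flops (a multiply and an add) and requires reading at least an 8-byte value and a 4-byte column index, assuming double-precision values and 32-bit indices. The sketch below turns a memory bandwidth figure into a flop-rate ceiling. The 10 GB/s figure is a placeholder, not a measurement of this machine, and vector traffic is ignored.

```python
def matvec_flop_ceiling(bandwidth_bytes_per_sec,
                        bytes_per_nonzero=12.0,
                        flops_per_nonzero=2.0):
    """Flop/s ceiling for a bandwidth-bound sparse matrix-vector product."""
    return bandwidth_bytes_per_sec * flops_per_nonzero / bytes_per_nonzero


assumed_bandwidth = 10e9  # bytes/s, hypothetical value for illustration
ceiling = matvec_flop_ceiling(assumed_bandwidth)
print(f"ceiling: {ceiling / 1e9:.2f} Gflop/s, shared by all cores on the bus")
```

Because the ceiling depends on the bandwidth of the shared bus rather than on the core count, adding cores beyond the point where the bus is saturated does not raise it.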
> >>>>>>>>
> >>>>>>>> Vijay
> >>>>>>>>
> >>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am trying to configure my petsc install with an MPI installation to
> >>>>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But
> >>>>>>>>>> even though the configure/make process went through without problems,
> >>>>>>>>>> the scalability of the programs doesn't seem to reflect what I expected.
> >>>>>>>>>> My configure options are
> >>>>>>>>>>
> >>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
> >>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
> >>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
> >>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
> >>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
> >>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
> >>>>>>>>>
> >>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
> >>>>>>>>> 2) Look at -log_summary for a breakdown of performance
> >>>>>>>>>    Matt
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Is there something else that needs to be done as part of the configure
> >>>>>>>>>> process to enable decent scaling? I am only comparing programs with
> >>>>>>>>>> mpiexec (-n 1) and (-n 2), but they seem to be taking approximately the
> >>>>>>>>>> same time as noted from -log_summary. If it helps, I've been testing
> >>>>>>>>>> with snes/examples/tutorials/ex20.c for all purposes with a custom
> >>>>>>>>>> -grid parameter from the command line to control the number of unknowns.
> >>>>>>>>>>
> >>>>>>>>>> If there is something you've witnessed before in this configuration or
> >>>>>>>>>> if you need anything else to analyze the problem, do let me know.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Vijay
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> What most experimenters take for granted before they begin their
> >>>>>>>>> experiments is infinitely more interesting than any results to which
> >>>>>>>>> their experiments lead.
> >>>>>>>>> -- Norbert Wiener
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
> >>>
> >>>
> >
> >
>



-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener