[petsc-users] Configuring petsc with MPI on ubuntu quad-core

Vijay S. Mahadevan vijay.m at gmail.com
Thu Feb 3 12:05:15 CST 2011


Matt,

I apologize for the incomplete information. Find attached the
log_summary for all the cases.

The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a
2x2GB/2x4GB configuration. I do not know how to work out the memory
bandwidth from this information, but if you need anything more, do let
me know.
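
In the meantime, I can try to get a rough measured number with a simple
triad-style loop (just a quick stand-in for the real STREAM benchmark;
the array size, file name, and build flags below are only my guesses):

/* triad.c: rough memory-bandwidth estimate (not the official STREAM
   benchmark). Build with something like:
     gcc -O3 -std=gnu99 -o triad triad.c -lrt            */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000           /* three double arrays, ~480 MB total */
#define NTRIES 10

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double best = 1e30;
  if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
  for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
  for (int t = 0; t < NTRIES; t++) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];    /* triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    if (sec < best) best = sec;          /* keep the fastest repetition */
  }
  /* each iteration moves three doubles: two loads plus one store */
  printf("a[0] = %g, approx. bandwidth = %.1f GB/s\n",
         a[0], 3.0 * N * sizeof(double) / best / 1e9);
  free(a); free(b); free(c);
  return 0;
}

For reference, if both memory channels are actually populated, dual-channel
DDR3-1333 has a theoretical peak of about 2 x 1333 MT/s x 8 bytes, or roughly
21 GB/s, which is what any measured number should be compared against.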

Vijay

On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com>
> wrote:
>>
>> Barry,
>>
>> Sorry about the delay in the reply. I did not have access to the
>> system to test out what you said, until now.
>>
>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>
>> processes    time
>> 1            114.2
>> 2             89.45
>> 4             81.01
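>>
>> (Computing speedup as T_1/T_p, that comes to roughly 1.28x on 2
>> processes, about 64% parallel efficiency, and 1.41x on 4 processes,
>> about 35%, if I am reading my own numbers correctly.)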
>
> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from
> this data.
> 2) Do you know the memory bandwidth characteristics of this machine? That is
> crucial and
>     you cannot begin to understand speedup on it until you do. Please look
> this up.
> 3) Worrying about specifics of the MPI implementation makes no sense until
> the basics are nailed down.
>    Matt
>
>>
>> The scaling doesn't seem to be optimal, even with two processes. I am
>> wondering if the fault is in the MPI configuration itself. Are these
>> results what you would expect? I can also send you the log_summary for
>> all the cases if that will help.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> >
>> > On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>> >
>> >> Barry,
>> >>
>> >> I understand what you are saying, but which example and options are
>> >> then best for measuring scalability on a multi-core machine? I chose
>> >> the nonlinear diffusion problem specifically because of its inherent
>> >> stiffness, which I thought could provide noticeable scalability on a
>> >> multi-core system. From your experience, is there another example
>> >> program that would demonstrate this more rigorously or clearly? By
>> >> the way, I don't get good speedup even for 2 processes with ex20.c,
>> >> and that was the original motivation for this thread.
>> >
>> >   Did you follow my instructions?
>> >
>> >   Barry
>> >
>> >>
>> >> Satish, I configured with --download-mpich now, without the
>> >> mpich-device. The results are given above. I will try with the options
>> >> you provided, although I don't entirely understand what they mean,
>> >> which kinda bugs me. Also, is OpenMPI the preferred implementation on
>> >> Ubuntu?
>> >>
>> >> Vijay
>> >>
>> >> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> >>>
>> >>>   Ok, everything makes sense. It looks like you are using two-level
>> >>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>> >>> -mg_coarse_redundant_pc_type lu. This means the coarse-grid problem is
>> >>> solved redundantly on each process (each process performs the entire
>> >>> coarse-grid solve using LU factorization). The time for the factorization
>> >>> (in the two-process case) is
>> >>>
>> >>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00
>> >>> 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>> >>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00
>> >>> 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>> >>>
>> >>> which is 74 percent of the total solve time (and 82 percent of the
>> >>> flops).  When 3/4 of the entire run is not parallel at all, you cannot
>> >>> expect much speedup.  If you run with -snes_view it will display exactly
>> >>> the solver being used. You cannot expect to understand the performance if
>> >>> you don't understand what the solver is actually doing. Using a 20 by 20
>> >>> by 20 coarse grid is generally a bad idea since the code spends most of
>> >>> its time there; stick with something like 5 by 5 by 5.
>> >>>
>> >>>  I suggest running with the default grid and -dmmg_nlevels 5; then the
>> >>> time in the coarse solve will be a trivial percentage of the run time.
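>> >>> Something like
>> >>>    mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
>> >>> should then show the coarse LU as only a tiny fraction of the total
>> >>> time.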
>> >>>
>> >>>  You should get pretty good speedup for 2 processes but not much better
>> >>> speedup for four processes because, as Matt noted, the computation is
>> >>> memory bandwidth limited;
>> >>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note
>> >>> also that this is running multigrid, which is a fast solver but doesn't
>> >>> scale in parallel as well as many slower algorithms. For example, if you
>> >>> run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
>> >>> processes but crummy overall speed.
>> >>>
>> >>>  Barry
>> >>>
>> >>>
>> >>>
>> >>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>> >>>
>> >>>> Barry,
>> >>>>
>> >>>> Please find attached the patch with the minor change to control the
>> >>>> number of elements from the command line for snes/ex20.c. I know this
>> >>>> can be achieved with -grid_x etc. from the command line, but I thought
>> >>>> it made the typing during the refinement process a little easier. I
>> >>>> apologize if there was any confusion.
>> >>>>
>> >>>> Also, find attached the full log summaries for -np=1 and -np=2.
>> >>>> Thanks.
>> >>>>
>> >>>> Vijay
>> >>>>
>> >>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov>
>> >>>> wrote:
>> >>>>>
>> >>>>>  We need all the information from -log_summary to see what is going
>> >>>>> on.
>> >>>>>
>> >>>>>  Not sure what -grid 20 means, but don't expect any good parallel
>> >>>>> performance with fewer than roughly 10,000 unknowns per process.
>> >>>>>
>> >>>>>   Barry
>> >>>>>
>> >>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>> >>>>>
>> >>>>>> Here's the performance statistic on 1 and 2 processor runs.
>> >>>>>>
>> >>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20
>> >>>>>> -log_summary
>> >>>>>>
>> >>>>>>                         Max       Max/Min        Avg      Total
>> >>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>> >>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>> >>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>> >>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>> >>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>> >>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>> >>>>>> MPI Reductions:       4.440e+02      1.00000
>> >>>>>>
>> >>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20
>> >>>>>> -log_summary
>> >>>>>>
>> >>>>>>                         Max       Max/Min        Avg      Total
>> >>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>> >>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>> >>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>> >>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>> >>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>> >>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>> >>>>>> MPI Reductions:       1.046e+03      1.00000
>> >>>>>>
>> >>>>>> I am not entirely sure I can make sense of those statistics, but
>> >>>>>> if there is something more you need, please feel free to let me
>> >>>>>> know.
>> >>>>>>
>> >>>>>> Vijay
>> >>>>>>
>> >>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com>
>> >>>>>> wrote:
>> >>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan
>> >>>>>>> <vijay.m at gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> Matt,
>> >>>>>>>>
>> >>>>>>>> The --with-debugging=1 option is certainly not meant for
>> >>>>>>>> performance studies, but I didn't expect it to yield the same CPU
>> >>>>>>>> time as a single processor for snes/ex20; i.e., my runs with 1 and
>> >>>>>>>> 2 processors take approximately the same amount of time to compute
>> >>>>>>>> the solution. But I am currently configuring without debugging
>> >>>>>>>> symbols and shall let you know what that yields.
>> >>>>>>>>
>> >>>>>>>> On a similar note, is there something extra that needs to be done
>> >>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>> >>>>>>>> if this is even related to PETSc, but it could be an MPI
>> >>>>>>>> configuration option that either I or the configure process is
>> >>>>>>>> missing. All ideas are much appreciated.
>> >>>>>>>
>> >>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation.
>> >>>>>>> On most
>> >>>>>>> cheap multicore machines, there is a single memory bus, and thus
>> >>>>>>> using more
>> >>>>>>> cores gains you very little extra performance. I still suspect you
>> >>>>>>> are not
>> >>>>>>> actually
>> >>>>>>> running in parallel, because even then you usually see a small speedup. That
>> >>>>>>> is why I
>> >>>>>>> suggested looking at -log_summary since it tells you how many
>> >>>>>>> processes were
>> >>>>>>> run and breaks down the time.
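>> >>>>>>> (For the MatMult point, a rough rule of thumb: a double-precision
>> >>>>>>> AIJ/CSR matvec moves about 12 bytes per nonzero -- an 8-byte value
>> >>>>>>> plus a 4-byte column index -- for 2 flops, so even 20 GB/s of memory
>> >>>>>>> bandwidth caps you at a few GFlop/s no matter how many cores share
>> >>>>>>> the bus.)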
>> >>>>>>>    Matt
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> Vijay
>> >>>>>>>>
>> >>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley
>> >>>>>>>> <knepley at gmail.com> wrote:
>> >>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan
>> >>>>>>>>> <vijay.m at gmail.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I am trying to configure my PETSc install with an MPI
>> >>>>>>>>>> installation to make use of a dual quad-core desktop system
>> >>>>>>>>>> running Ubuntu. But even though the configure/make process went
>> >>>>>>>>>> through without problems, the scalability of the programs doesn't
>> >>>>>>>>>> seem to reflect what I expected.
>> >>>>>>>>>> My configure options are
>> >>>>>>>>>>
>> >>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/
>> >>>>>>>>>> --download-mpich=1
>> >>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>> >>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1
>> >>>>>>>>>> --download-hypre=1
>> >>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>> >>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>> >>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>> >>>>>>>>>
>> >>>>>>>>> 1) For performance studies, make a build using
>> >>>>>>>>> --with-debugging=0
>> >>>>>>>>> 2) Look at -log_summary for a breakdown of performance
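>> >>>>>>>>> For 1), that means rerunning configure with something like
>> >>>>>>>>>   --with-debugging=0 COPTFLAGS='-O3' CXXOPTFLAGS='-O3'
>> >>>>>>>>> in place of --with-debugging=1 and --COPTFLAGS=-g, keeping the rest
>> >>>>>>>>> of your options.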
>> >>>>>>>>>    Matt
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Is there something else that needs to be done as part of the
>> >>>>>>>>>> configure process to enable decent scaling? I am only comparing
>> >>>>>>>>>> runs with mpiexec -n 1 and -n 2, but they seem to take
>> >>>>>>>>>> approximately the same time as reported by -log_summary. If it
>> >>>>>>>>>> helps, I've been testing with snes/examples/tutorials/ex20.c for
>> >>>>>>>>>> all purposes, with a custom -grid parameter from the command line
>> >>>>>>>>>> to control the number of unknowns.
>> >>>>>>>>>>
>> >>>>>>>>>> If there is something you've witnessed before in this
>> >>>>>>>>>> configuration or
>> >>>>>>>>>> if you need anything else to analyze the problem, do let me
>> >>>>>>>>>> know.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Vijay
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> What most experimenters take for granted before they begin their
>> >>>>>>>>> experiments
>> >>>>>>>>> is infinitely more interesting than any results to which their
>> >>>>>>>>> experiments
>> >>>>>>>>> lead.
>> >>>>>>>>> -- Norbert Wiener
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> What most experimenters take for granted before they begin their
>> >>>>>>> experiments
>> >>>>>>> is infinitely more interesting than any results to which their
>> >>>>>>> experiments
>> >>>>>>> lead.
>> >>>>>>> -- Norbert Wiener
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>> >>>
>> >>>
>> >
>> >
>
>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex20_np1.out
Type: application/octet-stream
Size: 12365 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20110203/5f8c5e2d/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex20_np2.out
Type: application/octet-stream
Size: 13469 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20110203/5f8c5e2d/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex20_np4.out
Type: application/octet-stream
Size: 14749 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20110203/5f8c5e2d/attachment-0005.obj>

