[petsc-users] Configuring petsc with MPI on ubuntu quad-core
Barry Smith
bsmith at mcs.anl.gov
Wed Feb 2 18:35:09 CST 2011
Ok, everything makes sense. It looks like you are using two-level multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant -mg_coarse_redundant_pc_type lu. This means the coarse grid problem is solved redundantly on each process (each process does the entire coarse grid solve with LU factorization). The time for the factorization is (in the two-process case)
MatLUFactorNum 14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41 0 0 0 74 82 0 0 0 1307
MatILUFactorSym 7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00 0 0 0 0 1 0 0 0 0 2 0
which is 74 percent of the total solve time (and 82 percent of the flops). When three-quarters of the entire run is not parallel at all you cannot expect much speedup. If you run with -snes_view it will display exactly the solver being used; you cannot expect to understand the performance if you don't understand what the solver is actually doing. Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code spends most of its time there; stick with something like 5 by 5 by 5.
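As a rough illustration (treating the redundant coarse solve as effectively serial, which is what the 74 percent above amounts to), Amdahl's law caps the two-process speedup at about 1/(0.74 + 0.26/2), i.e. roughly 1.15x, no matter how well the rest of the code scales.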
I suggest running with the default grid and -dmmg_nlevels 5; then the coarse solve will be a trivial percentage of the run time.
You should get pretty good speedup for 2 processes, but not much better speedup for four processes because, as Matt noted, the computation is memory bandwidth limited; see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note also that this is running multigrid, which is a fast solver but does not scale in parallel as well as many slow algorithms. For example, if you run with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2 processors but crummy overall speed.
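For example (a sketch using the same mpiexec and executable as your earlier runs):

  /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
  /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary

The first run shows exactly which solver is being used and where the time goes; the second is the Jacobi comparison described above.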
Barry
On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
> Barry,
>
> Please find attached the patch for the minor change to control the
> number of elements from the command line for snes/ex20.c. I know this
> can be achieved with -grid_x etc. from the command line, but I thought it
> made the typing for the refinement process a little easier. I
> apologize if there was any confusion.
>
> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>
> Vijay
>
> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> We need all the information from -log_summary to see what is going on.
>>
>> Not sure what -grid 20 means, but don't expect any good parallel performance with fewer than roughly 10,000 unknowns per process.
>>
>> Barry
>>
>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>
>>> Here are the performance statistics for the 1 and 2 processor runs.
>>>
>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>
>>>                          Max       Max/Min        Avg       Total
>>> Time (sec):           8.452e+00     1.00000    8.452e+00
>>> Objects:              1.470e+02     1.00000    1.470e+02
>>> Flops:                5.045e+09     1.00000    5.045e+09   5.045e+09
>>> Flops/sec:            5.969e+08     1.00000    5.969e+08   5.969e+08
>>> MPI Messages:         0.000e+00     0.00000    0.000e+00   0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.00000    0.000e+00   0.000e+00
>>> MPI Reductions:       4.440e+02     1.00000
>>>
>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>
>>>                          Max       Max/Min        Avg       Total
>>> Time (sec):           7.851e+00     1.00000    7.851e+00
>>> Objects:              2.000e+02     1.00000    2.000e+02
>>> Flops:                4.670e+09     1.00580    4.657e+09   9.313e+09
>>> Flops/sec:            5.948e+08     1.00580    5.931e+08   1.186e+09
>>> MPI Messages:         7.965e+02     1.00000    7.965e+02   1.593e+03
>>> MPI Message Lengths:  1.412e+07     1.00000    1.773e+04   2.824e+07
>>> MPI Reductions:       1.046e+03     1.00000
>>>
>>> I am not entirely sure I can make sense of those statistics, but
>>> if there is something more you need, please feel free to let me know.
>>>
>>> Vijay
>>>
>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com>
>>>> wrote:
>>>>>
>>>>> Matt,
>>>>>
>>>>> The --with-debugging=1 option is certainly not meant for performance
>>>>> studies, but I didn't expect it to yield the same cpu time as a single
>>>>> processor for snes/ex20, i.e., my runs with 1 and 2 processors take
>>>>> approximately the same amount of time to compute the solution. I
>>>>> am currently configuring without debugging symbols and will let you
>>>>> know what that yields.
>>>>>
>>>>> On a similar note, is there something extra that needs to be done to
>>>>> make use of multi-core machines while using MPI? I am not sure if
>>>>> this is even related to PETSc, but it could be an MPI configuration
>>>>> option that either I or the configure process is missing. All ideas
>>>>> are much appreciated.
>>>>
>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On most
>>>> cheap multicore machines there is a single memory bus, so using more
>>>> cores gains you very little extra performance. I still suspect you are not
>>>> actually running in parallel, because even then you usually see a small
>>>> speedup. That is why I suggested looking at -log_summary: it tells you how
>>>> many processes were run and breaks down where the time goes.
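>>>>
>>>> As a quick sanity check (a sketch based on what -log_summary reports): a genuinely
>>>> parallel run lists the number of processes in its header and shows nonzero MPI
>>>> Messages and MPI Message Lengths, while a serial run shows zeros there. For example:
>>>>
>>>>   /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -log_summary
>>>>
>>>> If that still reports zero MPI messages, you are really launching independent serial runs.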
>>>> Matt
>>>>
>>>>>
>>>>> Vijay
>>>>>
>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to configure my PETSc install with an MPI installation to
>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But even
>>>>>>> though the configure/make process went through without problems, the
>>>>>>> scalability of the programs doesn't seem to reflect what I expected.
>>>>>>> My configure options are
>>>>>>>
>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>
>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>> 2) Look at -log_summary for a breakdown of performance
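>>>>>>
>>>>>> For instance (a sketch adapting your own configure options above; keep whatever
>>>>>> external packages you need):
>>>>>>
>>>>>>   --with-debugging=0 --download-mpich=1 --download-f-blas-lapack=1 --with-clanguage=C++
>>>>>>
>>>>>> i.e. the same configure line as before, but with --with-debugging=0 and without
>>>>>> --COPTFLAGS=-g, so that the optimized default compiler flags are used.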
>>>>>> Matt
>>>>>>
>>>>>>>
>>>>>>> Is there something else that needs to be done as part of the configure
>>>>>>> process to enable decent scaling? I am only comparing programs with
>>>>>>> mpiexec (-n 1) and (-n 2), but they seem to take approximately the
>>>>>>> same time, as noted from -log_summary. If it helps, I've been testing
>>>>>>> with snes/examples/tutorials/ex20.c for all purposes, with a custom
>>>>>>> -grid parameter from the command line to control the number of unknowns.
>>>>>>>
>>>>>>> If this is something you've seen before with this kind of configuration,
>>>>>>> or if you need anything else to analyze the problem, do let me know.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vijay
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments
>>>>>> is infinitely more interesting than any results to which their
>>>>>> experiments
>>>>>> lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their experiments
>>>> is infinitely more interesting than any results to which their experiments
>>>> lead.
>>>> -- Norbert Wiener
>>>>
>>
>>
> <ex20.patch><ex20_np1.out><ex20_np2.out>