[petsc-users] Configuring petsc with MPI on ubuntu quad-core

Thu Feb 3 16:00:22 CST 2011

   Hmm, just running the basic version with mpiexec -n 2 isn't that useful, because nothing ensures that the two processes are running at exactly the same time.

   I've attached a new version of BasicVersion.c that attempts to synchronize the operations in the two processes using MPI_Barrier(); it is probably not a great way to do it, but better than nothing. Please try that one.

   [Attachment scrubbed by the list archive: BasicVersion.c, application/octet-stream, 5948 bytes]
   URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20110203/6717f938/attachment.obj>

    Thanks

   Barry

On Feb 3, 2011, at 1:41 PM, Vijay S. Mahadevan wrote:

> Barry,
> 
> Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
> for one and two processes and the output is as follows:
> 
> -n 1
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 2000000, Offset = 0
> Total memory required = 45.8 MB.
> Each test is run 50 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 2529 microseconds.
>   (= 2529 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:       10161.8510       0.0032       0.0031       0.0037
> Scale:       9843.6177       0.0034       0.0033       0.0038
> Add:        10656.7114       0.0046       0.0045       0.0053
> Triad:      10799.0448       0.0046       0.0044       0.0054
> 
> -n 2
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 2000000, Offset = 0
> Total memory required = 45.8 MB.
> Each test is run 50 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 4320 microseconds.
>   (= 4320 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:        5739.9704       0.0058       0.0056       0.0063
> Scale:       5839.3617       0.0058       0.0055       0.0062
> Add:         6116.9323       0.0081       0.0078       0.0085
> Triad:       6021.0722       0.0084       0.0080       0.0088
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 2000000, Offset = 0
> Total memory required = 45.8 MB.
> Each test is run 50 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 2954 microseconds.
>   (= 2954 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:        6091.9448       0.0056       0.0053       0.0061
> Scale:       5501.1775       0.0060       0.0058       0.0062
> Add:         5960.4640       0.0084       0.0081       0.0087
> Triad:       5936.2109       0.0083       0.0081       0.0089
> 
> I do not have OpenMP installed, so I am not sure whether that is what
> you wanted when you said two threads. I also closed most of the
> applications that were open before running these tests, so the numbers
> should hopefully be accurate.
> 
> Vijay
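One way to read the two runs above is to add up the per-process Triad rates for -n 2 and compare the total with the -n 1 rate. The sketch below does only that arithmetic, with the numbers copied from the output. The helper name `aggregate_rate` is made up for illustration, and summing the rates assumes both processes were measuring at the same moment, which the unsynchronized benchmark does not guarantee.

```python
# Triad rates in MB/s, copied from the STREAM output above.
triad_one_process = 10799.0448
triad_two_processes = [6021.0722, 5936.2109]


def aggregate_rate(per_process_rates):
    """Sum per-process rates (hypothetical helper; assumes the runs overlapped in time)."""
    return sum(per_process_rates)


total = aggregate_rate(triad_two_processes)
ratio = total / triad_one_process

print(f"aggregate Triad rate with 2 processes: {total:.1f} MB/s")
print(f"ratio to the 1-process rate: {ratio:.2f}")
```

Under that assumption, the two processes together move only about 11 percent more data per second than one process alone.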
> 
> 
> On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>>  Vijay
>> 
>>   Let's just look at a single embarrassingly parallel computation in the run. This computation has NO communication and uses NO MPI and NO synchronization between processes.
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>>  1 process
>> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>> 
>>  2 processes
>> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>> 
>>   The speedup is 1.7074e+01/1.3861e+01 = 2443./1983 = 1.23, which is terrible! Now why would it be so bad? (Remember, you cannot blame MPI.)
>> 
>> 1) other processes are running on the machine sucking up memory bandwidth. Make sure no other compute tasks are running during this time.
>> 
>> 2) the single process run is able to use almost all of the hardware memory bandwidth, so introducing the second process cannot increase the performance much. This means this machine is terrible for parallelization of sparse iterative solvers.
>> 
>> 3) the machine is somehow misconfigured (beats me how) so that while the one process job doesn't use more than half of the memory bandwidth, when two processes are run the second process cannot utilize all that additional memory bandwidth.
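The speedup quoted for VecMAXPY can be checked two ways, from the times and from the Mflop/s column. The following is a minimal sketch using the values copied from the two log rows; the helper `speedup` and the variable names are illustrative and not part of PETSc.

```python
# Values copied from the VecMAXPY rows of the -log_summary output above.
time_1proc, time_2proc = 1.7074e+01, 1.3861e+01   # seconds
mflops_1proc, mflops_2proc = 1983.0, 2443.0       # total Mflop/s


def speedup(baseline, parallel):
    """Ratio of a baseline time to a parallel time (illustrative helper)."""
    return baseline / parallel


from_time = speedup(time_1proc, time_2proc)
from_rate = mflops_2proc / mflops_1proc
efficiency = from_time / 2  # fraction of the ideal 2x speedup actually achieved

print(f"speedup from times:   {from_time:.2f}")
print(f"speedup from Mflop/s: {from_rate:.2f}")
print(f"parallel efficiency:  {efficiency:.2f}")
```

Both routes give about 1.23, so the second process contributes well under a quarter of what an ideal doubling would.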
>> 
>>  In src/benchmarks/streams you can run make test and have it generate a report of how the streams benchmark is able to utilize the memory bandwidth. Run that and send us the output (run with just 2 threads).
>> 
>>   Barry
>> 
>> 
>> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>> 
>>> Matt,
>>> 
>>> I apologize for the incomplete information. Find attached the
>>> log_summary for all the cases.
>>> 
>>> The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a
>>> 2x2GB/2x4GB configuration. I do not know how to work out the memory
>>> bandwidth from this information, but if you need anything more, do let
>>> me know.
>>> 
>>> Vijay
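A rough upper bound on memory bandwidth can be estimated from the memory speed alone. The sketch below assumes that "1333MHz" means DDR3-1333, i.e. about 1333 million transfers per second over a 64-bit channel. The number of populated channels is not given in the thread, so it is left as a parameter, and the function name is hypothetical. This is a theoretical peak; the sustained rate a benchmark such as STREAM reports is normally lower.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits=64, channels=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    All arguments are assumptions about the hardware, not values from the thread,
    apart from the 1333 figure passed in below.
    """
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


for channels in (1, 2, 3):
    peak = peak_bandwidth_gb_s(1333, channels=channels)
    print(f"{channels} channel(s): {peak:.1f} GB/s theoretical peak")
```

Whether the mixed 2x2GB/2x4GB layout actually runs in a multi-channel mode would have to be checked on the machine itself.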
>>> 
>>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
>>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com>
>>>> wrote:
>>>>> 
>>>>> Barry,
>>>>> 
>>>>> Sorry about the delay in the reply. I did not have access to the
>>>>> system to test out what you said, until now.
>>>>> 
>>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>> 
>>>>> processor       time
>>>>> 1                      114.2
>>>>> 2                      89.45
>>>>> 4                      81.01
>>>> 
>>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from
>>>>    this data.
>>>> 2) Do you know the memory bandwidth characteristics of this machine? That is
>>>>    crucial, and you cannot begin to understand speedup on it until you do.
>>>>    Please look this up.
>>>> 3) Worrying about specifics of the MPI implementation makes no sense until
>>>>    the basics are nailed down.
>>>>    Matt
>>>> 
>>>>> 
>>>>> The scale-up doesn't seem to be optimal, even with two processors. I am
>>>>> wondering if the fault is in the MPI configuration itself. Are these
>>>>> results what you would expect? I can also send you the log_summary for
>>>>> all cases if that will help.
>>>>> 
>>>>> Vijay
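The three wall-clock times reported above for 1, 2 and 4 processes can be turned into speedup, efficiency, and an effective serial fraction using the Karp-Flatt metric. This is a sketch under the assumption that the three runs are otherwise identical; the function name is illustrative. The "serial fraction" here lumps together everything that fails to speed up, including memory-bandwidth saturation, so it describes the symptom and does not identify the cause.

```python
# Wall-clock times in seconds for ex20, copied from the table above.
times = {1: 114.2, 2: 89.45, 4: 81.01}


def karp_flatt(speedup, processes):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / processes) / (1.0 - 1.0 / processes)


for p, t in times.items():
    s = times[1] / t
    line = f"p={p}: speedup {s:.2f}, efficiency {s / p:.2f}"
    if p > 1:
        line += f", effective serial fraction {karp_flatt(s, p):.2f}"
    print(line)
```

A fraction that stays high, or grows, as processes are added suggests that the extra cores are waiting on a shared resource and not doing useful work.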
>>>>> 
>>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>> 
>>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>> 
>>>>>>> Barry,
>>>>>>> 
>>>>>>> I understand what you are saying, but which example/options then are the
>>>>>>> best to measure scalability on a multi-core machine? I chose
>>>>>>> the nonlinear diffusion problem specifically because its inherent
>>>>>>> stiffness could probably provide noticeable scalability on a
>>>>>>> multi-core system. From your experience, do you think there is another
>>>>>>> example program that will demonstrate this more rigorously or
>>>>>>> clearly? Btw, I don't get good speedup even for 2 processes with
>>>>>>> ex20.c and that was the original motivation for this thread.
>>>>>> 
>>>>>>   Did you follow my instructions?
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>>> 
>>>>>>> Satish, I have now configured with --download-mpich, without the
>>>>>>> mpich-device. The results are given above. I will try with the options
>>>>>>> you provided, although I don't entirely understand what they mean, which
>>>>>>> kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>>>>>> 
>>>>>>> Vijay
>>>>>>> 
>>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>> 
>>>>>>>>   Ok, everything makes sense. Looks like you are using two-level
>>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>>>>>>>> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
>>>>>>>> problem redundantly on each process (each process is solving the entire
>>>>>>>> coarse grid problem using LU factorization). The time for the factorization is
>>>>>>>> (in the two process case)
>>>>>>>> 
>>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>> 
>>>>>>>> which is 74 percent of the total solve time (and 82 percent of the
>>>>>>>> flops). When three quarters of the entire run is not parallel at all, you cannot
>>>>>>>> expect much speedup. If you run with -snes_view it will display exactly the
>>>>>>>> solver being used. You cannot expect to understand the performance if you
>>>>>>>> don't understand what the solver is actually doing. Using a 20 by 20 by 20
>>>>>>>> coarse grid is generally a bad idea since the code spends most of the time
>>>>>>>> there; stick with something like 5 by 5 by 5.
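The effect of a large non-parallel share can be made concrete with Amdahl's law. The sketch below takes the 74 percent figure above as the non-parallel share of the two-process run and assumes, optimistically, that the redundant coarse solve costs the same on every process and that everything else parallelizes perfectly. Both function names are illustrative.

```python
def speedup_from_parallel_run(serial_share, processes):
    """Speedup implied when serial_share of the p-process run time is
    non-parallel and the remainder scaled perfectly (an assumption)."""
    return serial_share + (1.0 - serial_share) * processes


def amdahl_speedup(serial_fraction, processes):
    """Amdahl's law: serial_fraction is measured against the 1-process time."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processes)


share_2proc = 0.74                                   # from the 2-process log
best_case_2 = speedup_from_parallel_run(share_2proc, 2)
serial_fraction = share_2proc / best_case_2          # share of the 1-process time

print(f"best-case speedup on 2 processes: {best_case_2:.2f}")
for p in (4, 8):
    print(f"Amdahl bound on {p} processes: {amdahl_speedup(serial_fraction, p):.2f}")
print(f"limit for many processes: {1.0 / serial_fraction:.2f}")
```

Even under these generous assumptions the bound on two processes is about 1.26, and it never reaches 2 however many processes are added.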
>>>>>>>> 
>>>>>>>>  I suggest running with the default grid and -dmmg_nlevels 5; then the
>>>>>>>> coarse solve will take a trivial percentage of the run time.
>>>>>>>> 
>>>>>>>>  You should get pretty good speedup for 2 processes but not much
>>>>>>>> better speedup for four processes because, as Matt noted, the computation is
>>>>>>>> memory bandwidth limited; see
>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note
>>>>>>>> also that this is running multigrid, which is a fast solver but doesn't
>>>>>>>> scale in parallel as well as many slow algorithms. For example, if you run
>>>>>>>> -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
>>>>>>>> processors but crummy speed.
>>>>>>>> 
>>>>>>>>  Barry
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>> 
>>>>>>>>> Barry,
>>>>>>>>> 
>>>>>>>>> Please find attached the patch for the minor change to control the
>>>>>>>>> number of elements from the command line for snes/ex20.c. I know that
>>>>>>>>> this can be achieved with -grid_x etc. from the command line, but I
>>>>>>>>> thought this made the typing for the refinement process a little
>>>>>>>>> easier. I apologize if there was any confusion.
>>>>>>>>> 
>>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>>> Thanks.
>>>>>>>>> 
>>>>>>>>> Vijay
>>>>>>>>> 
>>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>  We need all the information from -log_summary to see what is going
>>>>>>>>>> on.
>>>>>>>>>> 
>>>>>>>>>>  Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>>> performance with fewer than 10,000 unknowns per process.
>>>>>>>>>> 
>>>>>>>>>>   Barry
>>>>>>>>>> 
>>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>> 
>>>>>>>>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>>>>>>>> 
>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20
>>>>>>>>>>> -log_summary
>>>>>>>>>>> 
>>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>> 
>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20
>>>>>>>>>>> -log_summary
>>>>>>>>>>> 
>>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>> 
>>>>>>>>>>> I am not entirely sure I can make sense of those statistics, but
>>>>>>>>>>> if there is something more you need, please feel free to let me
>>>>>>>>>>> know.
>>>>>>>>>>> 
>>>>>>>>>>> Vijay
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan
>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Matt,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>>>>>>> performance studies, but I didn't expect it to yield the same
>>>>>>>>>>>>> CPU time as a single processor for snes/ex20, i.e., my runs
>>>>>>>>>>>>> with 1 and 2 processors take approximately the same amount of
>>>>>>>>>>>>> time to compute the solution. I am currently configuring
>>>>>>>>>>>>> without debugging symbols and shall let you know what that
>>>>>>>>>>>>> yields.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On a similar note, is there something extra that needs to be done
>>>>>>>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>>>>>>>>>>>>> if this is even related to PETSc; it could be an MPI configuration
>>>>>>>>>>>>> option that either I or the configure process is missing. All
>>>>>>>>>>>>> ideas are much appreciated.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation.
>>>>>>>>>>>> On most cheap multicore machines, there is a single memory bus,
>>>>>>>>>>>> and thus using more cores gains you very little extra performance.
>>>>>>>>>>>> I still suspect you are not actually running in parallel, because
>>>>>>>>>>>> you usually see a small speedup. That is why I suggested looking
>>>>>>>>>>>> at -log_summary, since it tells you how many processes were run
>>>>>>>>>>>> and breaks down the time.
>>>>>>>>>>>>    Matt
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley
>>>>>>>>>>>>> <knepley at gmail.com> wrote:
>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan
>>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am trying to configure my petsc install with an MPI
>>>>>>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>>>>>>> through without problems, the scalability of the programs
>>>>>>>>>>>>>>> doesn't seem to reflect what I expected.
>>>>>>>>>>>>>>> My configure options are
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/
>>>>>>>>>>>>>>> --download-mpich=1
>>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1
>>>>>>>>>>>>>>> --download-hypre=1
>>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to be
>>>>>>>>>>>>>>> taking approximately the same time, as noted from -log_summary.
>>>>>>>>>>>>>>> If it helps, I've been testing with
>>>>>>>>>>>>>>> snes/examples/tutorials/ex20.c for all purposes, with a custom
>>>>>>>>>>>>>>> -grid parameter from the command line to control the number of
>>>>>>>>>>>>>>> unknowns.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If this is something you've seen before with this configuration, or
>>>>>>>>>>>>>>> if you need anything else to analyze the problem, do let me
>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>> lead.
>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>> 
>>