[petsc-users] KSPSolve doesn't seem to scale. (Must be doing something wrong...)

Barry Smith bsmith at mcs.anl.gov
Sat Mar 15 16:31:49 CDT 2014


   Bill,

    It is great that you ran with -info to confirm that there are no excessive mallocs in the vector and matrix assemblies, and with -ksp_view to show the solver being used, but I recommend doing that in a separate run from the -log_summary run, because we make no attempt to optimize the -info and -xx_view options for performance.
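
    For example (the executable name ./thermal_solver here is just a placeholder for your application, and mpiexec may be mpirun on your system), something like

        mpiexec -n 4 ./thermal_solver -log_summary            # clean timing run
        mpiexec -n 4 ./thermal_solver -info -ksp_view         # separate diagnostic run

    keeps the extra diagnostic output out of the run you use for timings.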

    To begin the analysis I find it is best not to compare 1 to 2 processes, nor to compare at the highest number of processes, but instead to compare somewhere in the middle. Hence I look at the 2 and 4 process runs.

  1)   Looking at embarrassingly parallel operations

4procs


VecMAXPY            8677 1.0 6.9120e+00 1.0 8.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   6 35  0  0  0  4717
MatSolve            8677 1.0 6.9232e+00 1.1 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 15  0  0  0   6 15  0  0  0  1971
MatLUFactorNum     1 1.0 2.5489e-03 1.2 6.53e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1024
VecScale            8677 1.0 2.1447e+01 1.1 2.71e+08 1.0 0.0e+00 0.0e+00 0.0e+00 16  1  0  0  0  19  1  0  0  0    51
VecAXPY              508 1.0 8.9473e-01 1.4 3.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   142

2procs
VecMAXPY            8341 1.0 9.4324e+00 1.0 1.54e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 34  0  0  0  23 35  0  0  0  3261
MatSolve            8341 1.0 1.0210e+01 1.0 6.61e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 15  0  0  0  25 15  0  0  0  1294
MatLUFactorNum         1 1.0 4.0622e-03 1.1 1.32e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   650
VecScale            8341 1.0 1.0367e+00 1.3 5.21e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  1006
VecAXPY              502 1.0 3.5317e-02 1.7 6.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3553

These are routines where there is no communication between the MPI processes and no synchronization, so in an ideal situation one could hope for them to run TWICE as fast on twice the processes. For MatLUFactorNum, MatSolve, and VecMAXPY I calculated the ratios of the flop rates (4 processes over 2) as 1.57, 1.52, and 1.44. From this I conclude that the 4 MPI processes are sharing memory bandwidth, so you cannot expect to get a 2 times speedup.
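
For reference, those ratios come straight from the flop-rate column (the last column, Mflop/s) of the two tables above:

    MatLUFactorNum:  1024 / 650  ≈ 1.57
    MatSolve:        1971 / 1294 ≈ 1.52
    VecMAXPY:        4717 / 3261 ≈ 1.44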

But what is going on with VecScale and VecAXPY? Why is their performance falling through the floor? I noticed that you are using OpenBLAS, so I did some poking around on Google and found the following at https://github.com/xianyi/OpenBLAS/wiki/faq#what

If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following.

	• export OPENBLAS_NUM_THREADS=1 in the environment variables. Or
	• Call openblas_set_num_threads(1) in the application on runtime. Or
	• Build OpenBLAS single thread version, e.g. make USE_THREAD=0

Of course your application is not multi-threaded, it is MPI parallel, but you have the exact same problem: the cores are oversubscribed with too many threads, which kills the performance of some routines.

So please FORCE OpenBLAS to use only a single thread and rerun the 1, 2, 4, and 8 process cases with -log_summary and without the -info and -xxx_view options.
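
For example (again with a placeholder executable name, and assuming a bash-style shell), the environment-variable route from the FAQ above would be

    export OPENBLAS_NUM_THREADS=1
    mpiexec -n 4 ./thermal_solver -log_summary

Depending on your MPI you may need to explicitly forward the variable to the ranks (for example Open MPI's -x OPENBLAS_NUM_THREADS or Hydra's -genv OPENBLAS_NUM_THREADS 1).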

  2)  I now compare the 4 and 8 process cases for VecMAXPY and MatSolve

8procs
VecMAXPY            9336 1.0 3.0977e+00 1.0 4.59e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 35  0  0  0   5 35  0  0  0 11835
MatSolve            9336 1.0 3.0873e+00 1.1 1.82e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 14  0  0  0   4 14  0  0  0  4716

4procs
VecMAXPY            8677 1.0 6.9120e+00 1.0 8.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   6 35  0  0  0  4717
MatSolve            8677 1.0 6.9232e+00 1.1 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 15  0  0  0   6 15  0  0  0  1971

What the hey is going on here? The performance more than doubles!  From this I conclude that going from 4 to 8 processes is moving the computation to twice as many physical CPUs that DO NOT share memory bandwidth. 
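
For reference, the flop-rate ratios here are VecMAXPY 11835 / 4717 ≈ 2.5 and MatSolve 4716 / 1971 ≈ 2.4, well above the factor of 2 that perfect scaling would give.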

  A general observation: since the p cores on the same physical CPU generally share memory bandwidth, when you go from p/2 to p MPI processes on that CPU you will never see a doubling in performance (perfect speedup); you are actually lucky if you see the 1.5 times speedup that you are seeing. Thus as you increase the number of MPI processes to spread onto more and more physical CPUs you will see “funny jumps” in your speedup, depending on when the run switches to more physical CPUs (and hence more memory bandwidth). It is therefore important to understand “where” the program is actually running.
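
  To see where the ranks actually land, most MPI launchers can report the binding. For example, with Open MPI (an assumption about your MPI; MPICH's Hydra has analogous -bind-to options) something like

      mpiexec -n 8 --map-by socket --bind-to core --report-bindings ./thermal_solver -log_summary

  prints which socket and core each rank is bound to, and --map-by socket spreads the ranks round-robin over the sockets so they do not all pile onto one physical CPU.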

  So make the changes I recommend and send us the new set of -log_summary output, and we may be able to make more observations based on less “cluttered” data.

   Barry



On Mar 14, 2014, at 4:45 PM, William Coirier <William.Coirier at kratosdefense.com> wrote:

> I've written a parallel, finite-volume, transient thermal conduction solver using PETSc primitives, and so far things have been going great. Comparisons to theory for a simple problem (transient conduction in a semi-infinite slab) look good, but I'm not getting very good parallel scaling behavior with the KSP solver. Whether I use the default KSP/PC or other sensible combinations, the time spent in KSPSolve seems to not scale well at all.
> 
> I seem to have loaded up the problem well enough. The PETSc logging/profiling has been really useful for reworking various code segments, and right now, the bottleneck is KSPSolve, and I can't seem to figure out how to get it to scale properly.
> 
> I'm attaching output produced with -log_summary, -info, -ksp_view and -pc_view all specified on the command line for 1, 2, 4 and 8 processes.
> 
> If you guys have any suggestions, I'd definitely like to hear them! And I apologize in advance if I've done something stupid. All the documentation has been really helpful.
> 
> Thanks in advance...
> 
> Bill Coirier
> 
> --------------------------------------------------------------------------------------------------------------------
> 
> [Attachments: out.1, out.2, out.4, out.8]


