[petsc-users] Enquiry regarding log summary results

TAY wee-beng zonexo at gmail.com
Thu Oct 4 16:38:23 CDT 2012

On 4/10/2012 9:21 PM, Jed Brown wrote:
> Can you send a picture of what your domain looks like and what shape 
> the part owned by a given processor looks like? Best would be to write 
> out the mesh with a variable marking the rank owning each vertex, then 
> do a color plot in Paraview or whatever you use to show the partition.
> VecScatterBegin/End is taking much more time than these, and really a 
> pretty unreasonable amount of time in general.


I have attached my grid. I used a simple paint program to color one
particular partition. The grids are Cartesian. The center portion,
where the wings are (in blue), has much more closely spaced grid lines,
because the boundary layer there is important. Hence, although the
partitions there look thinner, the number of cells in each partition is
roughly the same.

As mentioned earlier, the grid is partitioned in the Z direction. Hence, 
the variables are allocated as u(1:size_x,1:size_y,ksta:kend), where 
ksta,kend refer to the starting/ending indices in the z direction. Same 
for v,w etc. I hope it is clear enough now.
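
As a rough illustration of this slab decomposition, here is a minimal
sketch in Python rather than the Fortran of the actual code. The helper
names slab_bounds and ghost_values and the grid sizes are hypothetical,
and it assumes a 7-point star stencil, so that each rank needs only one
x-y plane of ghost values from each z-neighbour:

```python
def slab_bounds(size_z, nprocs, rank):
    """Return 1-based inclusive (ksta, kend) for `rank` when size_z planes
    are split as evenly as possible among nprocs ranks (hypothetical helper)."""
    base, extra = divmod(size_z, nprocs)
    # The first `extra` ranks get one additional plane each.
    ksta = rank * base + min(rank, extra) + 1
    kend = ksta + base - 1 + (1 if rank < extra else 0)
    return ksta, kend


def ghost_values(size_x, size_y, nprocs, rank):
    """Ghost values a rank receives for one field, assuming a 7-point star
    stencil: one x-y plane per z-neighbour (hypothetical helper)."""
    neighbours = (1 if rank > 0 else 0) + (1 if rank < nprocs - 1 else 0)
    return neighbours * size_x * size_y


if __name__ == "__main__":
    size_x, size_y, size_z = 100, 100, 480  # made-up grid sizes
    for nprocs in (12, 24, 48):
        rank = nprocs // 2  # an interior rank with two z-neighbours
        ksta, kend = slab_bounds(size_z, nprocs, rank)
        owned = size_x * size_y * (kend - ksta + 1)
        ghosts = ghost_values(size_x, size_y, nprocs, rank)
        print(nprocs, (ksta, kend), owned, ghosts, ghosts / owned)
```

Under these assumptions the ghost count of an interior rank stays at
2*size_x*size_y regardless of the number of ranks, while the owned slab
shrinks, so boundary-only communication per rank should not get cheaper
when going from 24 to 48 ranks.
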
> On Thu, Oct 4, 2012 at 2:16 PM, Wee-Beng Tay <zonexo at gmail.com> wrote:
>     On 4/10/2012 5:11 PM, Matthew Knepley wrote:
>>     On Thu, Oct 4, 2012 at 11:01 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>         On 4/10/2012 3:40 AM, Matthew Knepley wrote:
>>>         On Wed, Oct 3, 2012 at 4:05 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>             Hi Jed,
>>>             I believe they are real cores. Anyway, I have attached
>>>             the log summary for the 12/24/48 cores. I re-ran a
>>>             smaller case because the large problem can't run with
>>>             12 cores.
>>>         Okay, look at VecScatterBegin/End for 24 and 48 cores (I am
>>>         guessing you have 4 16-core chips, but please figure this out).
>>>         The messages are logged in ScatterBegin, and the time is
>>>         logged in ScatterEnd. From 24 to 48 cores the time is cut in
>>>         half.
>>>         If you were only communicating the boundary, this is
>>>         completely backwards, so you are communicating a fair
>>>         fraction of ALL the values in a subdomain. Figure out why
>>>         your partition is so screwed up and this will go away.
>>         What do you mean by "If you were only communicating the
>>         boundary, this is completely backwards, so you are
>>         communicating a fair fraction of ALL the values in a subdomain"?
>>     If you have 48 partitions instead of 24, you have a larger
>>     interface, so AssemblyEnd() should take slightly longer.
>>     However, your AssemblyEnd() takes HALF the time, which means
>>     it's communicating far fewer values, which means you are not
>>     sending interface values, you are sending interior values,
>>     since the interior shrinks when you have more partitions.
>>     What this probably means is that your assembly routines are
>>     screwed up, and sending data all over the place.
>     Ok I got it now. Looking at the AssemblyEnd time,
>     12 procs
>     MatAssemblyEnd       145 1.0 1.6342e+01 1.8 0.00e+00 0.0 4.4e+01 6.0e+04 8.0e+00  0  0  0  0  0   0  0  0  0  0 0
>     VecAssemblyEnd       388 1.0 1.4472e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>     24 procs
>     MatAssemblyEnd       145 1.0 1.1618e+01 2.4 0.00e+00 0.0 9.2e+01 6.0e+04 8.0e+00  0  0  0  0  0   0  0  0  0  0 0
>     VecAssemblyEnd       388 1.0 2.3527e-03 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>     48 procs
>     MatAssemblyEnd       145 1.0 7.4327e+00 2.4 0.00e+00 0.0 1.9e+02 6.0e+04 8.0e+00  0  0  0  0  0   0  0  0  0  0
>     VecAssemblyEnd       388 1.0 2.8818e-03 3.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>     The VecAssemblyEnd time increases with the number of procs. Does
>     that mean there is nothing wrong with it?
>     On the other hand, the MatAssemblyEnd time decreases with the
>     number of procs. So is that where the problem lies?
>     I'm still scanning my code but haven't found the error yet. It
>     seems strange because I inserted the matrix and vector exactly the
>     same way for x,y,z. The u,v,w are also allocated with the same
>     indices. Shouldn't the error be the same for x, y and z too?
>     Trying to get some hints as to where else I need to look in my code...
>     Tks
>>        Matt
>>         I partition my domain in the z direction, as shown in the
>>         attached pic. The circled region is where the airfoils are.
>>         I'm using an immersed boundary method (IBM) code so the grid
>>         is all Cartesian.
>>         I created my Z matrix using:
>>         call MatCreateAIJ(MPI_COMM_WORLD,ijk_end-ijk_sta,ijk_end-ijk_sta,PETSC_DECIDE,PETSC_DECIDE,7,PETSC_NULL_INTEGER,7,PETSC_NULL_INTEGER,A_semi_z,ierr)
>>         where ijk_sta / ijk_end are the starting/ending global
>>         indices of the row.
>>         7 is because the star-stencil is used in 3D.
>>         I create my RHS vector using:
>>         call VecCreateMPI(MPI_COMM_WORLD,ijk_end-ijk_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>         The values for the matrix and vector were calculated before
>>         PETSc logging so they don't come into play.
>>         They are also done in a similar fashion for matrix x and y. I
>>         still don't understand why solving the z momentum eqn takes so
>>         much time. Which portion should I focus on?
>>         Tks!
>>>            Matt
>>>             Yours sincerely,
>>>             TAY wee-beng
>>>             On 3/10/2012 5:59 PM, Jed Brown wrote:
>>>>             There is an inordinate amount of time being spent in
>>>>             VecScatterEnd(). That sometimes indicates a very bad
>>>>             partition. Also, are your "48 cores" real physical
>>>>             cores or just "logical cores" (look like cores to the
>>>>             operating system, usually advertised as "threads" by
>>>>             the vendor, nothing like cores in reality)? That can
>>>>             cause a huge load imbalance and very confusing results
>>>>             as over-subscribed threads compete for shared
>>>>             resources. Step it back to 24 threads and 12 threads,
>>>>             send log_summary for each.
>>>>             On Wed, Oct 3, 2012 at 8:08 AM, TAY wee-beng
>>>>             <zonexo at gmail.com> wrote:
>>>>                 On 2/10/2012 2:43 PM, Jed Brown wrote:
>>>>>                 On Tue, Oct 2, 2012 at 8:35 AM, TAY wee-beng
>>>>>                 <zonexo at gmail.com> wrote:
>>>>>                     Hi,
>>>>>                     I have combined the momentum linear eqns
>>>>>                     involving x,y,z into 1 large matrix. The
>>>>>                     Poisson eqn is solved using HYPRE Struct
>>>>>                     format so it's not included. I ran the code
>>>>>                     for 50 timesteps (hence 50 KSPSolve calls) using
>>>>>                     96 procs. The log_summary is given below. I have
>>>>>                     some questions:
>>>>>                     1. After combining the matrix, I should have
>>>>>                     only 1 PETSc matrix. Why does it say there
>>>>>                     are 4 matrices, 12 vectors, etc.?
>>>>>                 They are part of preconditioning. Are you sure
>>>>>                 you're using Hypre for this? It looks like you are
>>>>>                 using bjacobi/ilu.
>>>>>                     2. I'm looking at the stages which take the
>>>>>                     longest time. It seems that MatAssemblyBegin,
>>>>>                     VecNorm, VecAssemblyBegin, VecScatterEnd have
>>>>>                     very high ratios. The ratios of some others
>>>>>                     are also not too good (~ 1.6 - 2). So are
>>>>>                     these stages the reason why my code is not
>>>>>                     scaling well? What can I do to improve it?
>>>>>                 3/4 of the solve time is evenly balanced between
>>>>>                 MatMult, MatSolve, MatLUFactorNumeric, and
>>>>>                 VecNorm+VecDot.
>>>>>                 The high VecAssembly time might be due to
>>>>>                 generating a lot of entries off-process?
>>>>>                 In any case, this looks like an _extremely_ slow
>>>>>                 network, perhaps it's misconfigured?
>>>>                 My cluster is configured with 48 procs per node. I
>>>>                 re-ran the case using only 48 procs, so there's
>>>>                 no need to pass over a 'slow' interconnect. I'm now
>>>>                 also using GAMG and BCGS for the poisson and
>>>>                 momentum eqn respectively. I have also separated
>>>>                 the x,y,z component of the momentum eqn to 3
>>>>                 separate linear eqns to debug the problem.
>>>>                 Results show that stage "momentum_z" is taking a
>>>>                 lot of time. I wonder if it has to do with the fact
>>>>                 that I am partitioning my grids in the z direction.
>>>>                 VecScatterEnd and MatMult are taking a lot of time.
>>>>                 The ratios for VecNormalize, VecScatterEnd, VecNorm
>>>>                 and VecAssemblyBegin are also not good.
>>>>                 I wonder why a lot of entries are generated
>>>>                 off-process.
>>>>                 I create my RHS vector using:
>>>>                 call VecCreateMPI(MPI_COMM_WORLD,ijk_xyz_end-ijk_xyz_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>>>                 where ijk_xyz_sta and ijk_xyz_end are obtained from
>>>>                 call MatGetOwnershipRange(A_semi_z,ijk_xyz_sta,ijk_xyz_end,ierr)
>>>>                 I then insert the values into the vector using:
>>>>                 call VecSetValues(b_rhs_semi_z, ijk_xyz_end - ijk_xyz_sta, (/ijk_xyz_sta : ijk_xyz_end - 1/), q_semi_vect_z(ijk_xyz_sta + 1 : ijk_xyz_end), INSERT_VALUES, ierr)
>>>>                 What should I do to correct the problem?
>>>>                 Thanks
>>>>>                     Btw, I insert matrix using:
>>>>>                     do ijk=ijk_xyz_sta+1,ijk_xyz_end
>>>>>                         II = ijk - 1    !Fortran shift to 0-based
>>>>>                         call MatSetValues(A_semi_xyz,1,II,7,int_semi_xyz(ijk,1:7),semi_mat_xyz(ijk,1:7),INSERT_VALUES,ierr)
>>>>>                     end do
>>>>>                     where ijk_xyz_sta/ijk_xyz_end are the
>>>>>                     starting/end index
>>>>>                     int_semi_xyz(ijk,1:7) stores the 7 column
>>>>>                     global indices
>>>>>                     semi_mat_xyz has the corresponding values.
>>>>>                     and I insert vectors using:
>>>>>                     call VecSetValues(b_rhs_semi_xyz,ijk_xyz_end_mz-ijk_xyz_sta_mz,(/ijk_xyz_sta_mz:ijk_xyz_end_mz-1/),q_semi_vect_xyz(ijk_xyz_sta_mz+1:ijk_xyz_end_mz),INSERT_VALUES,ierr)
>>>>>                     Thanks!
>>>>>                     Yours sincerely,
>>>>>                     TAY wee-beng
>>>>>                     On 30/9/2012 11:30 PM, Jed Brown wrote:
>>>>>>                     You can measure the time spent in Hypre via
>>>>>>                     PCApply and PCSetUp, but you can't get finer
>>>>>>                     grained integrated profiling because it was
>>>>>>                     not set up that way.
>>>>>>                     On Sep 30, 2012 3:26 PM, "TAY wee-beng"
>>>>>>                     <zonexo at gmail.com> wrote:
>>>>>>                         On 27/9/2012 1:44 PM, Matthew Knepley wrote:
>>>>>>>                         On Thu, Sep 27, 2012 at 3:49 AM, TAY
>>>>>>>                         wee-beng <zonexo at gmail.com> wrote:
>>>>>>>                             Hi,
>>>>>>>                             I'm doing a log summary for my 3d
>>>>>>>                             cfd code. I have some questions:
>>>>>>>                             1. if I'm solving 3 linear equations
>>>>>>>                             using ksp, is the result given in
>>>>>>>                             the log summary the total of the 3
>>>>>>>                             linear eqns' performance? How can I
>>>>>>>                             get the performance for each
>>>>>>>                             individual eqn?
>>>>>>>                         Use logging stages:
>>>>>>>                         http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogStagePush.html
>>>>>>>                             2. If I run my code for 10 time
>>>>>>>                             steps, does the log summary give
>>>>>>>                             the total or avg performance/ratio?
>>>>>>>                         Total.
>>>>>>>                             3. Besides PETSc, I'm also using
>>>>>>>                             HYPRE's native geometric MG (Struct)
>>>>>>>                             to solve my Cartesian-grid CFD
>>>>>>>                             Poisson eqn. Is there any way I can
>>>>>>>                             use PETSc's log summary to get
>>>>>>>                             HYPRE's performance? If I use
>>>>>>>                             BoomerAMG thru PETSc, can I get its
>>>>>>>                             performance?
>>>>>>>                         If you mean flops, only if you count
>>>>>>>                         them yourself and tell PETSc using
>>>>>>>                         http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogFlops.html
>>>>>>>                         This is the disadvantage of using
>>>>>>>                         packages that do not properly monitor
>>>>>>>                         things :)
>>>>>>>                             Matt
>>>>>>                         So you mean if I use BoomerAMG thru PETSc,
>>>>>>                         there is no proper way of evaluating its
>>>>>>                         performance, besides using PetscLogFlops?
>>>>>>>                             -- 
>>>>>>>                             Yours sincerely,
>>>>>>>                             TAY wee-beng
>>>>>>>                         -- 
>>>>>>>                         What most experimenters take for granted
>>>>>>>                         before they begin their experiments is
>>>>>>>                         infinitely more interesting than any
>>>>>>>                         results to which their experiments lead.
>>>>>>>                         -- Norbert Wiener

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 3d_grid.jpg
Type: image/jpeg
Size: 160251 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20121004/f2a896a4/attachment-0001.jpg>