[petsc-users] Enquiry regarding log summary results

Thu Oct 4 16:40:40 CDT 2012

On Thu, Oct 4, 2012 at 5:38 PM, TAY wee-beng <zonexo at gmail.com> wrote:

>  On 4/10/2012 9:21 PM, Jed Brown wrote:
>
> Can you send a picture of what your domain looks like and what shape the
> part owned by a given processor looks like? Best would be to write out the
> mesh with a variable marking the rank owning each vertex, then do a color
> plot in Paraview or whatever you use to show the partition.
>
>  VecScatterBegin/End is taking much more time than these, and really a
> pretty unreasonable amount of time in general.
>
>
> Hi,
>
> I have attached my grid. I just used simple paint software to color one
> particular partition. The grids are Cartesian. The center portion, where
> the wings are (in blue), has much more closely spaced grid lines because of
> the importance of the boundary layer. Hence, although the partitions there
> look thinner, the number of cells in each partition is roughly the same.
>
> As mentioned earlier, the grid is partitioned in the Z direction. Hence,
> the variables are allocated as u(1:size_x,1:size_y,ksta:kend), where
> ksta,kend refer to the starting/ending indices in the z direction. Same for
> v,w etc. I hope it is clear enough now.
>

This is way too many emails on this list. As I said before, the Mom-Z solve
is bad because the assembly of the operator is screwed up. You are
communicating too many values. So, just go into your code and count how many
off-process entries you set.
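
As a rough illustration of that count (a sketch only, written in Python
rather than Fortran for brevity; the function name and the sample indices
are made up for this example): an entry passed to MatSetValues() has to be
communicated during assembly when its global row lies outside the range
returned by MatGetOwnershipRange(), so counting those rows shows how much
the assembly is sending.

```python
def count_off_process_entries(rows, cols_per_row, rstart, rend):
    """Count matrix entries whose global row is outside [rstart, rend).

    rows         -- global (0-based) row index of each inserted row
    cols_per_row -- list of column-index lists, one per inserted row
    rstart, rend -- ownership range of this rank (rend is exclusive),
                    as MatGetOwnershipRange() would report it
    """
    off_process = 0
    for row, cols in zip(rows, cols_per_row):
        if not (rstart <= row < rend):
            # Every value in this row must be shipped to the owning rank.
            off_process += len(cols)
    return off_process


# Hypothetical data: this rank owns rows 100..199 and inserts five rows,
# each with three column entries. Rows 99 and 200 belong to other ranks.
rows = [99, 100, 150, 199, 200]
cols = [[98, 99, 100]] * 5
print(count_off_process_entries(rows, cols, 100, 200))
```

If the assembly is set up correctly for a row-wise partition, this count
should be zero or very small on every rank.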

   Matt

>  On Thu, Oct 4, 2012 at 2:16 PM, Wee-Beng Tay <zonexo at gmail.com> wrote:
>
>>  On 4/10/2012 5:11 PM, Matthew Knepley wrote:
>>
>> On Thu, Oct 4, 2012 at 11:01 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>>>  On 4/10/2012 3:40 AM, Matthew Knepley wrote:
>>>
>>> On Wed, Oct 3, 2012 at 4:05 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>
>>>>  Hi Jed,
>>>>
>>>> I believe they are real cores. Anyway, I have attached the log summary
>>>> for the 12/24/48 cores. I re-ran a smaller case because the large problem
>>>> can't run with 12 cores.
>>>>
>>>
>>>  Okay, look at VecScatterBegin/End for 24 and 48 cores (I am guessing
>>> you have 4 16-core chips, but please figure this out).
>>> The messages are logged in ScatterBegin, and the time is logged in
>>> ScatterEnd. From 24 to 48 cores the time is cut in half.
>>> If you were only communicating the boundary, this is completely
>>> backwards, so you are communicating a fair fraction of ALL
>>> the values in a subdomain. Figure out why your partition is so screwed
>>> up and this will go away.
>>>
>>>
>>> What do you mean by "If you were only communicating the boundary, this
>>> is completely backwards, so you are communicating a fair fraction of ALL
>>> the values in a subdomain"?
>>>
>>
>>  If you have 48 partitions instead of 24, you have a larger interface,
>> so AssemblyEnd() should take slightly longer. However, your AssemblyEnd()
>> takes HALF the time, which means it's communicating far fewer values,
>> which means you are not sending interface values, you are sending
>> interior values, since the interior shrinks when you have more partitions.
>>
>>  What this probably means is that your assembly routines are screwed up,
>> and sending data all over the place.
>>
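
To make the scaling argument concrete, here is a small sketch in Python. It
assumes a z-slab partition of an nx*ny*nz Cartesian grid with nz divisible
by the number of ranks; the grid size is hypothetical, not taken from the
attached logs. Interior values per rank halve when the rank count doubles,
while the total interface grows.

```python
def slab_sizes(nx, ny, nz, nprocs):
    """Return (interior values per rank, total interface values).

    Assumes the grid is cut into nprocs equal slabs along z, and that a
    7-point star stencil needs one plane of values on each side of a cut.
    """
    planes_per_rank = nz // nprocs
    interior_per_rank = nx * ny * planes_per_rank
    # nprocs - 1 cuts, each with one nx*ny plane on either side.
    interface_total = 2 * (nprocs - 1) * nx * ny
    return interior_per_rank, interface_total


# Hypothetical 64 x 64 x 96 grid on 12, 24 and 48 ranks.
for p in (12, 24, 48):
    interior, interface = slab_sizes(64, 64, 96, p)
    print(p, interior, interface)
```

A communication time that drops by half from 24 to 48 ranks follows the
first number, not the second.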
>>   Ok I got it now. Looking at the AssemblyEnd time,
>>
>> 12 procs
>>
>> MatAssemblyEnd       145 1.0 1.6342e+01 1.8 0.00e+00 0.0 4.4e+01 6.0e+04
>> 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> VecAssemblyEnd       388 1.0 1.4472e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> 24 procs
>>
>> MatAssemblyEnd       145 1.0 1.1618e+01 2.4 0.00e+00 0.0 9.2e+01 6.0e+04
>> 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> VecAssemblyEnd       388 1.0 2.3527e-03 2.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> 48 procs
>>
>> MatAssemblyEnd       145 1.0 7.4327e+00 2.4 0.00e+00 0.0 1.9e+02 6.0e+04
>> 8.0e+00  0  0  0  0  0   0  0  0  0  0
>>
>>
>> VecAssemblyEnd       388 1.0 2.8818e-03 3.7 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>
>> VecAssemblyEnd time increases with procs, does it mean that there is
>> nothing wrong with it?
>>
>> On the other hand, MatAssemblyEnd time decreases with procs. So that's
>> where the problem lies, is that so?
>>
>> I'm still scanning my code but haven't found the error yet. It seems
>> strange because I inserted the matrix and vector exactly the same way for
>> x,y,z. The u,v,w are also allocated with the same indices. Shouldn't the
>> error be the same for x, y and z too?
>>
>> Trying to get some hints as to where else I need to look in my code...
>>
>> Tks
>>
>>
>>
>>
>>
>>
>>
>>     Matt
>>
>>
>>>  I partition my domain in the z direction, as shown in the attached pic.
>>> The circled region is where the airfoils are. I'm using an immersed
>>> boundary method (IBM) code so the grid is all Cartesian.
>>>
>>> I created my Z matrix using:
>>>
>>> call MatCreateAIJ(MPI_COMM_WORLD,ijk_end-ijk_sta,ijk_end-ijk_sta,PETSC_DECIDE,PETSC_DECIDE,7,PETSC_NULL_INTEGER,7,PETSC_NULL_INTEGER,A_semi_z,ierr)
>>>
>>> where ijk_sta / ijk_end are the starting/ending global indices of the
>>> row.
>>>
>>> 7 is because the star-stencil is used in 3D.
>>>
>>> I create my RHS vector using:
>>>
>>> call VecCreateMPI(MPI_COMM_WORLD,ijk_end-ijk_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>>
>>> The values for the matrix and vector were calculated before PETSc
>>> logging so they don't come into play.
>>>
>>> They are also done in a similar fashion for matrix x and y. I still
>>> don't understand why solving the z momentum eqn takes so much time. Which
>>> portion should I focus on?
>>>
>>> Tks!
>>>
>>>
>>>     Matt
>>>
>>>
>>>>  Yours sincerely,
>>>>
>>>> TAY wee-beng
>>>>
>>>>  On 3/10/2012 5:59 PM, Jed Brown wrote:
>>>>
>>>> There is an inordinate amount of time being spent in VecScatterEnd().
>>>> That sometimes indicates a very bad partition. Also, are your "48 cores"
>>>> real physical cores or just "logical cores" (look like cores to the
>>>> operating system, usually advertised as "threads" by the vendor, nothing
>>>> like cores in reality)? That can cause a huge load imbalance and very
>>>> confusing results as over-subscribed threads compete for shared resources.
>>>> Step it back to 24 threads and 12 threads, send log_summary for each.
>>>>
>>>> On Wed, Oct 3, 2012 at 8:08 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>
>>>>>  On 2/10/2012 2:43 PM, Jed Brown wrote:
>>>>>
>>>>> On Tue, Oct 2, 2012 at 8:35 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>> I have combined the momentum linear eqns involving x,y,z into 1 large
>>>>>> matrix. The Poisson eqn is solved using HYPRE Struct format so it's not
>>>>>> included. I ran the code for 50 timesteps (hence 50 KSPSolve calls) using
>>>>>> 96 procs. The log_summary is given below. I have some questions:
>>>>>>
>>>>>> 1. After combining the matrix, I should have only 1 PETSc matrix. Why
>>>>>> does it say there are 4 matrices, 12 vectors etc?
>>>>>>
>>>>>
>>>>>  They are part of preconditioning. Are you sure you're using Hypre
>>>>> for this? It looks like you are using bjacobi/ilu.
>>>>>
>>>>>
>>>>>>
>>>>>> 2. I'm looking at the stages which take the longest time. It seems
>>>>>> that MatAssemblyBegin, VecNorm, VecAssemblyBegin, VecScatterEnd have very
>>>>>> high ratios. The ratios of some others are also not too good (~ 1.6 - 2).
>>>>>> So are these stages the reason why my code is not scaling well? What can I
>>>>>> do to improve it?
>>>>>>
>>>>>
>>>>>  3/4 of the solve time is evenly balanced between MatMult, MatSolve,
>>>>> MatLUFactorNumeric, and VecNorm+VecDot.
>>>>>
>>>>>  The high VecAssembly time might be due to generating a lot of
>>>>> entries off-process?
>>>>>
>>>>>  In any case, this looks like an _extremely_ slow network, perhaps
>>>>> it's misconfigured?
>>>>>
>>>>>
>>>>>  My cluster is configured with 48 procs per node. I re-ran the case
>>>>> using only 48 procs, so there's no need to pass over a 'slow'
>>>>> interconnect. I'm now also using GAMG and BCGS for the Poisson and
>>>>> momentum eqns respectively. I have also separated the x,y,z components of
>>>>> the momentum eqn into 3 separate linear eqns to debug the problem.
>>>>>
>>>>> Results show that stage "momentum_z" is taking a lot of time. I wonder
>>>>> if it has to do with the fact that I am partitioning my grids in the z
>>>>> direction. VecScatterEnd and MatMult are taking a lot of time. The ratios
>>>>> of VecNormalize, VecScatterEnd, VecNorm and VecAssemblyBegin are also not
>>>>> good.
>>>>>
>>>>> I wonder why a lot of entries are generated off-process.
>>>>>
>>>>> I create my RHS vector using:
>>>>>
>>>>> call VecCreateMPI(MPI_COMM_WORLD,ijk_xyz_end-ijk_xyz_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>>>>
>>>>> where ijk_xyz_sta and ijk_xyz_end are obtained from
>>>>>
>>>>> call MatGetOwnershipRange(A_semi_z,ijk_xyz_sta,ijk_xyz_end,ierr)
>>>>>
>>>>> I then insert the values into the vector using:
>>>>>
>>>>> call VecSetValues(b_rhs_semi_z , ijk_xyz_end - ijk_xyz_sta ,
>>>>> (/ijk_xyz_sta : ijk_xyz_end - 1/) , q_semi_vect_z(ijk_xyz_sta + 1 :
>>>>> ijk_xyz_end) , INSERT_VALUES , ierr)
>>>>>
>>>>> What should I do to correct the problem?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Btw, I insert matrix using:
>>>>>>
>>>>>> do ijk=ijk_xyz_sta+1,ijk_xyz_end
>>>>>>
>>>>>>     II = ijk - 1    !Fortran shift to 0-based
>>>>>>
>>>>>>     call MatSetValues(A_semi_xyz,1,II,7,int_semi_xyz(ijk,1:7),semi_mat_xyz(ijk,1:7),INSERT_VALUES,ierr)
>>>>>>
>>>>>> end do
>>>>>>
>>>>>> where ijk_xyz_sta/ijk_xyz_end are the starting/end index
>>>>>>
>>>>>> int_semi_xyz(ijk,1:7) stores the 7 column global indices
>>>>>>
>>>>>> semi_mat_xyz has the corresponding values.
>>>>>>
>>>>>> and I insert vectors using:
>>>>>>
>>>>>> call VecSetValues(b_rhs_semi_xyz,ijk_xyz_end_mz-ijk_xyz_sta_mz,(/ijk_xyz_sta_mz:ijk_xyz_end_mz-1/),q_semi_vect_xyz(ijk_xyz_sta_mz+1:ijk_xyz_end_mz),INSERT_VALUES,ierr)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> Yours sincerely,
>>>>>>
>>>>>> TAY wee-beng
>>>>>>
>>>>>>  On 30/9/2012 11:30 PM, Jed Brown wrote:
>>>>>>
>>>>>> You can measure the time spent in Hypre via PCApply and PCSetUp, but
>>>>>> you can't get finer grained integrated profiling because it was not set up
>>>>>> that way.
>>>>>> On Sep 30, 2012 3:26 PM, "TAY wee-beng" <zonexo at gmail.com> wrote:
>>>>>>
>>>>>>>  On 27/9/2012 1:44 PM, Matthew Knepley wrote:
>>>>>>>
>>>>>>> On Thu, Sep 27, 2012 at 3:49 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm doing a log summary for my 3d cfd code. I have some questions:
>>>>>>>>
>>>>>>>> 1. if I'm solving 3 linear equations using ksp, is the result given
>>>>>>>> in the log summary the total of the 3 linear eqns' performance? How can I
>>>>>>>> get the performance for each individual eqn?
>>>>>>>>
>>>>>>>
>>>>>>>  Use logging stages:
>>>>>>> http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogStagePush.html
>>>>>>>
>>>>>>>
>>>>>>>> 2. If I run my code for 10 time steps, does the log summary give
>>>>>>>> the total or avg performance/ratio?
>>>>>>>>
>>>>>>>
>>>>>>>  Total.
>>>>>>>
>>>>>>>
>>>>>>>> 3. Besides PETSc, I'm also using HYPRE's native geometric MG
>>>>>>>> (Struct) to solve my Cartesian-grid CFD Poisson eqn. Is there any way I
>>>>>>>> can use PETSc's log summary to get HYPRE's performance? If I use BoomerAMG
>>>>>>>> thru PETSc, can I get its performance?
>>>>>>>
>>>>>>>
>>>>>>>  If you mean flops, only if you count them yourself and tell PETSc
>>>>>>> using
>>>>>>> http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogFlops.html
>>>>>>>
>>>>>>>  This is the disadvantage of using packages that do not properly
>>>>>>> monitor things :)
>>>>>>>
>>>>>>>      Matt
>>>>>>>
>>>>>>>
>>>>>>> So you mean if I use BoomerAMG thru PETSc, there is no proper way of
>>>>>>> evaluating its performance, besides using PetscLogFlops?
>>>>>>>
>>>>>>>
>>>>>>>> --
>>>>>>>> Yours sincerely,
>>>>>>>>
>>>>>>>> TAY wee-beng
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>> experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener