[petsc-users] Enquiry regarding log summary results

Matthew Knepley knepley at gmail.com
Thu Oct 4 14:49:11 CDT 2012


On Thu, Oct 4, 2012 at 3:16 PM, Wee-Beng Tay <zonexo at gmail.com> wrote:

>  On 4/10/2012 5:11 PM, Matthew Knepley wrote:
>
> On Thu, Oct 4, 2012 at 11:01 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>
>>  On 4/10/2012 3:40 AM, Matthew Knepley wrote:
>>
>> On Wed, Oct 3, 2012 at 4:05 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>>>  Hi Jed,
>>>
>>> I believe they are real cores. Anyway, I have attached the log summaries
>>> for the 12/24/48 cores. I re-ran a smaller case because the large problem
>>> can't run with 12 cores.
>>>
>>
>>  Okay, look at VecScatterBegin/End for 24 and 48 cores (I am guessing
>> you have 4 16-core chips, but please figure this out).
>> The messages are logged in ScatterBegin, and the time is logged in
>> ScatterEnd. From 24 to 48 cores the time is cut in half.
>> If you were only communicating the boundary, this is completely
>> backwards, so you are communicating a fair fraction of ALL
>> the values in a subdomain. Figure out why your partition is so screwed up
>> and this will go away.
>>
>>
>> What do you mean by "If you were only communicating the boundary, this is
>> completely backwards, so you are communicating a fair fraction of ALL the
>> values in a subdomain"?
>>
>
>  If you have 48 partitions instead of 24, you have a larger interface, so
> AssemblyEnd() should take slightly longer. However, your AssemblyEnd() takes
> HALF the time, which means it's communicating far fewer values. That means
> you are not sending interface values but interior values, since the interior
> shrinks when you have more partitions.
>
>  What this probably means is that your assembly routines are screwed up,
> and sending data all over the place.
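>
>  For instance, on a uniform N x N x N grid cut into P slabs along z, each
> process owns roughly N^3/P points, but only about 2N^2 of them lie on the
> interface. Doubling P halves the interior per process while the interface
> per process stays essentially fixed, so boundary-only communication should
> not drop by half going from 24 to 48 cores.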
>
>   Ok I got it now. Looking at the AssemblyEnd time,
>

No no no no no. You are looking at entirely the wrong numbers. You MUST look
at the Momentum-Z stage.

   Matt


> 12 procs
>
> MatAssemblyEnd       145 1.0 1.6342e+01 1.8 0.00e+00 0.0 4.4e+01 6.0e+04
> 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> VecAssemblyEnd       388 1.0 1.4472e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> 24 procs
>
> MatAssemblyEnd       145 1.0 1.1618e+01 2.4 0.00e+00 0.0 9.2e+01 6.0e+04
> 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> VecAssemblyEnd       388 1.0 2.3527e-03 2.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> 48 procs
>
> MatAssemblyEnd       145 1.0 7.4327e+00 2.4 0.00e+00 0.0 1.9e+02 6.0e+04
> 8.0e+00  0  0  0  0  0   0  0  0  0  0
>
>
> VecAssemblyEnd       388 1.0 2.8818e-03 3.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> The VecAssemblyEnd time increases with the number of procs; does that mean
> there is nothing wrong with it?
>
> On the other hand, the MatAssemblyEnd time decreases with the number of
> procs. So is that where the problem lies?
>
> I'm still scanning my code but haven't found the error yet. It seems
> strange because I insert the matrix and vector values in exactly the same
> way for x, y and z. The u, v, w arrays are also allocated with the same
> indices. Shouldn't the error be the same for x, y and z too?
>
> Trying to get some hints as to where else I need to look in my code...
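>
> One concrete check (a minimal sketch only; the loop bounds mirror my
> assembly loop, noff is a new tally variable, and the usual PETSc Fortran
> includes already in the file are assumed) would be to count, wherever the
> code calls MatSetValues for the z matrix (including any immersed-boundary
> corrections), how many global row indices fall outside the range reported
> by MatGetOwnershipRange. A large count on any rank would explain the
> MatAssemblyEnd traffic:
>
>     PetscInt istart, iend, ijk, II, noff
>     PetscErrorCode ierr
>
>     call MatGetOwnershipRange(A_semi_z, istart, iend, ierr)
>     noff = 0
>     do ijk = ijk_sta + 1, ijk_end
>        II = ijk - 1                 ! 0-based global row index
>        if (II < istart .or. II >= iend) noff = noff + 1
>     end do
>     print *, 'rows set off-process on this rank:', noff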
>
> Tks
>
>
>
>
>
>
>     Matt
>
>
>>  I partition my domain in the z direction, as shown in the attached pic.
>> The circled region is where the airfoils are. I'm using an immersed
>> boundary method (IBM) code so the grid is all Cartesian.
>>
>> I created my Z matrix using:
>>
>> call
>> MatCreateAIJ(MPI_COMM_WORLD,ijk_end-ijk_sta,ijk_end-ijk_sta,PETSC_DECIDE,PETSC_DECIDE,7,PETSC_NULL_INTEGER,7,PETSC_NULL_INTEGER,A_semi_z,ierr)
>>
>> where ijk_sta / ijk_end are the starting/ending global row indices owned by
>> each process.
>>
>> The 7 is because a 7-point star stencil is used in 3D.
>>
>> I create my RHS vector using:
>>
>> call
>> VecCreateMPI(MPI_COMM_WORLD,ijk_end-ijk_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>
>> The values for the matrix and vector were calculated before PETSc logging
>> so they don't come into play.
>>
>> The x and y matrices and vectors are set up in a similar fashion. I still
>> can't figure out why solving the z momentum eqn takes so much time. Which
>> portion should I focus on?
>>
>> Tks!
>>
>>
>>     Matt
>>
>>
>>>  Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>>  On 3/10/2012 5:59 PM, Jed Brown wrote:
>>>
>>> There is an inordinate amount of time being spent in VecScatterEnd().
>>> That sometimes indicates a very bad partition. Also, are your "48 cores"
>>> real physical cores or just "logical cores" (look like cores to the
>>> operating system, usually advertised as "threads" by the vendor, nothing
>>> like cores in reality)? That can cause a huge load imbalance and very
>>> confusing results as over-subscribed threads compete for shared resources.
>>> Step it back to 24 threads and 12 threads, send log_summary for each.
>>>
>>> On Wed, Oct 3, 2012 at 8:08 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>
>>>>  On 2/10/2012 2:43 PM, Jed Brown wrote:
>>>>
>>>> On Tue, Oct 2, 2012 at 8:35 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>>>
>>>>>  Hi,
>>>>>
>>>>> I have combined the momentum linear eqns involving x,y,z into 1 large
>>>>> matrix. The Poisson eqn is solved using HYPRE's Struct format so it's not
>>>>> included. I run the code for 50 timesteps (hence 50 KSPSolve calls) using
>>>>> 96 procs. The log_summary is given below. I have some questions:
>>>>>
>>>>> 1. After combining the matrices, I should have only 1 PETSc matrix. Why
>>>>> does it say there are 4 matrices, 12 vectors, etc.?
>>>>>
>>>>
>>>>  They are part of preconditioning. Are you sure you're using Hypre for
>>>> this? It looks like you are using bjacobi/ilu.
>>>>
>>>>
>>>>>
>>>>> 2. I'm looking at the events which take the longest time. It seems
>>>>> that MatAssemblyBegin, VecNorm, VecAssemblyBegin, VecScatterEnd have very
>>>>> high ratios. The ratios of some others are also not too good (~ 1.6 - 2).
>>>>> So are these events the reason why my code is not scaling well? What can I
>>>>> do to improve it?
>>>>>
>>>>
>>>>  3/4 of the solve time is evenly balanced between MatMult, MatSolve,
>>>> MatLUFactorNumeric, and VecNorm+VecDot.
>>>>
>>>>  The high VecAssembly time might be due to generating a lot of entries
>>>> off-process?
>>>>
>>>>  In any case, this looks like an _extremely_ slow network, perhaps
>>>> it's misconfigured?
>>>>
>>>>
>>>>  My cluster is configured with 48 procs per node. I re-ran the case
>>>> using only 48 procs, so there's no need to go over a 'slow' interconnect.
>>>> I'm now also using GAMG and BCGS for the Poisson and momentum eqns
>>>> respectively. I have also separated the x,y,z components of the momentum
>>>> eqn into 3 separate linear eqns to debug the problem.
>>>>
>>>> Results show that the "momentum_z" stage is taking a lot of time. I wonder
>>>> if it has to do with the fact that I am partitioning my grid in the z
>>>> direction. VecScatterEnd and MatMult are taking a lot of time, and the
>>>> ratios for VecNormalize, VecScatterEnd, VecNorm and VecAssemblyBegin are
>>>> also not good.
>>>>
>>>> I wonder why a lot of entries are generated off-process.
>>>>
>>>> I create my RHS vector using:
>>>>
>>>> call
>>>> VecCreateMPI(MPI_COMM_WORLD,ijk_xyz_end-ijk_xyz_sta,PETSC_DECIDE,b_rhs_semi_z,ierr)
>>>>
>>>> where ijk_xyz_sta and ijk_xyz_end are obtained from
>>>>
>>>> call MatGetOwnershipRange(A_semi_z,ijk_xyz_sta,ijk_xyz_end,ierr)
>>>>
>>>> I then insert the values into the vector using:
>>>>
>>>> call VecSetValues(b_rhs_semi_z , ijk_xyz_end - ijk_xyz_sta ,
>>>> (/ijk_xyz_sta : ijk_xyz_end - 1/) , q_semi_vect_z(ijk_xyz_sta + 1 :
>>>> ijk_xyz_end) , INSERT_VALUES , ierr)
>>>>
>>>> What should I do to correct the problem?
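>>>>
>>>> As a sanity check (a sketch only; istart/iend are new local variables, and
>>>> the usual PETSc Fortran includes are assumed), the range owned by
>>>> b_rhs_semi_z should coincide with ijk_xyz_sta/ijk_xyz_end; otherwise every
>>>> entry above is stashed and shipped during VecAssemblyBegin/End:
>>>>
>>>> PetscInt istart, iend
>>>> PetscErrorCode ierr
>>>>
>>>> call VecGetOwnershipRange(b_rhs_semi_z, istart, iend, ierr)
>>>> if (istart /= ijk_xyz_sta .or. iend /= ijk_xyz_end) then
>>>>    print *, 'RHS ownership mismatch:', istart, iend, ijk_xyz_sta, ijk_xyz_end
>>>> end if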
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Btw, I insert matrix using:
>>>>>
>>>>> do ijk = ijk_xyz_sta+1, ijk_xyz_end
>>>>>
>>>>>     II = ijk - 1    ! Fortran shift to 0-based
>>>>>
>>>>>     call MatSetValues(A_semi_xyz,1,II,7,int_semi_xyz(ijk,1:7),semi_mat_xyz(ijk,1:7),INSERT_VALUES,ierr)
>>>>>
>>>>> end do
>>>>>
>>>>> where ijk_xyz_sta/ijk_xyz_end are the starting/ending indices
>>>>>
>>>>> int_semi_xyz(ijk,1:7) stores the 7 column global indices
>>>>>
>>>>> semi_mat_xyz has the corresponding values.
>>>>>
>>>>> and I insert vectors using:
>>>>>
>>>>> call
>>>>> VecSetValues(b_rhs_semi_xyz,ijk_xyz_end_mz-ijk_xyz_sta_mz,(/ijk_xyz_sta_mz:ijk_xyz_end_mz-1/),q_semi_vect_xyz(ijk_xyz_sta_mz+1:ijk_xyz_end_mz),INSERT_VALUES,ierr)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>>  On 30/9/2012 11:30 PM, Jed Brown wrote:
>>>>>
>>>>> You can measure the time spent in Hypre via PCApply and PCSetUp, but
>>>>> you can't get finer grained integrated profiling because it was not set up
>>>>> that way.
>>>>> On Sep 30, 2012 3:26 PM, "TAY wee-beng" <zonexo at gmail.com> wrote:
>>>>>
>>>>>>  On 27/9/2012 1:44 PM, Matthew Knepley wrote:
>>>>>>
>>>>>> On Thu, Sep 27, 2012 at 3:49 AM, TAY wee-beng <zonexo at gmail.com>wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm doing a log summary for my 3D CFD code. I have some questions:
>>>>>>>
>>>>>>> 1. If I'm solving 3 linear equations using KSP, is the result given
>>>>>>> in the log summary the total performance over the 3 linear eqns? How can I
>>>>>>> get the performance for each individual eqn?
>>>>>>>
>>>>>>
>>>>>>  Use logging stages:
>>>>>> http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogStagePush.html
>>>>>>
>>>>>>
>>>>>>> 2. If I run my code for 10 time steps, does the log summary give
>>>>>>> the total or the average performance/ratio?
>>>>>>>
>>>>>>
>>>>>>  Total.
>>>>>>
>>>>>>
>>>>>>> 3. Besides PETSc, I'm also using HYPRE's native geometric MG
>>>>>>> (Struct) to solve the Poisson eqn on my Cartesian CFD grid. Is there any way
>>>>>>> I can use PETSc's log summary to get HYPRE's performance? If I use BoomerAMG
>>>>>>> through PETSc, can I get its performance?
>>>>>>
>>>>>>
>>>>>>  If you mean flops, only if you count them yourself and tell PETSc
>>>>>> using
>>>>>> http://www.mcs.anl.gov/petsc/petsc-dev/docs/manualpages/Profiling/PetscLogFlops.html
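>>>>>>
>>>>>>  For example (a sketch; the flop count below is a made-up placeholder you
>>>>>> would replace with your own estimate for the Struct solve):
>>>>>>
>>>>>> PetscLogDouble hypre_flops
>>>>>> PetscErrorCode ierr
>>>>>>
>>>>>> hypre_flops = 1.0d6                  ! your own operation count goes here
>>>>>> call PetscLogFlops(hypre_flops, ierr)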
>>>>>>
>>>>>>  This is the disadvantage of using packages that do not properly
>>>>>> monitor things :)
>>>>>>
>>>>>>      Matt
>>>>>>
>>>>>>
>>>>>> So you mean if I use BoomerAMG through PETSc, there is no proper way of
>>>>>> evaluating its performance, besides using PetscLogFlops?
>>>>>>
>>>>>>
>>>>>>> --
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

