Slow speed after changing from serial to parallel

Matthew Knepley knepley at gmail.com
Tue Apr 15 10:46:17 CDT 2008


1) Please never cut out parts of the summary. All the information is valuable,
    and most of the time necessary.

2) You seem to have a huge load imbalance (look at VecNorm). Do you partition
    the system yourself? How many processes is this?

3) You seem to be setting a huge number of off-process values in the matrix
    (see MatAssemblyBegin). Is this true? I would reorganize this part. A quick
    way to check points 2) and 3) is sketched below.
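
A minimal sketch of such a check, assuming the Fortran code quoted below (A_mat
is the parallel matrix, myid the MPI rank, and ksta_p/kend_p the cells each
process fills): print the rows PETSc actually assigned to each process next to
the rows that process is setting. If the two ranges disagree, most of the
MatSetValues calls are off-process and MatAssemblyBegin has to communicate them.

      PetscInt       Istart, Iend
      PetscErrorCode ierr

!     global rows this process owns in PETSc's row distribution
      call MatGetOwnershipRange(A_mat,Istart,Iend,ierr)

!     global rows this process actually fills (0-based, from the loop below)
      print *, 'rank', myid, 'owns rows ', Istart, 'to', Iend-1
      print *, 'rank', myid, 'fills rows', ksta_p, 'to', kend_p-1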

  Matt

On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote:
> Hi,
>
>  I have converted the Poisson eqn part of the CFD code to parallel. The grid
> size tested is 600x720. For the momentum eqn, I used another serial linear
> solver (NSPCG) to prevent mixing of results. Here's the output summary:
>
>  --- Event Stage 0: Main Stage
>
>  MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03
> 0.0e+00 10 11 100 100  0  10 11 100 100  0   217
>  MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00
> 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
>  MatLUFactorNum         1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
>  MatILUFactorSym        1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  *MatAssemblyBegin       1 1.0 5.6334e+01 853005.4 0.00e+00 0.0 0.0e+00
> 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0*
>  MatAssemblyEnd         1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03
> 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatGetRowIJ            1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatGetOrdering         1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatZeroEntries         1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  KSPGMRESOrthog      8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00
> 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
>  KSPSetup               2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  KSPSolve               1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03
> 1.7e+04 89 100 100 100 100  89 100 100 100 100   317
>  PCSetUp                2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>  PCSetUpOnBlocks        1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>  PCApply             8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00
> 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
>  VecMDot             8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00
> 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
>  *VecNorm             8777 1.0 1.8237e+02 10.2 2.13e+08 10.2 0.0e+00 0.0e+00
> 8.8e+03  9  2  0  0 51   9  2  0  0 51    42*
>  *VecScale            8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00
> 0.0e+00  0  1  0  0  0   0  1  0  0  0   636*
>  VecCopy              284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecSet              9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  VecAXPY              567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
>  VecMAXPY            8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00
> 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
>  VecAssemblyBegin       2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecAssemblyEnd         2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  *VecScatterBegin     8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03
> 0.0e+00  0  0 100 100  0   0  0 100 100  0     0*
>  *VecScatterEnd       8776 1.0 1.7747e+01 30.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0*
>  *VecNormalize        8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00
> 8.8e+03  9  4  0  0 51   9  4  0  0 51    62*
>
> ------------------------------------------------------------------------------------------------------------------------
>   Memory usage is given in bytes:
>   Object Type          Creations   Destructions     Memory   Descendants' Mem.
>     --- Event Stage 0: Main Stage
>               Matrix            4              4   49227380   0
>        Krylov Solver            2              2      17216   0
>       Preconditioner            2              2        256   0
>            Index Set            5              5    2596120   0
>                  Vec           40             40   62243224   0
>          Vec Scatter            1              1          0   0
> ========================================================================================================================
>  Average time to get PetscTime(): 4.05312e-07                  Average time
> for MPI_Barrier(): 7.62939e-07
>  Average time for zero size MPI_Send(): 2.02656e-06
>  OptionTable: -log_summary
>
>
>  The PETSc manual states that the ratio should be close to 1. There are quite a
> few *(in bold)* which are >1, and the one for MatAssemblyBegin seems to be very
> big. So what could be the cause?
>
>  I wonder if it has to do with the way I insert the matrix. My steps are as
> follows (Cartesian grid, i loops faster than j, Fortran):
>
>  For matrix A and rhs
>
>  Insert left extreme cells values belonging to myid
>
>  if (myid==0) then
>
>    insert corner cells values
>
>    insert south cells values
>
>    insert internal cells values
>
>  else if (myid==num_procs-1) then
>
>    insert corner cells values
>
>    insert north cells values
>
>    insert internal cells values
>
>  else
>
>    insert internal cells values
>
>  end if
>
>  Insert right extreme cells values belonging to myid
>
>  All these values are entered into a big_A(size_x*size_y,5) array, while int_A
> stores the positions of the values. I then do:
>
>  call MatZeroEntries(A_mat,ierr)
>
>  do k=ksta_p+1,kend_p        !for cells belonging to myid
>      do kk=1,5
>          II=k-1
>          JJ=int_A(k,kk)-1
>          call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
>      end do
>  end do
>
>  call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>
>  call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)
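
If the assembly really is shipping a large number of entries between processes,
a likely cause is that the rows filled here (ksta_p .. kend_p-1 after the k-1
shift) do not coincide with the rows PETSc assigned to this process. A minimal
sketch of one way around that, assuming A_mat is a parallel (MPIAIJ) matrix and
big_A/int_A hold the coefficients for the rows in question: either create A_mat
with an explicit local row count of kend_p-ksta_p so the two distributions
match, or loop over PETSc's own ownership range, as below.

      PetscInt       Istart, Iend, II, JJ, k, kk
      PetscErrorCode ierr

      call MatGetOwnershipRange(A_mat,Istart,Iend,ierr)
      call MatZeroEntries(A_mat,ierr)

!     insert only the global rows Istart .. Iend-1 owned by this process
      do k = Istart+1, Iend
          II = k-1
          do kk = 1,5
              JJ = int_A(k,kk)-1
              call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
          end do
      end do

      call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
      call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)

Preallocating the matrix (the d_nz/o_nz arguments of MatCreateMPIAIJ) also makes
a large difference to assembly time.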
>
>
>  I wonder if the problem lies here. I used the big_A array because I was
> migrating from an old linear solver. Lastly, I was told to widen my window
> to 120 characters. May I know how to do that?
>
>
>
>  Thank you very much.
>
>  Matthew Knepley wrote:
>
> > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> wrote:
> >
> > > Hi Matthew,
> > >
> > > I think you've misunderstood what I meant. What I'm trying to say is that
> > > initially I had a serial code. I tried to convert it to a parallel one.
> > > Then I tested it and it was pretty slow. Due to some work requirement, I
> > > needed to go back and make some changes to my code. Since the parallel
> > > version was not working well, I updated and changed the serial one.
> > >
> > > Well, that was a while ago, and now, due to the updates and changes, the
> > > serial code is different from the old converted parallel code. Some files
> > > were also deleted and I can't seem to get it working now. So I thought I
> > > might as well convert the new serial code to parallel. But I'm not very
> > > sure what I should do 1st.
> > >
> > > Maybe I should rephrase my question: if I just convert my Poisson
> > > equation subroutine from a serial PETSc to a parallel PETSc version, will
> > > it work? Should I expect a speedup? The rest of my code is still serial.
> >
> > You should, of course, only expect speedup in the parallel parts.
> >
> >  Matt
> >
> > > Thank you very much.
> > >
> > > Matthew Knepley wrote:
> > >
> > > > I am not sure why you would ever have two codes. I never do this. PETSc
> > > > is designed so that you write one code that runs in serial and
> > > > parallel. The PETSc part should look identical. To test, run the code
> > > > you have verified in serial and output PETSc data structures (like Mat
> > > > and Vec) using a binary viewer. Then run in parallel with the same
> > > > code, which will output the same structures. Take the two files and
> > > > write a small verification code that loads both versions and calls
> > > > MatEqual and VecEqual.
> > > >
> > > >  Matt
> > > >
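A minimal sketch of the verification code described above, assuming the serial
and the parallel run have each written the assembled matrix to a binary file
(the file names here are made up, and the include path and MatLoad calling
sequence follow current PETSc; older releases, such as the 2.3.x series current
at the time of this thread, used MatLoad(viewer,mattype,mat,ierr) instead):

      program chkmat
#include <petsc/finclude/petsc.h>
      use petsc
      implicit none

      Mat            Aser, Apar
      PetscViewer    v
      PetscBool      flg
      PetscErrorCode ierr

      call PetscInitialize(PETSC_NULL_CHARACTER,ierr)

!     matrix written by the serial run (PetscViewerBinaryOpen + MatView)
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD,'A_serial.bin', &
                                 FILE_MODE_READ,v,ierr)
      call MatCreate(PETSC_COMM_WORLD,Aser,ierr)
      call MatLoad(Aser,v,ierr)
      call PetscViewerDestroy(v,ierr)

!     matrix written by the parallel run
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD,'A_parallel.bin', &
                                 FILE_MODE_READ,v,ierr)
      call MatCreate(PETSC_COMM_WORLD,Apar,ierr)
      call MatLoad(Apar,v,ierr)
      call PetscViewerDestroy(v,ierr)

!     entrywise comparison
      call MatEqual(Aser,Apar,flg,ierr)
      if (flg) then
          print *, 'matrices match'
      else
          print *, 'matrices differ'
      end if

      call MatDestroy(Aser,ierr)
      call MatDestroy(Apar,ierr)
      call PetscFinalize(ierr)
      end program chkmat

The same pattern with VecLoad and VecEqual covers the right-hand-side and
solution vectors.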
> > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com> wrote:
> > > >
> > > > > Thank you Matthew. Sorry to trouble you again.
> > > > >
> > > > > I tried to run it with -log_summary output and I found that there
> > > > > were some errors in the execution. Well, I was busy with other things
> > > > > and I just came back to this problem. Some of my files on the server
> > > > > have also been deleted. It has been a while, and I remember that it
> > > > > worked before, only much slower.
> > > > >
> > > > > Anyway, most of the serial code has been updated, and maybe it's
> > > > > easier to convert the new serial code instead of debugging the old
> > > > > parallel code now. I believe I can still reuse part of the old
> > > > > parallel code. However, I hope I can approach it better this time.
> > > > >
> > > > > So suppose I need to start converting my new serial code to parallel.
> > > > > There are 2 eqns to be solved using PETSc, the momentum and the
> > > > > Poisson. I also need to parallelize other parts of my code. I wonder
> > > > > which route is the best:
> > > > >
> > > > > 1. Don't change the PETSc part, ie continue using PETSC_COMM_SELF,
> > > > > and modify other parts of my code to parallel, e.g. looping, updating
> > > > > of values etc. Once the execution is fine and the speedup is
> > > > > reasonable, then modify the PETSc part - Poisson eqn 1st, followed by
> > > > > the momentum eqn.
> > > > >
> > > > > 2. Reverse the above order, ie modify the PETSc part - Poisson eqn
> > > > > 1st, followed by the momentum eqn. Then do the other parts of my
> > > > > code.
> > > > >
> > > > > I'm not sure if the above 2 methods can work or if there will be
> > > > > conflicts. Of course, an alternative will be:
> > > > >
> > > > > 3. Do the Poisson eqn, the momentum eqn and the other parts of the
> > > > > code separately. That is, code a standalone parallel Poisson eqn and
> > > > > use sample values to test it. Same for the momentum eqn and the other
> > > > > parts of the code. When each of them is working, combine them to form
> > > > > the full parallel code. However, this will be much more troublesome.
> > > > >
> > > > > I hope someone can give me some recommendations.
> > > > >
> > > > > Thank you once again.
> > > > >
> > > > > Matthew Knepley wrote:
> > > > >
> > > > > > 1) There is no way to have any idea what is going on in your code
> > > > > > without -log_summary output.
> > > > > >
> > > > > > 2) Looking at that output, look at the percentage taken by the
> > > > > > solver, the KSPSolve event. I suspect it is not the biggest
> > > > > > component, because it is very scalable.
> > > > > >
> > > > > >  Matt
> > > > > >
> > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo at gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've a serial 2D CFD code. As my grid size requirement increases,
> > > > > > > the simulation takes longer. Also, the memory requirement becomes
> > > > > > > a problem. The grid size has reached 1200x1200. Going higher is
> > > > > > > not possible due to the memory problem.
> > > > > > >
> > > > > > > I tried to convert my code to a parallel one, following the
> > > > > > > examples given. I also need to restructure parts of my code to
> > > > > > > enable parallel looping. I 1st changed the PETSc solver to be
> > > > > > > parallel enabled and then I restructured parts of my code. I
> > > > > > > proceeded as long as the answer for a simple test case was
> > > > > > > correct. I thought it's not really possible to do any speed
> > > > > > > testing since the code is not fully parallelized yet. When I
> > > > > > > finished most of the conversion, I found in the actual run that
> > > > > > > it is much slower, although the answer is correct.
> > > > > > >
> > > > > > > So what is the remedy now? I wonder what I should do to check
> > > > > > > what's wrong. Must I restart everything again? Btw, my grid size
> > > > > > > is 1200x1200. I believed it should be suitable for a parallel run
> > > > > > > on 4 processors? Is that so?
> > > > > > >
> > > > > > > Thank you.



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener



