Slow speed after changing from serial to parallel

Tue Apr 15 11:09:10 CDT 2008

   It is taking 8776 iterations of GMRES! How many does it take on one
process? This is a huge amount.

MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120

One process is spending 2.9 times as long in the embarrassingly
parallel MatSolve as the other process; this indicates a huge
imbalance in the number of nonzeros on each process. As Matt noticed,
the partitioning between the two processes is terrible.
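
To make the Ratio column concrete, here is a minimal sketch in plain Python (not PETSc code). It recovers the fastest process's time from the Max and Ratio columns of the MatSolve row, and computes the same max/min ratio for per-process nonzero counts. The nonzero counts are hypothetical, chosen only for illustration; the real ones would have to come from the matrix itself.

```python
def max_min_ratio(per_process_values):
    """Return max/min over processes, the quantity shown in the Ratio columns."""
    lowest = min(per_process_values)
    if lowest <= 0:
        raise ValueError("ratio is undefined when a process reports a non-positive value")
    return max(per_process_values) / lowest


def slowest_and_fastest(max_value, ratio):
    """Recover (max, min) from the Max and Ratio columns of one event row."""
    return max_value, max_value / ratio


# MatSolve row from the log: Max time 2.8379e+02 s, Ratio 2.9
t_max, t_min = slowest_and_fastest(2.8379e+02, 2.9)
print(f"MatSolve: slowest {t_max:.1f} s, fastest about {t_min:.1f} s")

# Hypothetical per-process nonzero counts, for illustration only
nnz = [1_600_000, 550_000]
print(f"nonzero imbalance: {max_min_ratio(nnz):.1f}x")
```

A ratio near 1.0 in both places is what a balanced partition would be expected to show.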

   Barry

On Apr 15, 2008, at 10:56 AM, Ben Tay wrote:
> Oh sorry here's the whole information. I'm using 2 processors  
> currently:
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>                        Max       Max/Min        Avg      Total
> Time (sec):           1.114e+03      1.00054   1.114e+03
> Objects:              5.400e+01      1.00000   5.400e+01
> Flops:                1.574e+11      1.00000   1.574e+11  3.147e+11
> Flops/sec:            1.414e+08      1.00054   1.413e+08  2.826e+08
> MPI Messages:         8.777e+03      1.00000   8.777e+03  1.755e+04
> MPI Message Lengths:  4.213e+07      1.00000   4.800e+03  8.425e+07
> MPI Reductions:       8.644e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> 0:      Main Stage: 1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 100.0%  4.800e+03      100.0%  1.729e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops/sec: Max - maximum over all processors
>                      Ratio - ratio of maximum to minimum over all processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>     %T - percent time in this phase         %F - percent flops in this phase
>     %M - percent messages in this phase     %L - percent message lengths in this phase
>     %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
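
The Total Mflop/s definition in the legend can be checked against the totals of this run. A minimal sketch in plain Python, assuming the "10e-6" factor in the legend is meant as 10^-6, and using per-process values reconstructed from the Avg, Max and Max/Min columns of the summary (so they are approximations, not logged per-process numbers):

```python
def total_mflops(flops_per_process, time_per_process):
    """1e-6 * (sum of flops over all processes) / (max time over all processes)."""
    return 1e-6 * sum(flops_per_process) / max(time_per_process)


# Approximate per-process values taken from the summary:
# Flops Avg 1.574e+11 on each of 2 processes, Time Max 1.114e+03 s with Max/Min 1.00054.
flops = [1.574e+11, 1.574e+11]
times = [1.114e+03, 1.114e+03 / 1.00054]

rate = total_mflops(flops, times)
print(f"whole run: about {rate:.1f} Mflop/s")
```

The result should land close to the Flops/sec Total of 2.826e+08 reported in the summary.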
>
>
>     ##########################################################
>     #                                                        #
>     #                          WARNING!!!                    #
>     #                                                        #
>     #   This code was run without the PreLoadBegin()         #
>     #   macros. To get timing results we always recommend    #
>     #   preloading. otherwise timing numbers may be          #
>     #   meaningless.                                         #
>     ##########################################################
>
>
> Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
>                  Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
> MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
> MatLUFactorNum         1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
> MatILUFactorSym        1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin       1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0
> MatAssemblyEnd         1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPGMRESOrthog      8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
> KSPSetup               2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100  89100100100100   317
> PCSetUp                2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
> PCSetUpOnBlocks        1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
> PCApply             8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
> VecMDot             8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
> VecNorm             8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03  9  2  0  0 51   9  2  0  0 51    42
> VecScale            8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   636
> VecCopy              284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY              567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
> VecMAXPY            8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
> VecAssemblyBegin       2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin     8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecNormalize        8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03  9  4  0  0 51   9  4  0  0 51    62
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory  Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>             Matrix     4              4   49227380     0
>      Krylov Solver     2              2      17216     0
>     Preconditioner     2              2        256     0
>          Index Set     5              5    2596120     0
>                Vec    40             40   62243224     0
>        Vec Scatter     1              1          0     0
> ========================================================================================================================
> Average time to get PetscTime(): 4.05312e-07
> Average time for MPI_Barrier(): 7.62939e-07
> Average time for zero size MPI_Send(): 2.02656e-06
> OptionTable: -log_summary
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> Compiled without FORTRAN kernels                               
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Tue Jan  8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> -----------------------------------------
> Libraries compiled on Tue Jan  8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed  
> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O
> Using Fortran compiler: mpif90 -I. -fPIC -O
> -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O
> Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------
> 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (28major+153248minor)pagefaults 0swaps
> 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (18major+158175minor)pagefaults 0swaps
> Job  /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
>   TID HOST_NAME  COMMAND_LINE      STATUS                   TERMINATION_TIME
> ===== ========== ================  =======================  ===================
> 00000 atlas3-c05 time ./a.out -lo  Done                     04/15/2008 23:03:10
> 00001 atlas3-c05 time ./a.out -lo  Done                     04/15/2008 23:03:10
>
>
> I have a Cartesian grid of 600x720. Since there are 2 processors, it is
> partitioned into two 600x360 blocks. I just use:
>
>       call MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr)
>
>       call MatSetFromOptions(A_mat,ierr)
>
>       call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)
>
>       call KSPCreate(MPI_COMM_WORLD,ksp,ierr)
>
>       call VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr)
>
> total_k is actually size_x*size_y. Since it's 2D, the maximum number of
> values per row is 5. When you say setting off-process values, do you mean
> that I insert values from one processor into another? I thought I was
> inserting the values into the correct processor...
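
On the off-process question: an MPIAIJ matrix is distributed by contiguous blocks of rows, so "off-process" refers to values set in rows owned by another process. A minimal sketch in plain Python of how an even row split would look for this grid. The split formula is an assumption about what PETSC_DECIDE does (local size N/P, plus one for the first N mod P ranks); split_ownership and owner_of are hypothetical helper names, not PETSc calls.

```python
def split_ownership(global_rows, num_procs):
    """Contiguous row ranges [start, end) per rank, split as evenly as possible.

    Assumption: n_local = N // P + (1 if N % P > rank else 0).
    """
    ranges = []
    start = 0
    for rank in range(num_procs):
        n_local = global_rows // num_procs + (1 if global_rows % num_procs > rank else 0)
        ranges.append((start, start + n_local))
        start += n_local
    return ranges


def owner_of(row, ranges):
    """Rank whose [start, end) range contains the 0-based global row."""
    for rank, (start, end) in enumerate(ranges):
        if start <= row < end:
            return rank
    raise IndexError(f"row {row} is outside the matrix")


size_x, size_y = 600, 720
ranges = split_ownership(size_x * size_y, 2)
for rank, (start, end) in enumerate(ranges):
    lines = (end - start) // size_x
    print(f"rank {rank}: rows [{start}, {end}) = {lines} grid lines of {size_x} cells")
```

Under that assumption each rank gets 360 grid lines, which matches the 600x360 blocks described above. A value counts as off-process only if its row index falls outside the calling rank's range; its column index may point anywhere.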
>
> Thank you very much!
>
>
>
> Matthew Knepley wrote:
>> 1) Please never cut out parts of the summary. All the information is
>>    valuable and, most of the time, necessary.
>>
>> 2) You seem to have a huge load imbalance (look at VecNorm). Do you
>>    partition the system yourself? How many processes is this?
>>
>> 3) You seem to be setting a huge number of off-process values in the
>>    matrix (see MatAssemblyBegin). Is this true? I would reorganize this
>>    part.
>>
>>  Matt
>>
>> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have converted the Poisson equation part of the CFD code to parallel.
>>> The grid size tested is 600x720. For the momentum equation, I used
>>> another serial linear solver (nspcg) to prevent mixing of results.
>>> Here's the output summary:
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
>>> MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
>>> MatLUFactorNum         1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
>>> MatILUFactorSym        1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> *MatAssemblyBegin       1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0*
>>> MatAssemblyEnd         1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetRowIJ            1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetOrdering         1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatZeroEntries         1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPGMRESOrthog      8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
>>> KSPSetup               2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100  89100100100100   317
>>> PCSetUp                2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>>> PCSetUpOnBlocks        1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>>> PCApply             8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
>>> VecMDot             8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
>>> *VecNorm             8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03  9  2  0  0 51   9  2  0  0 51    42*
>>> *VecScale            8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   636*
>>> VecCopy              284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet              9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY              567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
>>> VecMAXPY            8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
>>> VecAssemblyBegin       2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> *VecScatterBegin     8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00  0  0100100  0   0  0100100  0     0*
>>> *VecScatterEnd       8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0*
>>> *VecNormalize        8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03  9  4  0  0 51   9  4  0  0 51    62*
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>  Memory usage is given in bytes:
>>>  Object Type          Creations   Destructions   Memory  Descendants' Mem.
>>>    --- Event Stage 0: Main Stage
>>>                 Matrix     4              4   49227380     0
>>>      Krylov Solver     2              2      17216     0
>>>     Preconditioner     2              2        256     0
>>>          Index Set     5              5    2596120     0
>>>                Vec    40             40   62243224     0
>>>        Vec Scatter     1              1          0     0
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 4.05312e-07
>>> Average time for MPI_Barrier(): 7.62939e-07
>>> Average time for zero size MPI_Send(): 2.02656e-06
>>> OptionTable: -log_summary
>>>
>>>
>>> The PETSc manual states that the ratio should be close to 1. There are
>>> quite a few *(in bold)* which are >1, and the MatAssemblyBegin ratio
>>> seems to be very big. So what could be the cause?
>>>
>>> I wonder if it has to do with the way I insert the matrix. My steps are:
>>> (Cartesian grids, i loops faster than j, Fortran)
>>>
>>> For matrix A and rhs
>>>
>>> Insert left extreme cells values belonging to myid
>>>
>>> if (myid==0) then
>>>
>>>   insert corner cells values
>>>
>>>   insert south cells values
>>>
>>>   insert internal cells values
>>>
>>> else if (myid==num_procs-1) then
>>>
>>>   insert corner cells values
>>>
>>>   insert north cells values
>>>
>>>   insert internal cells values
>>>
>>> else
>>>
>>>   insert internal cells values
>>>
>>> end if
>>>
>>> Insert right extreme cells values belonging to myid
>>>
>>> All these values are entered into a big_A(size_x*size_y,5) matrix.
>>> int_A stores the position of the values. I then do
>>>
>>> call MatZeroEntries(A_mat,ierr)
>>>
>>>   do k=ksta_p+1,kend_p   !for cells belonging to myid
>>>
>>>       do kk=1,5
>>>
>>>           II=k-1
>>>
>>>           JJ=int_A(k,kk)-1
>>>
>>>           call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
>>>
>>>       end do
>>>
>>>   end do
>>>
>>>   call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>>>
>>>   call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)
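
One thing that can be checked from the loop bounds alone is whether any row set in the loop is off-process. A minimal sketch in plain Python, assuming MatGetOwnershipRange returns the 0-based first owned row in ksta_p and one past the last owned row in kend_p, and assuming the even two-way split of the 600x720 grid. The helper names are hypothetical.

```python
def rows_set_by_loop(ksta_p, kend_p):
    """0-based rows II touched by 'do k=ksta_p+1,kend_p' with II=k-1."""
    return [k - 1 for k in range(ksta_p + 1, kend_p + 1)]


def all_rows_local(ksta_p, kend_p):
    """True if every row the loop sets lies in the owned range [ksta_p, kend_p)."""
    return all(ksta_p <= row < kend_p for row in rows_set_by_loop(ksta_p, kend_p))


# Ownership ranges for 432000 rows on 2 processes, assuming an even split
for rank, (ksta_p, kend_p) in enumerate([(0, 216000), (216000, 432000)]):
    rows = rows_set_by_loop(ksta_p, kend_p)
    print(rank, rows[0], rows[-1], all_rows_local(ksta_p, kend_p))
```

Under those assumptions every II stays inside the calling rank's own range, so the loop as written would not be inserting into rows owned by the other process.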
>>>
>>>
>>> I wonder if the problem lies here. I used the big_A matrix because I was
>>> migrating from an old linear solver. Lastly, I was told to widen my
>>> window to 120 characters. May I know how to do that?
>>>
>>>
>>>
>>> Thank you very much.
>>>
>>> Matthew Knepley wrote:
>>>
>>>
>>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>> Hi Matthew,
>>>>>
>>>>> I think you've misunderstood what I meant. What I'm trying to say is
>>>>> that initially I had a serial code. I tried to convert it to a parallel
>>>>> one. Then I tested it and it was pretty slow. Due to some work
>>>>> requirements, I needed to go back and make some changes to my code.
>>>>> Since the parallel code was not working well, I updated and changed the
>>>>> serial one.
>>>>>
>>>>> Well, that was a while ago and now, due to the updates and changes, the
>>>>> serial code is different from the old converted parallel code. Some
>>>>> files were also deleted and I can't seem to get it working now. So I
>>>>> thought I might as well convert the new serial code to parallel. But
>>>>> I'm not very sure what I should do first.
>>>>>
>>>>> Maybe I should rephrase my question: if I just convert my Poisson
>>>>> equation subroutine from a serial PETSc version to a parallel PETSc
>>>>> version, will it work? Should I expect a speedup? The rest of my code
>>>>> is still serial.
>>>>>
>>>>>
>>>>>
>>>> You should, of course, only expect speedup in the parallel parts
>>>>
>>>> Matt
>>>>
>>>>
>>>>
>>>>
>>>>> Thank you very much.
>>>>>
>>>>>
>>>>>
>>>>> Matthew Knepley wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I am not sure why you would ever have two codes. I never do this.
>>>>>> PETSc is designed so that you write one code that runs in both serial
>>>>>> and parallel. The PETSc part should look identical. To test, run the
>>>>>> code you have verified in serial and output PETSc data structures
>>>>>> (like Mat and Vec) using a binary viewer. Then run in parallel with
>>>>>> the same code, which will output the same structures. Take the two
>>>>>> files and write a small verification code that loads both versions
>>>>>> and calls MatEqual and VecEqual.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com>  
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thank you Matthew. Sorry to trouble you again.
>>>>>>>
>>>>>>> I tried to run it with -log_summary output and I found that there
>>>>>>> are some errors in the execution. Well, I was busy with other things
>>>>>>> and I just came back to this problem. Some of my files on the server
>>>>>>> have also been deleted. It has been a while and I remember that it
>>>>>>> worked before, only much slower.
>>>>>>>
>>>>>>> Anyway, most of the serial code has been updated and maybe it's
>>>>>>> easier to convert the new serial code instead of debugging the old
>>>>>>> parallel code now. I believe I can still reuse part of the old
>>>>>>> parallel code. However, I hope I can approach it better this time.
>>>>>>>
>>>>>>> So suppose I need to start converting my new serial code to
>>>>>>> parallel. There are 2 equations to be solved using PETSc, the
>>>>>>> momentum and Poisson equations. I also need to parallelize other
>>>>>>> parts of my code. I wonder which route is the best:
>>>>>>>
>>>>>>> 1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF,
>>>>>>> and modify other parts of my code to parallel, e.g. looping, updating
>>>>>>> of values, etc. Once the execution is fine and the speedup is
>>>>>>> reasonable, then modify the PETSc part: the Poisson equation first,
>>>>>>> followed by the momentum equation.
>>>>>>>
>>>>>>> 2. Reverse the above order, i.e. modify the PETSc part (the Poisson
>>>>>>> equation first, followed by the momentum equation), then do the other
>>>>>>> parts of my code.
>>>>>>>
>>>>>>> I'm not sure if the above 2 methods can work or if there will be
>>>>>>> conflicts. Of course, an alternative will be:
>>>>>>>
>>>>>>> 3. Do the Poisson and momentum equations and other parts of the code
>>>>>>> separately. That is, code a standalone parallel Poisson equation
>>>>>>> solver and use sample values to test it. Same for the momentum
>>>>>>> equation and other parts of the code. When each of them is working,
>>>>>>> combine them to form the full parallel code. However, this will be
>>>>>>> much more troublesome.
>>>>>>>
>>>>>>> I hope someone can give me some recommendations.
>>>>>>>
>>>>>>> Thank you once again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Matthew Knepley wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> 1) There is no way to have any idea what is going on in your code
>>>>>>>>    without -log_summary output.
>>>>>>>>
>>>>>>>> 2) Looking at that output, look at the percentage taken by the
>>>>>>>>    solver KSPSolve event. I suspect it is not the biggest component,
>>>>>>>>    because it is very scalable.
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo at gmail.com>  
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a serial 2D CFD code. As my grid size requirement increases,
>>>>>>>>> the simulation takes longer. Also, the memory requirement becomes a
>>>>>>>>> problem. The grid size has reached 1200x1200. Going higher is not
>>>>>>>>> possible due to the memory problem.
>>>>>>>>>
>>>>>>>>> I tried to convert my code to a parallel one, following the
>>>>>>>>> examples given. I also need to restructure parts of my code to
>>>>>>>>> enable parallel looping. I first changed the PETSc solver to be
>>>>>>>>> parallel enabled and then I restructured parts of my code. I
>>>>>>>>> proceeded as long as the answer for a simple test case was correct.
>>>>>>>>> I thought it was not really possible to do any speed testing since
>>>>>>>>> the code was not fully parallelized yet. When I had finished most
>>>>>>>>> of the conversion, I found that the actual run is much slower,
>>>>>>>>> although the answer is correct.
>>>>>>>>>
>>>>>>>>> So what is the remedy now? I wonder what I should do to check
>>>>>>>>> what's wrong. Must I restart everything again? By the way, my grid
>>>>>>>>> size is 1200x1200. I believed it should be suitable for a parallel
>>>>>>>>> run on 4 processors. Is that so?
>>>>>>>>>
>>>>>>>>> Thank you.