Slow speed after changing from serial to parallel

Tue Apr 15 11:44:17 CDT 2008

Hi,

Here's the summary for 1 processor. It also seems to take a long 
time... Can someone tell me where my mistakes might lie? Thank you 
very much!

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 
Wed Apr 16 00:39:22 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 
HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

                         Max       Max/Min        Avg      Total
Time (sec):           1.088e+03      1.00000   1.088e+03
Objects:              4.300e+01      1.00000   4.300e+01
Flops:                2.658e+11      1.00000   2.658e+11  2.658e+11
Flops/sec:            2.444e+08      1.00000   2.444e+08  2.444e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       1.460e+04      1.00000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N 
--> 2N flops
                            and VecAXPY() for complex vectors of length 
N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0877e+03 100.0%  2.6584e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.460e+04 100.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on 
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all 
processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() 
and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this 
phase
      %M - percent messages in this phase     %L - percent message 
lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time 
over all processors)
------------------------------------------------------------------------------------------------------------------------

      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was run without the PreLoadBegin()         #
      #   macros. To get timing results we always recommend    #
      #   preloading. otherwise timing numbers may be          #
      #   meaningless.                                         #
      ##########################################################

Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatMult             7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11  0  0  0  12 11  0  0  0   216
MatSolve            7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11  0  0  0  25 11  0  0  0   107
MatLUFactorNum         1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    88
MatILUFactorSym        1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog      7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 7.2e+03 52 72  0  0 49  52 72  0  0 49   341
KSPSetup               1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 1.5e+04 93100  0  0100  93100  0  0100   262
PCSetUp                1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    44
PCApply             7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11  0  0  0  25 11  0  0  0   107
VecMDot             7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 7.2e+03 25 36  0  0 49  25 36  0  0 49   359
VecNorm             7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 7.4e+03  2  2  0  0 51   2  2  0  0 51   374
VecScale            7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0   345
VecCopy              240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY              479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   206
VecMAXPY            7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 29 38  0  0  0  29 38  0  0  0   324
VecAssemblyBegin       2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize        7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 7.4e+03  2  4  0  0 51   2  4  0  0 51   364
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions   Memory  Descendants' Mem.

--- Event Stage 0: Main Stage

              Matrix     2              2   65632332     0
       Krylov Solver     1              1      17216     0
      Preconditioner     1              1        168     0
           Index Set     3              3    5185032     0
                 Vec    36             36  120987640     0
========================================================================================================================
Average time to get PetscTime(): 3.09944e-07
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
sizeof(PetscScalar) 8
Configure run at: Tue Jan  8 22:22:08 2008
Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 
--sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 
--sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 
--sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 
--with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 
--with-batch=1 --with-mpi-shared=0 
--with-mpi-include=/usr/local/topspin/mpi/mpich/include 
--with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a 
--with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun 
--with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t 
--with-shared=0 
-----------------------------------------
Libraries compiled on Tue Jan  8 22:34:13 SGT 2008 on atlas3-c01
Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 
12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
Using PETSc arch: atlas3-mpi
-----------------------------------------
Using C compiler: mpicc -fPIC -O  
Using Fortran compiler: mpif90 -I. -fPIC -O   
-----------------------------------------
Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 
-I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi 
-I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include 
-I/home/enduser/g0306332/lib/hypre/include 
-I/usr/local/topspin/mpi/mpich/include    
------------------------------------------
Using C linker: mpicc -fPIC -O
Using Fortran linker: mpif90 -I. -fPIC -O  
Using libraries: 
-Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi 
-L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts 
-lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc        
-Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib 
-L/home/enduser/g0306332/lib/hypre/lib -lHYPRE 
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 
-Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-Wl,-rpath,/usr/local/topspin/mpi/mpich/lib 
-L/usr/local/topspin/mpi/mpich/lib -lmpich 
-Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t 
-L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide 
-lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib 
-ldl -lmpich -libverbs -libumad -lpthread -lrt 
-Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
-L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc 
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 
-Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib 
-lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 
-Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib 
-ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 
-libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
-L/opt/intel/cce/9.1.049/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
-L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
-L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
------------------------------------------
639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (20major+172979minor)pagefaults 0swaps
Job  /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  
===================
00000 atlas3-c45 time ./a.out -lo  Done                     04/16/2008 
00:39:23
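
As a rough cross-check of the two summaries in this thread (the 1-processor run above and the 2-processor run quoted below), the headline figures can be compared with a few lines of arithmetic. This is only a sketch: the numbers are copied from the two logs, while the dictionary keys and the function name `summarize` are labels made up for illustration.

```python
# Figures copied from the two -log_summary outputs in this thread.
serial = {"procs": 1, "wall_s": 1.088e3, "ksp_solve_s": 1.0144e3,
          "matmults": 7412, "total_flops": 2.658e11}
parallel = {"procs": 2, "wall_s": 1.114e3, "ksp_solve_s": 9.9177e2,
            "matmults": 8776, "total_flops": 3.147e11}


def summarize(run):
    """Derive per-MatMult solve time and aggregate flop rate for one run."""
    return {
        "s_per_matmult": run["ksp_solve_s"] / run["matmults"],
        "flops_per_s": run["total_flops"] / run["wall_s"],
    }


# Speedup is serial wall time over parallel wall time; efficiency divides
# that by the number of processes used in the parallel run.
speedup = serial["wall_s"] / parallel["wall_s"]
efficiency = speedup / parallel["procs"]

for label, run in (("1 proc", serial), ("2 procs", parallel)):
    s = summarize(run)
    print(f"{label}: {s['s_per_matmult']:.4f} s per MatMult, "
          f"{s['flops_per_s']:.3e} flops/s")
print(f"speedup {speedup:.3f}, parallel efficiency {efficiency:.3f}")
```

The speedup comes out slightly below 1, which matches the subject line: the 2-processor run does more Krylov iterations and finishes no sooner than the serial run.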

Barry Smith wrote:
>
>    It is taking 8776 iterations of GMRES! How many does it take on one 
> process? This is a huge
> amount.
>
> MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 
> 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
> MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 
> 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
>
> One process is spending 2.9 times as long in the embarrassingly 
> parallel MatSolve as the other process;
> this indicates a huge imbalance in the number of nonzeros on each 
> process. As Matt noticed, the partitioning
> between the two processes is terrible.
>
>   Barry
>
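
The legend in the log defines the Ratio column as the maximum over the minimum across processors, so the time on the faster process can be recovered from the two printed numbers. A small sketch of that arithmetic follows; the values are copied from the 2-processor table, and the helper name `min_from_ratio` is made up for illustration.

```python
def min_from_ratio(max_time_s, ratio):
    """Time on the fastest process, given the max time and the max/min ratio."""
    return max_time_s / ratio


# (max time in seconds, max/min ratio) from the 2-processor summary.
events = {
    "MatSolve": (2.8379e2, 2.9),
    "MatMult": (1.5701e2, 2.2),
    "VecNorm": (1.8237e2, 10.2),
}

for name, (tmax, ratio) in events.items():
    tmin = min_from_ratio(tmax, ratio)
    print(f"{name:9s} max {tmax:7.1f} s   min {tmin:6.1f} s   gap {tmax - tmin:6.1f} s")
```

With only two processes the gap is the extra time one process spends relative to the other in that event. For events that include a reduction, such as VecNorm, part of that time may be waiting rather than computing.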
> On Apr 15, 2008, at 10:56 AM, Ben Tay wrote:
>> Oh sorry here's the whole information. I'm using 2 processors currently:
>>
>> ************************************************************************************************************************ 
>>
>> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript 
>> -r -fCourier9' to print this document            ***
>> ************************************************************************************************************************ 
>>
>>
>> ---------------------------------------------- PETSc Performance 
>> Summary: ----------------------------------------------
>>
>> ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by 
>> g0306332 Tue Apr 15 23:03:09 2008
>> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 
>> 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>>
>>                        Max       Max/Min        Avg      Total
>> Time (sec):           1.114e+03      1.00054   1.114e+03
>> Objects:              5.400e+01      1.00000   5.400e+01
>> Flops:                1.574e+11      1.00000   1.574e+11  3.147e+11
>> Flops/sec:            1.414e+08      1.00054   1.413e+08  2.826e+08
>> MPI Messages:         8.777e+03      1.00000   8.777e+03  1.755e+04
>> MPI Message Lengths:  4.213e+07      1.00000   4.800e+03  8.425e+07
>> MPI Reductions:       8.644e+03      1.00000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type 
>> (multiply/divide/add/subtract)
>>                           e.g., VecAXPY() for real vectors of length 
>> N --> 2N flops
>>                           and VecAXPY() for complex vectors of length 
>> N --> 8N flops
>>
>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- 
>> Messages ---  -- Message Lengths --  -- Reductions --
>>                       Avg     %Total     Avg     %Total   counts   
>> %Total     Avg         %Total   counts   %Total
>> 0:      Main Stage: 1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 
>> 100.0%  4.800e+03      100.0%  1.729e+04 100.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------ 
>>
>> See the 'Profiling' chapter of the users' manual for details on 
>> interpreting output.
>> Phase summary info:
>>  Count: number of times phase was executed
>>  Time and Flops/sec: Max - maximum over all processors
>>                      Ratio - ratio of maximum to minimum over all 
>> processors
>>  Mess: number of messages sent
>>  Avg. len: average message length
>>  Reduct: number of global reductions
>>  Global: entire computation
>>  Stage: stages of a computation. Set stages with PetscLogStagePush() 
>> and PetscLogStagePop().
>>     %T - percent time in this phase         %F - percent flops in 
>> this phase
>>     %M - percent messages in this phase     %L - percent message 
>> lengths in this phase
>>     %R - percent reductions in this phase
>>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time 
>> over all processors)
>> ------------------------------------------------------------------------------------------------------------------------ 
>>
>>
>>
>>     ##########################################################
>>     #                                                        #
>>     #                          WARNING!!!                    #
>>     #                                                        #
>>     #   This code was run without the PreLoadBegin()         #
>>     #   macros. To get timing results we always recommend    #
>>     #   preloading. otherwise timing numbers may be          #
>>     #   meaningless.                                         #
>>     ##########################################################
>>
>>
>> Event                Count      Time (sec)     
>> Flops/sec                         --- Global ---  --- Stage ---   Total
>>                  Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg 
>> len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------ 
>>
>>
>> --- Event Stage 0: Main Stage
>>
>> MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 
>> 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
>> MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 
>> 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
>> MatLUFactorNum         1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
>> MatILUFactorSym        1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyBegin       1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0
>> MatAssemblyEnd         1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 
>> 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetRowIJ            1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetOrdering         1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatZeroEntries         1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPGMRESOrthog      8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 
>> 0.0e+00 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
>> KSPSetup               2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 
>> 4.8e+03 1.7e+04 89100100100100  89100100100100   317
>> PCSetUp                2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 
>> 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>> PCSetUpOnBlocks        1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 
>> 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>> PCApply             8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 
>> 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
>> VecMDot             8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 
>> 0.0e+00 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
>> VecNorm             8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 
>> 0.0e+00 8.8e+03  9  2  0  0 51   9  2  0  0 51    42
>> VecScale            8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 
>> 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   636
>> VecCopy              284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet              9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> VecAXPY              567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
>> VecMAXPY            8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 
>> 0.0e+00 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
>> VecAssemblyBegin       2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAssemblyEnd         2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecScatterBegin     8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 
>> 4.8e+03 0.0e+00  0  0100100  0   0  0100100  0     0
>> VecScatterEnd       8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 
>> 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> VecNormalize        8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 
>> 0.0e+00 8.8e+03  9  4  0  0 51   9  4  0  0 51    62
>> ------------------------------------------------------------------------------------------------------------------------ 
>>
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions   Memory  Descendants' 
>> Mem.
>>
>> --- Event Stage 0: Main Stage
>>
>>             Matrix     4              4   49227380     0
>>      Krylov Solver     2              2      17216     0
>>     Preconditioner     2              2        256     0
>>          Index Set     5              5    2596120     0
>>                Vec    40             40   62243224     0
>>        Vec Scatter     1              1          0     0
>> ======================================================================================================================== 
>>
>> Average time to get PetscTime(): 4.05312e-07
>> Average time for MPI_Barrier(): 7.62939e-07
>> Average time for zero size MPI_Send(): 2.02656e-06
>> OptionTable: -log_summary
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>> sizeof(PetscScalar) 8
>> Configure run at: Tue Jan  8 22:22:08 2008
>> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 
>> --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 
>> --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 
>> --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel 
>> --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre 
>> --with-debugging=0 --with-batch=1 --with-mpi-shared=0 
>> --with-mpi-include=/usr/local/topspin/mpi/mpich/include 
>> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a 
>> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun 
>> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
>> -----------------------------------------
>> Libraries compiled on Tue Jan  8 22:34:13 SGT 2008 on atlas3-c01
>> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed 
>> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
>> Using PETSc arch: atlas3-mpi
>> -----------------------------------------
>> Using C compiler: mpicc -fPIC -O
>> Using Fortran compiler: mpif90 -I. -fPIC -O
>> -----------------------------------------
>> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 
>> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi 
>> -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include 
>> -I/home/enduser/g0306332/lib/hypre/include 
>> -I/usr/local/topspin/mpi/mpich/include    
>> ------------------------------------------
>> Using C linker: mpicc -fPIC -O
>> Using Fortran linker: mpif90 -I. -fPIC -O
>> Using libraries: 
>> -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi 
>> -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts 
>> -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc        
>> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib 
>> -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib 
>> -L/usr/local/topspin/mpi/mpich/lib -lmpich 
>> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t 
>> -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide 
>> -lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib 
>> -ldl -lmpich -libverbs -libumad -lpthread -lrt 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
>> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/intel/fce/9.1.045/lib 
>> -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard 
>> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -Wl,-rpath,/usr/local/ofed/lib64 
>> -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib 
>> -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich 
>> -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs 
>> -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib 
>> -L/opt/intel/cce/9.1.049/lib 
>> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ 
>> -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 
>> -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
>> ------------------------------------------
>> 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata 
>> 0maxresident)k
>> 0inputs+0outputs (28major+153248minor)pagefaults 0swaps
>> 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata 
>> 0maxresident)k
>> 0inputs+0outputs (18major+158175minor)pagefaults 0swaps
>> Job  /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
>>            TID   HOST_NAME   COMMAND_LINE            
>> STATUS            TERMINATION_TIME
>> ===== ========== ================  =======================  
>> ===================
>> 00000 atlas3-c05 time ./a.out -lo  Done                     
>> 04/15/2008 23:03:10
>> 00001 atlas3-c05 time ./a.out -lo  Done                     
>> 04/15/2008 23:03:10
>>
>>
>> I have a 600x720 Cartesian grid. Since there are 2 processors, it is 
>> partitioned into two 600x360 blocks. I just use:
>>
>> call 
>> MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) 
>>
>>
>>       call MatSetFromOptions(A_mat,ierr)
>>
>>       call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)
>>
>>       call KSPCreate(MPI_COMM_WORLD,ksp,ierr)
>>
>>       call 
>> VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr)
>>
>> total_k is actually size_x*size_y. Since it's 2D, the maximum number 
>> of values per row is 5. When you say setting off-process values, do you 
>> mean I insert values from one processor into another? I thought I 
>> inserted the values into the correct processor...
>>
>> Thank you very much!
>>
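
One way to check the off-process question is to compare the rows each process sets against the half-open, zero-based range [ksta_p, kend_p) returned by MatGetOwnershipRange. The sketch below mimics that check in plain Python. It assumes PETSC_DECIDE splits the rows into contiguous, near-equal blocks with any remainder going to the lowest ranks; the function names are hypothetical, and in the real code the range should come from MatGetOwnershipRange rather than from this formula.

```python
def ownership_range(n_global, n_procs, rank):
    """Half-open, zero-based [start, end) row range owned by `rank`.

    Assumption: contiguous, near-equal blocks, with the first
    (n_global % n_procs) ranks taking one extra row each.
    """
    base, extra = divmod(n_global, n_procs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def off_process_rows(rows, rank, n_global, n_procs):
    """Rows in `rows` that `rank` does not own (these would be communicated)."""
    start, end = ownership_range(n_global, n_procs, rank)
    return [r for r in rows if not (start <= r < end)]


size_x, size_y = 600, 720
total_k = size_x * size_y  # 432000 rows, as in the post

ranges = [ownership_range(total_k, 2, r) for r in range(2)]
print(ranges)

# Hypothetical example: rank 0 sets a few rows around the partition boundary.
print(off_process_rows(range(215998, 216002), 0, total_k, 2))
```

Under this assumption each process owns 216000 rows, which agrees with the 600x360 blocks described above. Any row a process sets outside its own range is stashed and sent to the owner during assembly.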
>>
>>
>> Matthew Knepley wrote:
>>> 1) Please never cut out parts of the summary. All the information is 
>>> valuable,
>>>    and most times, necessary
>>>
>>> 2) You seem to have huge load imbalance (look at VecNorm). Do you 
>>> partition
>>>    the system yourself? How many processes is this?
>>>
>>> 3) You seem to be setting a huge number of off-process values in the 
>>> matrix
>>>    (see MatAssemblyBegin). Is this true? I would reorganize this part.
>>>
>>>  Matt
>>>
>>> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have converted the Poisson eqn part of the CFD code to parallel. 
>>>> The grid
>>>> size tested is 600x720. For the momentum eqn, I used another serial 
>>>> linear
>>>> solver (nspcg) to prevent mixing of results. Here's the output 
>>>> summary:
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> MatMult             8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11 100 100  0  10 11 100 100  0   217
>>>> MatSolve            8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
>>>> MatLUFactorNum         1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
>>>> MatILUFactorSym        1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> *MatAssemblyBegin       1 1.0 5.6334e+01 853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0*
>>>> MatAssemblyEnd         1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatGetRowIJ            1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatGetOrdering         1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatZeroEntries         1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPGMRESOrthog      8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
>>>> KSPSetup               2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSolve               1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89 100 100 100 100  89 100 100 100 100   317
>>>> PCSetUp                2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>>>> PCSetUpOnBlocks        1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
>>>> PCApply             8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
>>>> VecMDot             8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
>>>> *VecNorm             8777 1.0 1.8237e+02 10.2 2.13e+08 10.2 0.0e+00 0.0e+00 8.8e+03  9  2  0  0 51   9  2  0  0 51    42*
>>>> *VecScale            8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   636*
>>>> VecCopy              284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecSet              9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> VecAXPY              567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
>>>> VecMAXPY            8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
>>>> VecAssemblyBegin       2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecAssemblyEnd         2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> *VecScatterBegin     8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00  0  0 100 100  0   0  0 100 100  0     0*
>>>> *VecScatterEnd       8776 1.0 1.7747e+01 30.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0*
>>>> *VecNormalize        8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03  9  4  0  0 51   9  4  0  0 51    62*
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------ 
>>>>
>>>>  Memory usage is given in bytes:
>>>>  Object Type          Creations   Destructions   Memory  Descendants' Mem.
>>>>    --- Event Stage 0: Main Stage
>>>>                 Matrix     4              4   49227380     0
>>>>      Krylov Solver     2              2      17216     0
>>>>     Preconditioner     2              2        256     0
>>>>          Index Set     5              5    2596120     0
>>>>                Vec    40             40   62243224     0
>>>>        Vec Scatter     1              1          0     0
>>>> ======================================================================================================================== 
>>>>
>>>> Average time to get PetscTime(): 4.05312e-07
>>>> Average time for MPI_Barrier(): 7.62939e-07
>>>> Average time for zero size MPI_Send(): 2.02656e-06
>>>> OptionTable: -log_summary
>>>>
>>>>
>>>> The PETSc manual states that the ratio should be close to 1. Quite a
>>>> few entries *(in bold)* have a ratio >1, and the ratio for
>>>> MatAssemblyBegin is very large. What could be the cause?
>>>>
>>>> I wonder if it has to do with the way I insert the matrix. My steps are
>>>> (cartesian grid, i loops faster than j, Fortran):
>>>>
>>>> For matrix A and rhs
>>>>
>>>> Insert left extreme cells values belonging to myid
>>>>
>>>> if (myid==0) then
>>>>
>>>>   insert corner cells values
>>>>
>>>>   insert south cells values
>>>>
>>>>   insert internal cells values
>>>>
>>>> else if (myid==num_procs-1) then
>>>>
>>>>   insert corner cells values
>>>>
>>>>   insert north cells values
>>>>
>>>>   insert internal cells values
>>>>
>>>> else
>>>>
>>>>   insert internal cells values
>>>>
>>>> end if
>>>>
>>>> Insert right extreme cells values belonging to myid
>>>>
>>>> All these values are entered into a big_A(size_x*size_y,5) matrix.
>>>> int_A stores the position of the values. I then do
>>>>
>>>> call MatZeroEntries(A_mat,ierr)
>>>>
>>>>   do k=ksta_p+1,kend_p   !for cells belonging to myid
>>>>       do kk=1,5
>>>>           II=k-1                  !0-based global row index
>>>>           JJ=int_A(k,kk)-1        !0-based global column index
>>>>           call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
>>>>       end do
>>>>   end do
>>>>
>>>>   call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>>>>
>>>>   call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>>>>
>>>>
>>>> I wonder if the problem lies here. I used the big_A matrix because I
>>>> was migrating from an old linear solver. Lastly, I was told to widen my
>>>> window to 120 characters. May I know how to do that?
>>>>
>>>>
>>>>
>>>> Thank you very much.
>>>>
>>>> Matthew Knepley wrote:
>>>>
>>>>
>>>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi Matthew,
>>>>>>
>>>>>> I think you've misunderstood what I meant. What I'm trying to say is
>>>>>> that initially I had a serial code. I tried to convert it to a parallel
>>>>>> one. Then I tested it and it was pretty slow. Due to some work
>>>>>> requirement, I needed to go back and make some changes to my code.
>>>>>> Since the parallel version was not working well, I updated and changed
>>>>>> the serial one.
>>>>>>
>>>>>> Well, that was a while ago and now, due to the updates and changes, the
>>>>>> serial code is different from the old converted parallel code. Some
>>>>>> files were also deleted and I can't seem to get it working now. So I
>>>>>> thought I might as well convert the new serial code to parallel. But
>>>>>> I'm not very sure what I should do first.
>>>>>>
>>>>>> Maybe I should rephrase my question: if I just convert my Poisson
>>>>>> equation subroutine from a serial PETSc to a parallel PETSc version,
>>>>>> will it work? Should I expect a speedup? The rest of my code is still
>>>>>> serial.
>>>>>>
>>>>>>
>>>>>>
>>>>> You should, of course, only expect speedup in the parallel parts.
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Matthew Knepley wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I am not sure why you would ever have two codes. I never do this.
>>>>>>> PETSc is designed so that you write one code that runs in serial and
>>>>>>> parallel. The PETSc part should look identical. To test, run the code
>>>>>>> you have verified in serial and output PETSc data structures (like Mat
>>>>>>> and Vec) using a binary viewer. Then run in parallel with the same
>>>>>>> code, which will output the same structures. Take the two files and
>>>>>>> write a small verification code that loads both versions and calls
>>>>>>> MatEqual and VecEqual.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Thank you Matthew. Sorry to trouble you again.
>>>>>>>>
>>>>>>>> I tried to run it with -log_summary output and I found that there
>>>>>>>> are some errors in the execution. Well, I was busy with other things
>>>>>>>> and I just came back to this problem. Some of my files on the server
>>>>>>>> have also been deleted. It has been a while and I remember that it
>>>>>>>> worked before, only much slower.
>>>>>>>>
>>>>>>>> Anyway, most of the serial code has been updated and maybe it's
>>>>>>>> easier to convert the new serial code instead of debugging the old
>>>>>>>> parallel code now. I believe I can still reuse part of the old
>>>>>>>> parallel code. However, I hope I can approach it better this time.
>>>>>>>>
>>>>>>>> So suppose I need to start converting my new serial code to
>>>>>>>> parallel. There are 2 eqns to be solved using PETSc, the momentum
>>>>>>>> and Poisson. I also need to parallelize other parts of my code. I
>>>>>>>> wonder which route is the best:
>>>>>>>>
>>>>>>>> 1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF,
>>>>>>>> and modify other parts of my code to parallel, e.g. looping, updating
>>>>>>>> of values etc. Once the execution is fine and the speedup is
>>>>>>>> reasonable, then modify the PETSc part - Poisson eqn first, followed
>>>>>>>> by the momentum eqn.
>>>>>>>>
>>>>>>>> 2. Reverse the above order, i.e. modify the PETSc part - Poisson eqn
>>>>>>>> first, followed by the momentum eqn. Then do the other parts of my
>>>>>>>> code.
>>>>>>>>
>>>>>>>> I'm not sure if the above 2 methods can work or if there will be
>>>>>>>> conflicts. Of course, an alternative will be:
>>>>>>>>
>>>>>>>> 3. Do the Poisson eqn, the momentum eqn and the other parts of the
>>>>>>>> code separately. That is, code a standalone parallel Poisson eqn and
>>>>>>>> use sample values to test it. Same for the momentum eqn and the other
>>>>>>>> parts of the code. When each of them is working, combine them to form
>>>>>>>> the full parallel code. However, this will be much more troublesome.
>>>>>>>>
>>>>>>>> I hope someone can give me some recommendations.
>>>>>>>>
>>>>>>>> Thank you once again.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Matthew Knepley wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> 1) There is no way to have any idea what is going on in your code
>>>>>>>>> without -log_summary output
>>>>>>>>>
>>>>>>>>> 2) Looking at that output, look at the percentage taken by the solver
>>>>>>>>> KSPSolve event. I suspect it is not the biggest component, because
>>>>>>>>> it is very scalable.
>>>>>>>>>
>>>>>>>>> Matt
>>>>>>>>>
>>>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've a serial 2D CFD code. As my grid size requirement increases,
>>>>>>>>>> the simulation takes longer. Also, memory requirement becomes a
>>>>>>>>>> problem. The grid size has reached 1200x1200. Going higher is not
>>>>>>>>>> possible due to the memory problem.
>>>>>>>>>>
>>>>>>>>>> I tried to convert my code to a parallel one, following the
>>>>>>>>>> examples given. I also need to restructure parts of my code to
>>>>>>>>>> enable parallel looping. I first changed the PETSc solver to be
>>>>>>>>>> parallel enabled and then I restructured parts of my code. I
>>>>>>>>>> proceeded as long as the answer for a simple test case was correct.
>>>>>>>>>> I thought it was not really possible to do any speed testing since
>>>>>>>>>> the code was not fully parallelized yet. When I had finished most
>>>>>>>>>> of the conversion, I found that the actual run is much slower,
>>>>>>>>>> although the answer is correct.
>>>>>>>>>>
>>>>>>>>>> So what is the remedy now? I wonder what I should do to check
>>>>>>>>>> what's wrong. Must I restart everything again? Btw, my grid size is
>>>>>>>>>> 1200x1200. I believe it should be suitable for a parallel run on 4
>>>>>>>>>> processors? Is that so?
>>>>>>>>>>
>>>>>>>>>> Thank you.