Slow speed after changing from serial to parallel
Ben Tay
zonexo at gmail.com
Tue Apr 15 10:56:52 CDT 2008
Oh sorry, here's the full information. I'm currently using 2 processors:
************************************************************************************************************************
***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document                ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007
HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
Max Max/Min Avg Total
Time (sec): 1.114e+03 1.00054 1.114e+03
Objects: 5.400e+01 1.00000 5.400e+01
Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11
Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08
MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04
MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07
MPI Reductions: 8.644e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 100.0%  4.800e+03      100.0%  1.729e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event                Count      Time (sec)     Flops/sec                        --- Global ---  --- Stage ---   Total
                     Max Ratio  Max     Ratio  Max   Ratio  Mess  Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217
MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0
MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317
PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42
VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636
VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0
VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 4 4 49227380 0
Krylov Solver 2 2 17216 0
Preconditioner 2 2 256 0
Index Set 5 5 2596120 0
Vec 40 40 62243224 0
Vec Scatter 1 1 0 0
========================================================================================================================
Average time to get PetscTime(): 4.05312e-07
Average time for MPI_Barrier(): 7.62939e-07
Average time for zero size MPI_Send(): 2.02656e-06
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8
--sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8
--sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4
--sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0
--with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0
--with-batch=1 --with-mpi-shared=0
--with-mpi-include=/usr/local/topspin/mpi/mpich/include
--with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a
--with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun
--with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
-----------------------------------------
Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul
12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
Using PETSc arch: atlas3-mpi
-----------------------------------------
Using C compiler: mpicc -fPIC -O
Using Fortran compiler: mpif90 -I. -fPIC -O
-----------------------------------------
Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8
-I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi
-I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include
-I/home/enduser/g0306332/lib/hypre/include
-I/usr/local/topspin/mpi/mpich/include
------------------------------------------
Using C linker: mpicc -fPIC -O
Using Fortran linker: mpif90 -I. -fPIC -O
Using libraries:
-Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi
-L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts
-lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
-Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib
-L/home/enduser/g0306332/lib/hypre/lib -lHYPRE
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
-Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-Wl,-rpath,/usr/local/topspin/mpi/mpich/lib
-L/usr/local/topspin/mpi/mpich/lib -lmpich
-Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t
-L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide
-lpthread -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib
-ldl -lmpich -libverbs -libumad -lpthread -lrt
-Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
-L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
-Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib
-lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64
-Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib
-Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib
-ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64
-libverbs -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib
-L/opt/intel/cce/9.1.049/lib
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
-L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
-L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
------------------------------------------
1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (28major+153248minor)pagefaults 0swaps
387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (18major+158175minor)pagefaults 0swaps
Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
TID   HOST_NAME   COMMAND_LINE      STATUS   TERMINATION_TIME
===== =========== ================  =======  ===================
00000 atlas3-c05  time ./a.out -lo  Done     04/15/2008 23:03:10
00001 atlas3-c05  time ./a.out -lo  Done     04/15/2008 23:03:10
I have a cartesian grid of 600x720. Since there are 2 processors, it is partitioned into 600x360 each. I just use:
call MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr)
call MatSetFromOptions(A_mat,ierr)
call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)
call KSPCreate(MPI_COMM_WORLD,ksp,ierr)
call VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr)
total_k is actually size_x*size_y. Since it's 2D, the maximum number of values per row is 5. When you say setting off-process values, do you mean I insert values from one processor into another? I thought I inserted the values into the correct processor...
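Just to check my understanding, here is a minimal sketch (not my actual code) of what I think the setup should look like if I give PETSc the local sizes explicitly, so that its row partition matches my 600x360 split, and only insert rows owned by each processor. The names m_local, jsta and jend (the j-range owned by a processor) are made up for this illustration; total_k, big_A, int_A, ksta_p and kend_p are the same as above:

! sketch only: m_local, jsta and jend are assumed names, not from my code
! preallocate 5 nonzeros in the diagonal block and (I think) at most 1 in the
! off-diagonal block for the 5-point stencil with this row-block partition
m_local = size_x*(jend - jsta + 1)   ! e.g. 600*360 rows on each of the 2 processors

call MatCreateMPIAIJ(MPI_COMM_WORLD,m_local,m_local,total_k,total_k,5,PETSC_NULL_INTEGER,1,PETSC_NULL_INTEGER,A_mat,ierr)
call VecCreateMPI(MPI_COMM_WORLD,m_local,total_k,b_rhs,ierr)
call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr)

do k = ksta_p+1, kend_p              ! only rows owned by this processor (1-based loop)
   II = k - 1                        ! PETSc row/column indices are 0-based
   do kk = 1, 5
      JJ = int_A(k,kk) - 1
      call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
   end do
end do
call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)

If that matches what I already do, then I guess the large MatAssemblyBegin ratio might just mean the two processors reach the assembly at very different times rather than that values are being shipped across. Please correct me if I've got this wrong.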
Thank you very much!
Matthew Knepley wrote:
> 1) Please never cut out parts of the summary. All the information is valuable,
> and most times, necessary
>
> 2) You seem to have huge load imbalance (look at VecNorm). Do you partition
> the system yourself? How many processes is this?
>
> 3) You seem to be setting a huge number of off-process values in the matrix
> (see MatAssemblyBegin). Is this true? I would reorganize this part.
>
> Matt
>
> On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote:
>
>> Hi,
>>
>> I have converted the poisson eqn part of the CFD code to parallel. The grid
>> size tested is 600x720. For the momentum eqn, I used another serial linear
>> solver (nspcg) to prevent mixing of results. Here's the output summary:
>>
>> --- Event Stage 0: Main Stage
>>
>> MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100 0 10 11100100 0 217
>> MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11 0 0 0 17 11 0 0 0 120
>> MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 140
>> MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0*
>> MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72 0 0 49 50 72 0 0 49 363
>> KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100 89100100100100 317
>> PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
>> PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00 0 0 0 0 0 0 0 0 0 0 69
>> PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11 0 0 0 18 11 0 0 0 114
>> VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36 0 0 49 35 36 0 0 49 213
>> *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 8.8e+03 9 2 0 0 51 9 2 0 0 51 42*
>> *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 636*
>> VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
>> VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38 0 0 0 16 38 0 0 0 453
>> VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00 0 0100100 0 0 0100100 0 0*
>> *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0*
>> *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03 9 4 0 0 51 9 4 0 0 51 62*
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Memory usage is given in bytes:
>> Object Type Creations Destructions Memory Descendants' Mem.
>> --- Event Stage 0: Main Stage
>> Matrix 4 4 49227380 0
>> Krylov Solver 2 2 17216 0
>> Preconditioner 2 2 256 0
>> Index Set 5 5 2596120 0
>> Vec 40 40 62243224 0
>> Vec Scatter 1 1 0 0
>> ========================================================================================================================
>> Average time to get PetscTime(): 4.05312e-07
>> Average time for MPI_Barrier(): 7.62939e-07
>> Average time for zero size MPI_Send(): 2.02656e-06
>> OptionTable: -log_summary
>>
>>
>> The PETSc manual states that the ratio should be close to 1. There are quite a
>> few *(in bold)* which are >1 and MatAssemblyBegin seems to be very big. So
>> what could be the cause?
>>
>> I wonder if it has to do the way I insert the matrix. My steps are:
>> (cartesian grids, i loop faster than j, fortran)
>>
>> For matrix A and rhs
>>
>> Insert left extreme cells values belonging to myid
>>
>> if (myid==0) then
>>
>> insert corner cells values
>>
>> insert south cells values
>>
>> insert internal cells values
>>
>> else if (myid==num_procs-1) then
>>
>> insert corner cells values
>>
>> insert north cells values
>>
>> insert internal cells values
>>
>> else
>>
>> insert internal cells values
>>
>> end if
>>
>> Insert right extreme cells values belonging to myid
>>
>> All these values are entered into a big_A(size_x*size_y,5) matrix. int_A
>> stores the position of the values. I then do
>>
>> call MatZeroEntries(A_mat,ierr)
>>
>> do k=ksta_p+1,kend_p !for cells belonging to myid
>>
>> do kk=1,5
>>
>> II=k-1
>>
>> JJ=int_A(k,kk)-1
>>
>> call MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr)
>> end do
>>
>> end do
>>
>> call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>>
>> call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr)
>>
>>
>> I wonder if the problem lies here. I used the big_A matrix because I was
>> migrating from an old linear solver. Lastly, I was told to widen my window
>> to 120 characters. May I know how do I do it?
>>
>>
>>
>> Thank you very much.
>>
>> Matthew Knepley wrote:
>>
>>> On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>
>>>> Hi Matthew,
>>>>
>>>> I think you've misunderstood what I meant. What I'm trying to say is that
>>>> initially I had a serial code. I tried to convert it to a parallel one. Then
>>>> I tested it and it was pretty slow. Due to some work requirement, I needed to
>>>> go back to make some changes to my code. Since the parallel one was not
>>>> working well, I updated and changed the serial one.
>>>>
>>>> Well, that was a while ago and now, due to the updates and changes, the
>>>> serial code is different from the old converted parallel code. Some files
>>>> were also deleted and I can't seem to get it working now. So I thought I
>>>> might as well convert the new serial code to parallel. But I'm not very sure
>>>> what I should do 1st.
>>>>
>>>> Maybe I should rephrase my question: if I just convert my poisson equation
>>>> subroutine from a serial PETSc to a parallel PETSc version, will it work?
>>>> Should I expect a speedup? The rest of my code is still serial.
>>>>
>>> You should, of course, only expect speedup in the parallel parts.
>>>
>>>   Matt
>>>
>>>> Thank you very much.
>>>>
>>>> Matthew Knepley wrote:
>>>>
>>>>> I am not sure why you would ever have two codes. I never do this. PETSc
>>>>> is designed to write one code to run in serial and parallel. The PETSc part
>>>>> should look identical. To test, run the code you have verified in serial and
>>>>> output PETSc data structures (like Mat and Vec) using a binary viewer.
>>>>> Then run in parallel with the same code, which will output the same
>>>>> structures. Take the two files and write a small verification code that
>>>>> loads both versions and calls MatEqual and VecEqual.
>>>>>
>>>>>   Matt
>>>>>
>>>>> On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>>
>>>>>> Thank you Matthew. Sorry to trouble you again.
>>>>>>
>>>>>> I tried to run it with -log_summary output and I found that there are some
>>>>>> errors in the execution. Well, I was busy with other things and I just came
>>>>>> back to this problem. Some of my files on the server have also been deleted.
>>>>>> It has been a while and I remember that it worked before, only much slower.
>>>>>>
>>>>>> Anyway, most of the serial code has been updated and maybe it's easier to
>>>>>> convert the new serial code instead of debugging the old parallel code now.
>>>>>> I believe I can still reuse part of the old parallel code. However, I hope I
>>>>>> can approach it better this time.
>>>>>>
>>>>>> So suppose I need to start converting my new serial code to parallel.
>>>>>> There are 2 eqns to be solved using PETSc, the momentum and poisson. I also
>>>>>> need to parallelize other parts of my code. I wonder which route is the
>>>>>> best:
>>>>>>
>>>>>> 1. Don't change the PETSc part, ie continue using PETSC_COMM_SELF, and
>>>>>> modify other parts of my code to parallel, e.g. looping, updating of values
>>>>>> etc. Once the execution is fine and speedup is reasonable, then modify the
>>>>>> PETSc part - poisson eqn 1st followed by the momentum eqn.
>>>>>>
>>>>>> 2. Reverse the above order, ie modify the PETSc part - poisson eqn 1st
>>>>>> followed by the momentum eqn. Then do other parts of my code.
>>>>>>
>>>>>> I'm not sure if the above 2 methods can work or if there will be conflicts.
>>>>>> Of course, an alternative will be:
>>>>>>
>>>>>> 3. Do the poisson, momentum eqns and other parts of the code separately.
>>>>>> That is, code a standalone parallel poisson eqn and use sample values to
>>>>>> test it. Same for the momentum and other parts of the code. When each of
>>>>>> them is working, combine them to form the full parallel code. However, this
>>>>>> will be much more troublesome.
>>>>>>
>>>>>> I hope someone can give me some recommendations.
>>>>>>
>>>>>> Thank you once again.
>>>>>>
>>>>>> Matthew Knepley wrote:
>>>>>>
>>>>>>> 1) There is no way to have any idea what is going on in your code
>>>>>>> without -log_summary output.
>>>>>>>
>>>>>>> 2) Looking at that output, look at the percentage taken by the solver
>>>>>>> KSPSolve event. I suspect it is not the biggest component, because it is
>>>>>>> very scalable.
>>>>>>>
>>>>>>>   Matt
>>>>>>>
>>>>>>> On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay <zonexo at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've a serial 2D CFD code. As my grid size requirement increases, the
>>>>>>>> simulation takes longer. Also, memory requirement becomes a problem. Grid
>>>>>>>> size has reached 1200x1200. Going higher is not possible due to memory
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> I tried to convert my code to a parallel one, following the examples
>>>>>>>> given. I also need to restructure parts of my code to enable parallel
>>>>>>>> looping. I 1st changed the PETSc solver to be parallel enabled and then I
>>>>>>>> restructured parts of my code. I proceeded as long as the answer for a
>>>>>>>> simple test case was correct. I thought it's not really possible to do any
>>>>>>>> speed testing since the code is not fully parallelized yet. When I had
>>>>>>>> finished most of the conversion, I found in the actual run that it is much
>>>>>>>> slower, although the answer is correct.
>>>>>>>>
>>>>>>>> So what is the remedy now? I wonder what I should do to check what's
>>>>>>>> wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I
>>>>>>>> believe it should be suitable for a parallel run on 4 processors? Is that
>>>>>>>> so?
>>>>>>>>
>>>>>>>> Thank you.