[petsc-users] Read in sequential, solve in parallel
Moinier, Pierre (UK)
Pierre.Moinier at baesystems.com
Wed Sep 29 07:51:55 CDT 2010
Jed,
The matrix is 1000000 x 1000000 with 4996000 nonzeros.
Here is the output for a single proc:
-bash-3.2$ cat petsc.sub.o5134
Reading matrix completed.
Reading RHS completed.
Solving ...
44.827695 seconds elapsed
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./mm2petsc2 on a linux-gnu named comp01 with 1 processor, by moinier Wed Sep 29 13:44:03 2010
Using Petsc Release Version 3.1.0, Patch 3, Fri Jun  4 15:34:52 CDT 2010

                         Max       Max/Min        Avg      Total
Time (sec):           4.571e+01      1.00000   4.571e+01
Objects:              2.400e+01      1.00000   2.400e+01
Flops:                3.428e+10      1.00000   3.428e+10  3.428e+10
Flops/sec:            7.499e+08      1.00000   7.499e+08  7.499e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.700e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.5715e+01 100.0%  3.4280e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  6.000e+00  22.2%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult             1633 1.0 1.6247e+01 1.0 1.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00 36 43  0  0  0  36 43  0  0  0   904
MatAssemblyBegin       1 1.0 1.3158e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  7   0  0  0  0 33     0
MatAssemblyEnd         1 1.0 5.1668e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0 11   0  0  0  0 50     0
MatLoad                1 1.0 7.9792e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  2  0  0  0 19   2  0  0  0 83     0
VecDot              3266 1.0 4.4834e+00 1.0 6.53e+09 1.0 0.0e+00 0.0e+00 0.0e+00 10 19  0  0  0  10 19  0  0  0  1457
VecNorm             1634 1.0 1.2968e+01 1.0 3.27e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 10  0  0  0  28 10  0  0  0   252
VecCopy             1636 1.0 2.9524e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
VecSet                 1 1.0 1.7080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY             3266 1.0 5.5580e+00 1.0 6.53e+09 1.0 0.0e+00 0.0e+00 0.0e+00 12 19  0  0  0  12 19  0  0  0  1175
VecAYPX             1632 1.0 2.5961e+00 1.0 3.26e+09 1.0 0.0e+00 0.0e+00 0.0e+00  6 10  0  0  0   6 10  0  0  0  1257
VecAssemblyBegin       1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecLoad                1 1.0 7.8766e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  4   0  0  0  0 17     0
VecScatterBegin     1633 1.0 1.3146e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSetup               1 1.0 8.7240e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 4.4828e+01 1.0 3.43e+10 1.0 0.0e+00 0.0e+00 0.0e+00 98100  0  0  0  98100  0  0  0   765
PCSetUp                1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             1634 1.0 2.9503e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type          Creations   Destructions   Memory  Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Viewer 2 2 1136 0
Matrix 7 3 6560 0
Vec 10 2 2696 0
Vec Scatter 1 0 0 0
Index Set 2 2 1056 0
Krylov Solver 1 0 0 0
Preconditioner 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
#PETSc Option Table entries:
-ksp_rtol 1.e-6
-ksp_type cg
-log_summary
-pc_type none
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Wed Sep 1 13:08:57 2010
Configure options: --known-level1-dcache-size=65536
--known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
--known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
--known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
--known-sizeof-long-long=8 --known-sizeof-float=4
--known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8
--known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4
--known-mpi-long-double=1
--with-mpi-dir=/apps/utils/linux64/openmpi-1.4.1 --with-batch
--with-fc=0 --known-mpi-shared=1 --with-shared --with-debugging=0
COPTFLAGS=-O3 CXXOPTFLAGS=-O3 --with-clanguage=cxx
--download-c-blas-lapack=yes
-----------------------------------------
Libraries compiled on Wed Sep 1 13:09:50 BST 2010 on lnx102
Machine characteristics: Linux lnx102 2.6.31.12-0.2-default #1 SMP
2010-03-16 21:25:39 +0100 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /home/atc/neilson/opt/petsc-3.1-p3
Using PETSc arch: linux-gnu-cxx-opt-lnx102
-----------------------------------------
Using C compiler: /apps/utils/linux64/openmpi-1.4.1/bin/mpicxx -Wall
-Wwrite-strings -Wno-strict-aliasing -O3 -fPIC
Using Fortran compiler:
-----------------------------------------
Using include paths:
-I/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/include
-I/home/atc/neilson/opt/petsc-3.1-p3/include
-I/apps/utils/linux64/openmpi-1.4.1/include
------------------------------------------
Using C linker: /apps/utils/linux64/openmpi-1.4.1/bin/mpicxx -Wall
-Wwrite-strings -Wno-strict-aliasing -O3
Using Fortran linker:
Using libraries:
-Wl,-rpath,/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-L/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib -lpetsc -lX11
-Wl,-rpath,/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-L/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-lf2clapack -lf2cblas -lmpi_cxx -lstdc++ -ldl
------------------------------------------
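
For reference, below is a minimal sketch of the kind of driver that produces a log like the one above, written against the PETSc 3.1 interfaces used in this thread. It is an illustration only, not the actual mm2petsc2 source, and the file names "matrix.dat" and "rhs.dat" are placeholders. The point is that MatLoad()/VecLoad() on PETSC_COMM_WORLD read the binary files on one process and distribute the rows across however many processes the job is started with, so the same executable covers the sequential-read/parallel-solve case.

/* Minimal sketch (PETSc 3.1 API): load A and b from PETSc binary files and solve Ax = b. */
#include "petscksp.h"

static char help[] = "Load A and b from binary files and solve Ax = b.\n";

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscViewer    viewer;
  Mat            A;
  Vec            b, x;
  KSP            ksp;
  PetscInt       its;

  ierr = PetscInitialize(&argc, &argv, (char*)0, help); CHKERRQ(ierr);

  /* Read the matrix; MATAIJ resolves to SeqAIJ or MPIAIJ depending on the number of processes. */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = MatLoad(viewer, MATAIJ, &A); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(viewer); CHKERRQ(ierr);

  /* Read the right-hand side; PETSC_NULL selects the default Vec type for this communicator. */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "rhs.dat", FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = VecLoad(viewer, PETSC_NULL, &b); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(viewer); CHKERRQ(ierr);

  ierr = VecDuplicate(b, &x); CHKERRQ(ierr);

  /* Solver is configured entirely from the command line (-ksp_type, -ksp_rtol, -pc_type, ...). */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
  ierr = KSPGetIterationNumber(ksp, &its); CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "Converged in %D iterations\n", its); CHKERRQ(ierr);

  ierr = KSPDestroy(ksp); CHKERRQ(ierr);
  ierr = MatDestroy(A); CHKERRQ(ierr);
  ierr = VecDestroy(b); CHKERRQ(ierr);
  ierr = VecDestroy(x); CHKERRQ(ierr);
  ierr = PetscFinalize(); CHKERRQ(ierr);
  return 0;
}

With KSPSetFromOptions() the run-time options reported in the option table above (-ksp_type cg -ksp_rtol 1.e-6 -pc_type none -log_summary) apply unchanged, whether the executable is run on one process or with mpiexec on several.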
-----Original Message-----
From: petsc-users-bounces at mcs.anl.gov
[mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Jed Brown
Sent: 29 September 2010 13:40
To: PETSc users list
Subject: Re: [petsc-users] Read in sequential, solve in parallel
On Wed, Sep 29, 2010 at 14:34, Moinier, Pierre (UK)
<Pierre.Moinier at baesystems.com> wrote:
> Jed,
>
> Thanks for your help and thanks also to all of the others who have
> replied! I made some progress and wrote a new code that runs in
> parallel. However, the results seem to show that the time required to
> solve the linear system is the same whether I use 1, 2 or 4
> processors... Surely I am missing something. I copied the code below.
> For info, I run the executable as: ./test -ksp_type cg -ksp_rtol 1.e-6
> -pc_type none
How big is the matrix (dimensions and number of nonzeros)? Run with
-log_summary and send the output. This problem is mostly memory-bandwidth
limited, and on most architectures a single core can saturate most of the
memory bus for a whole socket. If you are interested in time to solution,
you almost certainly want to use a preconditioner. Some of these do more
work per byte moved, so you may see more speedup without adding sockets.
Jed
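
As one concrete illustration of that last suggestion (the options here are chosen purely for illustration, not tuned to this particular matrix), block Jacobi with an incomplete Cholesky sub-solve keeps the preconditioned operator symmetric, so it can still be used with CG:

  mpiexec -n 4 ./test -ksp_type cg -ksp_rtol 1.e-6 -pc_type bjacobi -sub_pc_type icc -log_summary

Comparing the resulting -log_summary with the -pc_type none run shows whether the extra work per matrix entry actually buys a shorter time to solution.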