[petsc-users] Read in sequential, solve in parallel

Moinier, Pierre (UK) Pierre.Moinier at baesystems.com
Wed Sep 29 07:51:55 CDT 2010


Jed,

The matrix is 1000000x1000000 and I have 4996000 nonzeros.
Here is the output for a single proc:

 -bash-3.2$ cat  petsc.sub.o5134
Reading matrix completed.
Reading RHS completed.
Solving ... 
44.827695 seconds elapsed
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./mm2petsc2 on a linux-gnu named comp01 with 1 processor, by moinier Wed Sep 29 13:44:03 2010
Using Petsc Release Version 3.1.0, Patch 3, Fri Jun  4 15:34:52 CDT 2010

                         Max       Max/Min        Avg      Total 
Time (sec):           4.571e+01      1.00000   4.571e+01
Objects:              2.400e+01      1.00000   2.400e+01
Flops:                3.428e+10      1.00000   3.428e+10  3.428e+10
Flops/sec:            7.499e+08      1.00000   7.499e+08  7.499e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.700e+01      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 4.5715e+01 100.0%  3.4280e+10 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  6.000e+00  22.2% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

MatMult             1633 1.0 1.6247e+01 1.0 1.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00 36 43  0  0  0  36 43  0  0  0   904
MatAssemblyBegin       1 1.0 1.3158e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  7   0  0  0  0 33     0
MatAssemblyEnd         1 1.0 5.1668e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0 11   0  0  0  0 50     0
MatLoad                1 1.0 7.9792e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  2  0  0  0 19   2  0  0  0 83     0
VecDot              3266 1.0 4.4834e+00 1.0 6.53e+09 1.0 0.0e+00 0.0e+00 0.0e+00 10 19  0  0  0  10 19  0  0  0  1457
VecNorm             1634 1.0 1.2968e+01 1.0 3.27e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 10  0  0  0  28 10  0  0  0   252
VecCopy             1636 1.0 2.9524e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
VecSet                 1 1.0 1.7080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY             3266 1.0 5.5580e+00 1.0 6.53e+09 1.0 0.0e+00 0.0e+00 0.0e+00 12 19  0  0  0  12 19  0  0  0  1175
VecAYPX             1632 1.0 2.5961e+00 1.0 3.26e+09 1.0 0.0e+00 0.0e+00 0.0e+00  6 10  0  0  0   6 10  0  0  0  1257
VecAssemblyBegin       1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecLoad                1 1.0 7.8766e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  4   0  0  0  0 17     0
VecScatterBegin     1633 1.0 1.3146e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSetup               1 1.0 8.7240e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 4.4828e+01 1.0 3.43e+10 1.0 0.0e+00 0.0e+00 0.0e+00 98100  0  0  0  98100  0  0  0   765
PCSetUp                1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             1634 1.0 2.9503e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     2              2         1136     0
              Matrix     7              3         6560     0
                 Vec    10              2         2696     0
         Vec Scatter     1              0            0     0
           Index Set     2              2         1056     0
       Krylov Solver     1              0            0     0
      Preconditioner     1              0            0     0
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
#PETSc Option Table entries:
-ksp_rtol 1.e-6
-ksp_type cg
-log_summary
-pc_type none
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Wed Sep  1 13:08:57 2010
Configure options: --known-level1-dcache-size=65536
--known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
--known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
--known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
--known-sizeof-long-long=8 --known-sizeof-float=4
--known-sizeof-double=8 --known-sizeof-size_t=8 --known-bits-per-byte=8
--known-sizeof-MPI_Comm=8 --known-sizeof-MPI_Fint=4
--known-mpi-long-double=1
--with-mpi-dir=/apps/utils/linux64/openmpi-1.4.1 --with-batch
--with-fc=0 --known-mpi-shared=1 --with-shared --with-debugging=0
COPTFLAGS=-O3 CXXOPTFLAGS=-O3 --with-clanguage=cxx
--download-c-blas-lapack=yes
-----------------------------------------
Libraries compiled on Wed Sep  1 13:09:50 BST 2010 on lnx102 
Machine characteristics: Linux lnx102 2.6.31.12-0.2-default #1 SMP
2010-03-16 21:25:39 +0100 x86_64 x86_64 x86_64 GNU/Linux 
Using PETSc directory: /home/atc/neilson/opt/petsc-3.1-p3
Using PETSc arch: linux-gnu-cxx-opt-lnx102
-----------------------------------------
Using C compiler: /apps/utils/linux64/openmpi-1.4.1/bin/mpicxx -Wall
-Wwrite-strings -Wno-strict-aliasing -O3   -fPIC   
Using Fortran compiler:    
-----------------------------------------
Using include paths:
-I/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/include
-I/home/atc/neilson/opt/petsc-3.1-p3/include
-I/apps/utils/linux64/openmpi-1.4.1/include  
------------------------------------------
Using C linker: /apps/utils/linux64/openmpi-1.4.1/bin/mpicxx -Wall
-Wwrite-strings -Wno-strict-aliasing -O3 
Using Fortran linker:  
Using libraries:
-Wl,-rpath,/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-L/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib -lpetsc -lX11
-Wl,-rpath,/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-L/home/atc/neilson/opt/petsc-3.1-p3/linux-gnu-cxx-opt-lnx102/lib
-lf2clapack -lf2cblas -lmpi_cxx -lstdc++ -ldl
------------------------------------------
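Since the thread subject is reading a sequentially written file and solving in parallel, here is a minimal sketch of that pattern for reference: the matrix and RHS are written once in PETSc binary format (MatView()/VecView() on a binary viewer), then MatLoad()/VecLoad() distribute them across the ranks of the communicator. This is only an illustration written against a recent PETSc API (PETSc 3.1, as used above, has slightly different signatures for some of these calls); the file name system.dat is made up and error checking is omitted for brevity.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         b, x;
  KSP         ksp;
  PetscViewer viewer;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Every rank opens the same binary file; PETSc distributes the rows. */
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system.dat", FILE_MODE_READ, &viewer);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetFromOptions(A);              /* default AIJ; MPIAIJ on more than one rank */
  MatLoad(A, viewer);

  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetFromOptions(b);
  VecLoad(b, viewer);                /* RHS stored in the same file after the matrix */
  PetscViewerDestroy(&viewer);

  VecDuplicate(b, &x);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);            /* picks up -ksp_type, -ksp_rtol, -pc_type */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

The same options used above (-ksp_type cg -ksp_rtol 1.e-6 -pc_type none -log_summary) apply unchanged when such a driver is launched under mpiexec on several ranks.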


-----Original Message-----
From: petsc-users-bounces at mcs.anl.gov
[mailto:petsc-users-bounces at mcs.anl.gov] On Behalf Of Jed Brown
Sent: 29 September 2010 13:40
To: PETSc users list
Subject: Re: [petsc-users] Read in sequential, solve in parallel

On Wed, Sep 29, 2010 at 14:34, Moinier, Pierre (UK)
<Pierre.Moinier at baesystems.com> wrote:
> Jed,
>
> Thanks for your help and thanks also to all of the others who have
> replied! I made some progress and wrote a new code that runs in
> parallel. However, the results seem to show that the time required to
> solve the linear system is the same whether I use 1, 2 or 4
> processors... Surely I am missing something. I copied the code below.
> For info, I run the executable as: ./test -ksp_type cg -ksp_rtol 1.e-6
> -pc_type none

How big is the matrix (dimensions and number of nonzeros)?  Run with
-log_summary and send the output.  This problem is mostly memory-bandwidth
limited, and a single core can saturate most of the memory bus for a whole
socket on most architectures.  If you are interested in time to solution,
you almost certainly want to use a preconditioner.  Sometimes these do more
work per byte, so you may be able to see more speedup without adding
sockets.

Jed
