<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi Matthew,<br>
<br>
You mention that the unbalanced events take 0.01% of the time and that
the speedup is terrible. Where did you get this information? Are you
referring to the Global %T column? As for the speedup, are you looking
at the time reported by the "time" command, i.e.
<pre wrap="">63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata
0maxresident)?
</pre>
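For reference, a rough speedup check using the KSPSolve times from the
-log_summary output below (the iteration counts differ: 1153, 1177 and
937, so it is only approximate) gives:<br>
<pre wrap="">1 processor : KSPSolve 1.2159e+02 s
2 processors: KSPSolve 1.0289e+02 s   speedup ~ 1.18
4 processors: KSPSolve 6.2496e+01 s   speedup ~ 1.95
</pre>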
I think you may be right. My school uses the following cluster:<br>
<br>
<table id="table1" border="1" cellpadding="4">
<tbody>
<tr>
<td colspan="5">
<p align="justify">The Supercomputing &amp; Visualisation Unit,
Computer Centre is pleased to announce the addition of a new cluster of
Linux-based compute servers, consisting of a total of 64 servers (60
dual-core and 4 quad-core systems). Each of the compute nodes in the
cluster is equipped with the following configuration:</p>
</td>
</tr>
<tr>
<th>No. of nodes</th>
<th>Processors</th>
<th>Qty per node</th>
<th>Total cores per node</th>
<th>Memory per node</th>
</tr>
<tr>
<td>4</td>
<td>Quad-Core Intel Xeon X5355</td>
<td>2</td>
<td>8</td>
<td>16 GB</td>
</tr>
<tr>
<td>60</td>
<td>Dual-Core Intel Xeon 5160</td>
<td>2</td>
<td>4</td>
<td>8 GB</td>
</tr>
</tbody>
</table>
<br>
When I run on 2 processors, it states that I'm running on 2*atlas3-c45.
Does that mean the two processes are sharing one node's memory
bandwidth? And if I run on 4 processors, is that equivalent to using 2
memory pipes?<br>
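<br>
To check this myself, I suppose I could compile and submit a small test
like the one below (just a rough sketch, assuming the cluster's MPI
Fortran wrapper; I have not tried it yet) that prints which host each
rank actually lands on:<br>
<pre wrap="">      program rankhost
c     Minimal MPI sketch: print the host name seen by each rank,
c     to see whether ranks end up sharing a node (and therefore
c     that node's memory bandwidth).
      implicit none
      include 'mpif.h'
      integer ierr, rank, namelen
      character*(MPI_MAX_PROCESSOR_NAME) hname
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Get_processor_name(hname, namelen, ierr)
      print *, 'rank ', rank, ' runs on ', hname(1:namelen)
      call MPI_Finalize(ierr)
      end
</pre>
If two ranks report the same host (e.g. atlas3-c45), they are indeed
sharing one node's memory bandwidth.<br>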
<br>
I also got a reply from my school's engineer:<br>
<br>
<font size="2"><font color="#0000ff" face="Arial">For queue
mcore_parallel, LSF will assign the compute nodes automatically. To
most of applications, running with 2*atlas3-c45 and 2*atlas3-c50 may be
faster. However, it is not sure if 2*atlas3-c45 means to run the job
within one CPU on dual core, or with two CPUs on two separate cores.
This is not controllable.<br>
<br>
</font></font>So what can I do on my side to ensure a proper speedup? I
hope I do not have to switch from PETSc to another solver.<br>
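<br>
From what I understand of LSF, a span resource requirement is the usual
way to control how many processes land on each node. I am not sure
whether the mcore_parallel queue honours it with the mvapich_wrapper,
so this is only a guess to check with the engineer, e.g. asking for one
process per node:<br>
<pre wrap="">#BSUB -q mcore_parallel
#BSUB -n 4
#BSUB -R "span[ptile=1]"
/usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
</pre>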
<br>
Thanks a lot!<br>
<br>
Matthew Knepley wrote:
<blockquote
cite="mid:a9f269830804151920n3ede1433rdb231a2f7a88890d@mail.gmail.com"
type="cite">
<pre wrap="">On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay <a class="moz-txt-link-rfc2396E" href="mailto:zonexo@gmail.com"><zonexo@gmail.com></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap=""> Hi,
I just tested the ex2f.F example, changing m and n to 600. Here's the
result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin,
MatGetOrdering and KSPSetup have ratios >>1. The time taken seems to be
faster as the processor increases, although speedup is not 1:1. I thought
that this example should scale well, shouldn't it? Is there something wrong
with my installation then?
</pre>
</blockquote>
<pre wrap=""><!---->
1) Notice that the events that are unbalanced take 0.01% of the time.
Not important.
2) The speedup really stinks. Even though this is a small problem. Are
you sure that
you are actually running on two processors with separate memory
pipes and not
on 1 dual core?
Matt
</pre>
<blockquote type="cite">
<pre wrap=""> Thank you.
1 processor:
Norm of error 0.3371E+01 iterations 1153
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
-fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed
Apr 16 10:03:12 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
revision: 414581156e67e55c761739b0deb119f7590d0f4b
Max Max/Min Avg Total
Time (sec): 1.222e+02 1.00000 1.222e+02
Objects: 4.400e+01 1.00000 4.400e+01
Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10
Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 2.349e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flops
and VecAXPY() for complex vectors of length N
--> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0%
0.000e+00 0.0% 2.349e+03 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all
processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event Count Time (sec) Flops/sec
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00
0.0e+00 13 11 0 0 0 13 11 0 0 0 239
MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00
0.0e+00 25 11 0 0 0 25 11 0 0 0 124
MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 89
MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00
1.2e+03 27 36 0 0 49 27 36 0 0 49 392
VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00
1.2e+03 2 2 0 0 51 2 2 0 0 51 422
VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00
0.0e+00 1 1 0 0 0 1 1 0 0 0 621
VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 81
VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00
0.0e+00 31 38 0 0 0 31 38 0 0 0 363
VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00
1.2e+03 2 4 0 0 51 2 4 0 0 51 472
KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00
1.2e+03 56 72 0 0 49 56 72 0 0 49 376
KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00
2.3e+03100100 0 0100 100100 0 0100 292
PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 0 0 0 0 0 0 14
PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00
0.0e+00 25 11 0 0 0 25 11 0 0 0 124
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 2 2 54691212 0
Index Set 3 3 4321032 0
Vec 37 37 103708408 0
Krylov Solver 1 1 17216 0
Preconditioner 1 1 168 0
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8
--sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8
--sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4
--sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0
--with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0
--with-batch=1 --with-mpi-shared=0
--with-mpi-include=/usr/local/topspin/mpi/mpich/include
--with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a
--with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun
--with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
-----------------------------------------
Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12
23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
Using PETSc arch: atlas3-mpi
-----------------------------------------
85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (16major+46429minor)pagefaults 0swaps
Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
2 processors:
Norm of error 0.3231E+01 iterations 1177
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
-fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed
Apr 16 09:48:37 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
revision: 414581156e67e55c761739b0deb119f7590d0f4b
Max Max/Min Avg Total
Time (sec): 1.034e+02 1.00000 1.034e+02
Objects: 5.500e+01 1.00000 5.500e+01
Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10
Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08
MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03
MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07
MPI Reductions: 1.204e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flops
and VecAXPY() for complex vectors of length N
--> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0%
4.798e+03 100.0% 2.407e+03 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all
processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event Count Time (sec) Flops/sec
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03
0.0e+00 11 11100100 0 11 11100100 0 315
MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00
0.0e+00 19 11 0 0 0 19 11 0 0 0 187
MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 39
MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03
7.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00
1.2e+03 37 36 0 0 49 37 36 0 0 49 323
VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00
1.2e+03 12 2 0 0 51 12 2 0 0 51 57
VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00
0.0e+00 1 1 0 0 0 1 1 0 0 0 757
VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 272
VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00
0.0e+00 19 38 0 0 0 19 38 0 0 0 606
VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03
0.0e+00 0 0100100 0 0 0100100 0 0
VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00
1.2e+03 12 4 0 0 51 12 4 0 0 51 82
KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00
1.2e+03 55 72 0 0 49 55 72 0 0 49 457
KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03
2.4e+03 99100100100100 99100100100100 352
PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 0 0 0 0 0 0 21
PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 0 0 0 0 0 0 21
PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00
0.0e+00 20 11 0 0 0 20 11 0 0 0 174
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 4 4 34540820 0
Index Set 5 5 2164120 0
Vec 41 41 53315992 0
Vec Scatter 1 1 0 0
Krylov Solver 2 2 17216 0
Preconditioner 2 2 256 0
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
Average time for MPI_Barrier(): 8.10623e-07
Average time for zero size MPI_Send(): 2.98023e-06
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (18major+28609minor)pagefaults 0swaps
1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (18major+23666minor)pagefaults 0swaps
4 processors:
Norm of error 0.3090E+01 iterations 937
63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (16major+13520minor)pagefaults 0swaps
53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (15major+13414minor)pagefaults 0swaps
58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (17major+18383minor)pagefaults 0swaps
20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (14major+18392minor)pagefaults 0swaps
Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
-fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed
Apr 16 09:55:16 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
revision: 414581156e67e55c761739b0deb119f7590d0f4b
Max Max/Min Avg Total
Time (sec): 6.374e+01 1.00001 6.374e+01
Objects: 5.500e+01 1.00000 5.500e+01
Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10
Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08
MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03
MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07
MPI Reductions: 4.798e+02 1.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N
--> 2N flops
and VecAXPY() for complex vectors of length N
--> 8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
-- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0%
4.798e+03 100.0% 1.919e+03 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all
processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this
phase
%M - percent messages in this phase %L - percent message lengths
in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event Count Time (sec) Flops/sec
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len
Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03
0.0e+00 8 11100100 0 8 11100100 0 321
MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00
0.0e+00 11 11 0 0 0 11 11 0 0 0 220
MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 62
MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
1.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03
7.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
2.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00
9.4e+02 48 36 0 0 49 48 36 0 0 49 292
VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00
9.7e+02 18 2 0 0 51 18 2 0 0 51 49
VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00
0.0e+00 0 1 0 0 0 0 1 0 0 0 2220
VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 2185
VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00
0.0e+00 11 38 0 0 0 11 38 0 0 0 747
VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03
0.0e+00 0 0100100 0 0 0100100 0 0
VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00
9.7e+02 18 4 0 0 50 18 4 0 0 50 72
KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00
9.4e+02 59 72 0 0 49 59 72 0 0 49 521
KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03
1.9e+03 98100100100 99 98100100100 99 461
PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 0 0 0 0 0 0 45
PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00
3.0e+00 0 0 0 0 0 0 0 0 0 0 45
PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00
0.0e+00 12 11 0 0 0 12 11 0 0 0 203
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
--- Event Stage 0: Main Stage
Matrix 4 4 17264420 0
Index Set 5 5 1084120 0
Vec 41 41 26675992 0
Vec Scatter 1 1 0 0
Krylov Solver 2 2 17216 0
Preconditioner 2 2 256 0
========================================================================================================================
Average time to get PetscTime(): 1.90735e-07
Average time for MPI_Barrier(): 6.00815e-06
Average time for zero size MPI_Send(): 5.42402e-05
OptionTable: -log_summary
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8
Configure run at: Tue Jan 8 22:22:08 2008
Matthew Knepley wrote:
The convergence here is just horrendous. Have you tried using LU to check
your implementation? All the time is in the solve right now. I would first
try a direct method (at least on a small problem) and then try to understand
the convergence behavior. MUMPS can actually scale very well for big
problems.
Matt
</pre>
</blockquote>
<pre wrap=""><!---->
</pre>
</blockquote>
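<br>
PS: Regarding the suggestion to check the convergence with a direct
solve first, I believe the relevant options on one processor (probably
with a smaller m and n, as suggested) would be something like the line
below. This is only my reading of the PETSc manual, not something I
have run yet, and the options for using MUMPS in parallel depend on the
PETSc version, so I will look those up separately:<br>
<pre wrap="">/usr/lsf62/bin/mvapich_wrapper time ./a.out -ksp_type preonly -pc_type lu -log_summary
</pre>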
</body>
</html>