<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Hi Matthew,<br>

<br>

You mention that the unbalanced events take 0.01% of the time and that
the speedup is terrible.
Where did you get this information? Are you referring to the Global %T
column? As for the speedup, do you look at the time reported by the
"time" command, i.e.

<pre wrap="">63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata

0maxresident)?

</pre>
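To put a number on the speedup, here is a rough sketch that takes the "Time (sec)" Max values from the three -log_summary outputs quoted further down (1.222e+02, 1.034e+02 and 6.374e+01 seconds for 1, 2 and 4 processors) and computes speedup and parallel efficiency relative to the 1-processor run. The function name speedup_table is only for illustration; it is not part of PETSc.<br>
<pre wrap="">
```python
# Wall-clock times copied from the "Time (sec)" Max column of the three
# -log_summary outputs in this thread (1, 2 and 4 processors).
times = {1: 1.222e+02, 2: 1.034e+02, 4: 6.374e+01}


def speedup_table(times):
    """Return {nprocs: (speedup, efficiency)} relative to the 1-process run."""
    base = times[1]
    table = {}
    for nprocs in sorted(times):
        speedup = base / times[nprocs]
        # Efficiency is the speedup divided by the number of processes.
        table[nprocs] = (speedup, speedup / nprocs)
    return table


for nprocs, (s, e) in speedup_table(times).items():
    print(f"{nprocs} procs: speedup {s:.2f}, efficiency {e:.0%}")
```
</pre>
With these numbers the speedup comes out at about 1.18 on 2 processors and 1.92 on 4, i.e. a parallel efficiency of roughly 59% and 48%.<br>
<br>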

I think you may be right about the dual core. My school describes the cluster as follows:<br>

<br>

<table id="table1" border="0" width="720">

  <tbody>

    <tr>

      <td colspan="5">

      <p align="justify"><font face="Arial">The Supercomputing &amp;

Visualisation Unit, Computer Centre is pleased to announce the addition

of a new cluster of Linux-based compute servers, consisting of a total

of 64 servers (60 dual-core and 4 quad-core systems).&nbsp; Each of the

compute nodes in the cluster is equipped with the following

configurations:<br>

&nbsp;</font></p>

      </td>

      <td width="32">&nbsp;</td>

    </tr>

    <tr>

      <td width="31">&nbsp;</td>

      <td bgcolor="#f7f7f7" valign="top" width="57"><b><i><font

 face="Arial">No of Nodes</font></i></b></td>

      <td bgcolor="#f7f7f7" valign="top" width="223"><font face="Arial"><b><i>Processors</i></b></font></td>

      <td bgcolor="#f7f7f7" valign="top" width="88"><i><b><font

 face="Arial">Qty per node</font></b></i></td>

      <td bgcolor="#f7f7f7" valign="top" width="129"><i><b><font

 face="Arial">Total cores per node</font></b></i></td>

      <td bgcolor="#f7f7f7" valign="top" width="130"><i><b><font

 face="Arial">Memory per node</font></b></i></td>

      <td width="32">&nbsp;</td>

    </tr>

    <tr>

      <td width="31">&nbsp;</td>

      <td bgcolor="#f7f7f7" valign="top" width="57"><font face="Arial">4</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="223"><font face="Arial">Quad-Core

Intel Xeon X5355</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="88"><font face="Arial">2</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="129"><font face="Arial">8</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="130"><font face="Arial">16

GB</font></td>

      <td width="32">&nbsp;</td>

    </tr>

    <tr>

      <td width="31">&nbsp;</td>

      <td bgcolor="#f7f7f7" valign="top" width="57"><font face="Arial">60

      </font></td>

      <td bgcolor="#f7f7f7" valign="top" width="223"><font face="Arial">Dual-Core

Intel Xeon 5160</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="88"><font face="Arial">2</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="129"><font face="Arial">4</font></td>

      <td bgcolor="#f7f7f7" valign="top" width="130"><font face="Arial">8

GB</font></td>

    </tr>

  </tbody>

</table>

<br>

When I run on 2 processors, it states that I'm running on 2*atlas3-c45.
Does that mean both processes are sharing the same memory bandwidth? And
if I run on 4 processors, is that equivalent to using 2 memory pipes?<br>

<br>

I also got a reply from my school's engineer:<br>

<br>

<font size="2"><font color="#0000ff" face="Arial">For queue
mcore_parallel, LSF will assign the compute nodes automatically. For
most applications, running with 2*atlas3-c45 and 2*atlas3-c50 may be
faster. However, it is not clear whether 2*atlas3-c45 means that the job
runs on the two cores of one CPU, or on two cores of two separate CPUs.
This is not controllable.<br>

<br>

</font></font>So what can I do on my side to ensure speedup? I hope I

do not have to switch from PETSc to other solvers.<br>

<br>
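One thing I can at least check on my side is how the logical CPUs on a node map to physical sockets. Below is a minimal sketch, assuming a Linux node whose /proc/cpuinfo lists "processor" and "physical id" fields; the SAMPLE text is a made-up excerpt for a node with two dual-core CPUs, not output from atlas3-c45, and cores_by_socket is just an illustrative name. It only shows the node layout; it does not tell me on which cores LSF actually places the MPI processes.<br>
<pre wrap="">
```python
from collections import defaultdict

# Hypothetical excerpt in /proc/cpuinfo format for a node with two
# dual-core CPUs (made up for illustration, not taken from atlas3-c45).
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 1
core id     : 0

processor   : 3
physical id : 1
core id     : 1
"""


def cores_by_socket(cpuinfo_text):
    """Group logical CPU numbers by the 'physical id' (socket) they sit on."""
    sockets = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "physical id" and current is not None:
            sockets[int(value)].append(current)
    return dict(sockets)


# On a real node, pass open("/proc/cpuinfo").read() instead of SAMPLE.
print(cores_by_socket(SAMPLE))
```
</pre>
If two MPI processes end up on logical CPUs listed under the same physical id, they are on one dual-core chip, which is the case Matthew is asking about.<br>
<br>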

Thanks a lot!<br>

<br>

Matthew Knepley wrote:

<blockquote

 cite="mid:a9f269830804151920n3ede1433rdb231a2f7a88890d@mail.gmail.com"

 type="cite">

  <pre wrap="">On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay <a class="moz-txt-link-rfc2396E" href="mailto:zonexo@gmail.com">&lt;zonexo@gmail.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap=""> Hi,

 I just tested the ex2f.F example, changing m and n to 600. Here are the
results for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin,
MatGetOrdering and KSPSetup have ratios &gt;&gt;1. The time taken decreases
as the number of processors increases, although the speedup is not 1:1. I
thought that this example should scale well, shouldn't it? Is there
something wrong with my installation then?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

1) Notice that the events that are unbalanced take 0.01% of the time.
   Not important.

2) The speedup really stinks, even though this is a small problem. Are you
   sure that you are actually running on two processors with separate
   memory pipes and not on 1 dual core?

    Matt

  </pre>

  <blockquote type="cite">

    <pre wrap=""> Thank you.

 1 processor:

 Norm of error 0.3371E+01 iterations  1153

************************************************************************************************************************

 ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r

-fCourier9' to print this document            ***

************************************************************************************************************************

 ---------------------------------------------- PETSc Performance Summary:

----------------------------------------------

 ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed

Apr 16 10:03:12 2008

 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG

revision: 414581156e67e55c761739b0deb119f7590d0f4b

                          Max       Max/Min        Avg      Total

 Time (sec):           1.222e+02      1.00000   1.222e+02

 Objects:              4.400e+01      1.00000   4.400e+01

 Flops:                3.547e+10      1.00000   3.547e+10  3.547e+10

 Flops/sec:            2.903e+08      1.00000   2.903e+08  2.903e+08

 MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00

 MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00

 MPI Reductions:       2.349e+03      1.00000

 Flop counting convention: 1 flop = 1 real number operation of type

(multiply/divide/add/subtract)

                             e.g., VecAXPY() for real vectors of length N

--&gt; 2N flops

                             and VecAXPY() for complex vectors of length N

--&gt; 8N flops

 Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---

-- Message Lengths --  -- Reductions --

                         Avg     %Total     Avg     %Total   counts   %Total

Avg         %Total   counts   %Total

  0:      Main Stage: 1.2216e+02 100.0%  3.5466e+10 100.0%  0.000e+00   0.0%

0.000e+00        0.0%  2.349e+03 100.0%

------------------------------------------------------------------------------------------------------------------------

 See the 'Profiling' chapter of the users' manual for details on

interpreting output.

 Phase summary info:

    Count: number of times phase was executed

    Time and Flops/sec: Max - maximum over all processors

                        Ratio - ratio of maximum to minimum over all

processors

    Mess: number of messages sent

    Avg. len: average message length

    Reduct: number of global reductions

    Global: entire computation

    Stage: stages of a computation. Set stages with PetscLogStagePush() and

PetscLogStagePop().

       %T - percent time in this phase         %F - percent flops in this

phase

       %M - percent messages in this phase     %L - percent message lengths

in this phase

       %R - percent reductions in this phase

    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over

all processors)

------------------------------------------------------------------------------------------------------------------------

       ##########################################################

       #                                                        #

       #                          WARNING!!!                    #

       #                                                        #

       #   This code was run without the PreLoadBegin()         #

       #   macros. To get timing results we always recommend    #

       #   preloading. otherwise timing numbers may be          #

       #   meaningless.                                         #

       ##########################################################

 Event                Count      Time (sec)     Flops/sec

--- Global ---  --- Stage ---   Total

                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len

Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

------------------------------------------------------------------------------------------------------------------------

 --- Event Stage 0: Main Stage

 MatMult             1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00

0.0e+00 13 11  0  0  0  13 11  0  0  0   239

 MatSolve            1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00

0.0e+00 25 11  0  0  0  25 11  0  0  0   124

 MatLUFactorNum         1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0    89

 MatILUFactorSym        1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

1.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyBegin       1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyEnd         1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetRowIJ            1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetOrdering         1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

2.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecMDot             1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00

1.2e+03 27 36  0  0 49  27 36  0  0 49   392

 VecNorm             1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00

1.2e+03  2  2  0  0 51   2  2  0  0 51   422

 VecScale            1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00

0.0e+00  1  1  0  0  0   1  1  0  0  0   621

 VecCopy               39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecSet                41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecAXPY               78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00

0.0e+00  1  0  0  0  0   1  0  0  0  0    81

 VecMAXPY            1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00

0.0e+00 31 38  0  0  0  31 38  0  0  0   363

 VecNormalize        1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00

1.2e+03  2  4  0  0 51   2  4  0  0 51   472

 KSPGMRESOrthog      1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00

1.2e+03 56 72  0  0 49  56 72  0  0 49   376

 KSPSetup               1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 KSPSolve               1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00

2.3e+03100100  0  0100 100100  0  0100   292

 PCSetUp                1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00

3.0e+00  0  0  0  0  0   0  0  0  0  0    14

 PCApply             1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00

0.0e+00 25 11  0  0  0  25 11  0  0  0   124

------------------------------------------------------------------------------------------------------------------------

 Memory usage is given in bytes:

 Object Type          Creations   Destructions   Memory  Descendants' Mem.

 --- Event Stage 0: Main Stage

               Matrix     2              2   54691212     0

            Index Set     3              3    4321032     0

                  Vec    37             37  103708408     0

        Krylov Solver     1              1      17216     0

       Preconditioner     1              1        168     0

========================================================================================================================

 Average time to get PetscTime(): 1.90735e-07

 OptionTable: -log_summary

 Compiled without FORTRAN kernels

 Compiled with full precision matrices (default)

 sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8

sizeof(PetscScalar) 8

 Configure run at: Tue Jan  8 22:22:08 2008

 Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8

--sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8

--sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4

--sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0

--with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0

--with-batch=1 --with-mpi-shared=0

--with-mpi-include=/usr/local/topspin/mpi/mpich/include

--with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a

--with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun

--with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0

 -----------------------------------------

 Libraries compiled on Tue Jan  8 22:34:13 SGT 2008 on atlas3-c01

 Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12

23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

 Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8

 Using PETSc arch: atlas3-mpi

 -----------------------------------------

 85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (16major+46429minor)pagefaults 0swaps

 Job  /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary

 2 processors:

 Norm of error 0.3231E+01 iterations  1177

************************************************************************************************************************

 ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r

-fCourier9' to print this document            ***

************************************************************************************************************************

 ---------------------------------------------- PETSc Performance Summary:

----------------------------------------------

 ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed

Apr 16 09:48:37 2008

 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG

revision: 414581156e67e55c761739b0deb119f7590d0f4b

                          Max       Max/Min        Avg      Total

 Time (sec):           1.034e+02      1.00000   1.034e+02

 Objects:              5.500e+01      1.00000   5.500e+01

 Flops:                1.812e+10      1.00000   1.812e+10  3.625e+10

 Flops/sec:            1.752e+08      1.00000   1.752e+08  3.504e+08

 MPI Messages:         1.218e+03      1.00000   1.218e+03  2.436e+03

 MPI Message Lengths:  5.844e+06      1.00000   4.798e+03  1.169e+07

 MPI Reductions:       1.204e+03      1.00000

 Flop counting convention: 1 flop = 1 real number operation of type

(multiply/divide/add/subtract)

                             e.g., VecAXPY() for real vectors of length N

--&gt; 2N flops

                             and VecAXPY() for complex vectors of length N

--&gt; 8N flops

 Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---

-- Message Lengths --  -- Reductions --

                         Avg     %Total     Avg     %Total   counts   %Total

Avg         %Total   counts   %Total

  0:      Main Stage: 1.0344e+02 100.0%  3.6250e+10 100.0%  2.436e+03 100.0%

4.798e+03      100.0%  2.407e+03 100.0%

------------------------------------------------------------------------------------------------------------------------

 See the 'Profiling' chapter of the users' manual for details on

interpreting output.

 Phase summary info:

    Count: number of times phase was executed

    Time and Flops/sec: Max - maximum over all processors

                        Ratio - ratio of maximum to minimum over all

processors

    Mess: number of messages sent

    Avg. len: average message length

    Reduct: number of global reductions

    Global: entire computation

    Stage: stages of a computation. Set stages with PetscLogStagePush() and

PetscLogStagePop().

       %T - percent time in this phase         %F - percent flops in this

phase

       %M - percent messages in this phase     %L - percent message lengths

in this phase

       %R - percent reductions in this phase

    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over

all processors)

------------------------------------------------------------------------------------------------------------------------

       ##########################################################

       #                                                        #

       #                          WARNING!!!                    #

       #                                                        #

       #   This code was run without the PreLoadBegin()         #

       #   macros. To get timing results we always recommend    #

       #   preloading. otherwise timing numbers may be          #

       #   meaningless.                                         #

       ##########################################################

 Event                Count      Time (sec)     Flops/sec

--- Global ---  --- Stage ---   Total

                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len

Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

------------------------------------------------------------------------------------------------------------------------

 --- Event Stage 0: Main Stage

 MatMult             1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03

0.0e+00 11 11100100  0  11 11100100  0   315

 MatSolve            1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00

0.0e+00 19 11  0  0  0  19 11  0  0  0   187

 MatLUFactorNum         1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0    39

 MatILUFactorSym        1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00

1.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyBegin       1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00

2.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyEnd         1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03

7.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetRowIJ            1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetOrdering         1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00

2.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecMDot             1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00

1.2e+03 37 36  0  0 49  37 36  0  0 49   323

 VecNorm             1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00

1.2e+03 12  2  0  0 51  12  2  0  0 51    57

 VecScale            1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00

0.0e+00  1  1  0  0  0   1  1  0  0  0   757

 VecCopy               40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecSet              1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  1  0  0  0  0   1  0  0  0  0     0

 VecAXPY               80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0   272

 VecMAXPY            1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00

0.0e+00 19 38  0  0  0  19 38  0  0  0   606

 VecScatterBegin     1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03

0.0e+00  0  0100100  0   0  0100100  0     0

 VecScatterEnd       1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  1  0  0  0  0   1  0  0  0  0     0

 VecNormalize        1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00

1.2e+03 12  4  0  0 51  12  4  0  0 51    82

 KSPGMRESOrthog      1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00

1.2e+03 55 72  0  0 49  55 72  0  0 49   457

 KSPSetup               2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 KSPSolve               1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03

2.4e+03 99100100100100  99100100100100   352

 PCSetUp                2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00

3.0e+00  0  0  0  0  0   0  0  0  0  0    21

 PCSetUpOnBlocks        1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00

3.0e+00  0  0  0  0  0   0  0  0  0  0    21

 PCApply             1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00

0.0e+00 20 11  0  0  0  20 11  0  0  0   174

------------------------------------------------------------------------------------------------------------------------

 Memory usage is given in bytes:

 Object Type          Creations   Destructions   Memory  Descendants' Mem.

 --- Event Stage 0: Main Stage

               Matrix     4              4   34540820     0

            Index Set     5              5    2164120     0

                  Vec    41             41   53315992     0

          Vec Scatter     1              1          0     0

        Krylov Solver     2              2      17216     0

       Preconditioner     2              2        256     0

========================================================================================================================

 Average time to get PetscTime(): 1.90735e-07

 Average time for MPI_Barrier(): 8.10623e-07

 Average time for zero size MPI_Send(): 2.98023e-06

 OptionTable: -log_summary

 Compiled without FORTRAN kernels

 Compiled with full precision matrices (default)

 sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8

sizeof(PetscScalar) 8

 Configure run at: Tue Jan  8 22:22:08 2008

 42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (18major+28609minor)pagefaults 0swaps

 1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

 0inputs+0outputs (18major+23666minor)pagefaults 0swaps

 4 processors:

 Norm of error 0.3090E+01 iterations   937

 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (16major+13520minor)pagefaults 0swaps

 53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (15major+13414minor)pagefaults 0swaps

 58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (17major+18383minor)pagefaults 0swaps

 20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata

0maxresident)k

 0inputs+0outputs (14major+18392minor)pagefaults 0swaps

 Job  /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary

************************************************************************************************************************

 ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r

-fCourier9' to print this document            ***

************************************************************************************************************************

 ---------------------------------------------- PETSc Performance Summary:

----------------------------------------------

 ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed

Apr 16 09:55:16 2008

 Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG

revision: 414581156e67e55c761739b0deb119f7590d0f4b

                          Max       Max/Min        Avg      Total

 Time (sec):           6.374e+01      1.00001   6.374e+01

 Objects:              5.500e+01      1.00000   5.500e+01

 Flops:                7.209e+09      1.00016   7.208e+09  2.883e+10

 Flops/sec:            1.131e+08      1.00017   1.131e+08  4.524e+08

 MPI Messages:         1.940e+03      2.00000   1.455e+03  5.820e+03

 MPI Message Lengths:  9.307e+06      2.00000   4.798e+03  2.792e+07

 MPI Reductions:       4.798e+02      1.00000

 Flop counting convention: 1 flop = 1 real number operation of type

(multiply/divide/add/subtract)

                             e.g., VecAXPY() for real vectors of length N

--&gt; 2N flops

                             and VecAXPY() for complex vectors of length N

--&gt; 8N flops

 Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---

-- Message Lengths --  -- Reductions --

                         Avg     %Total     Avg     %Total   counts   %Total

Avg         %Total   counts   %Total

  0:      Main Stage: 6.3737e+01 100.0%  2.8832e+10 100.0%  5.820e+03 100.0%

4.798e+03      100.0%  1.919e+03 100.0%

------------------------------------------------------------------------------------------------------------------------

 See the 'Profiling' chapter of the users' manual for details on

interpreting output.

 Phase summary info:

    Count: number of times phase was executed

    Time and Flops/sec: Max - maximum over all processors

                        Ratio - ratio of maximum to minimum over all

processors

    Mess: number of messages sent

    Avg. len: average message length

    Reduct: number of global reductions

    Global: entire computation

    Stage: stages of a computation. Set stages with PetscLogStagePush() and

PetscLogStagePop().

       %T - percent time in this phase         %F - percent flops in this

phase

       %M - percent messages in this phase     %L - percent message lengths

in this phase

       %R - percent reductions in this phase

    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over

all processors)

------------------------------------------------------------------------------------------------------------------------

       ##########################################################

       #                                                        #

       #                          WARNING!!!                    #

       #                                                        #

       #   This code was run without the PreLoadBegin()         #

       #   macros. To get timing results we always recommend    #

       #   preloading. otherwise timing numbers may be          #

       #   meaningless.                                         #

       ##########################################################

 Event                Count      Time (sec)     Flops/sec

--- Global ---  --- Stage ---   Total

                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len

Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

------------------------------------------------------------------------------------------------------------------------

 --- Event Stage 0: Main Stage

 MatMult              969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03

0.0e+00  8 11100100  0   8 11100100  0   321

 MatSolve             969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00

0.0e+00 11 11  0  0  0  11 11  0  0  0   220

 MatLUFactorNum         1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0    62

 MatILUFactorSym        1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00

1.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyBegin       1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00

2.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatAssemblyEnd         1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03

7.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetRowIJ            1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 MatGetOrdering         1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00

2.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecMDot              937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00

9.4e+02 48 36  0  0 49  48 36  0  0 49   292

 VecNorm              970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00

9.7e+02 18  2  0  0 51  18  2  0  0 51    49

 VecScale             969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00

0.0e+00  0  1  0  0  0   0  1  0  0  0  2220

 VecCopy               32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 VecSet              1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  1  0  0  0  0   1  0  0  0  0     0

 VecAXPY               64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0  2185

 VecMAXPY             969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00

0.0e+00 11 38  0  0  0  11 38  0  0  0   747

 VecScatterBegin      969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03

0.0e+00  0  0100100  0   0  0100100  0     0

 VecScatterEnd        969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  1  0  0  0  0   1  0  0  0  0     0

 VecNormalize         969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00

9.7e+02 18  4  0  0 50  18  4  0  0 50    72

 KSPGMRESOrthog       937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00

9.4e+02 59 72  0  0 49  59 72  0  0 49   521

 KSPSetup               2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00

0.0e+00  0  0  0  0  0   0  0  0  0  0     0

 KSPSolve               1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03

1.9e+03 98100100100 99  98100100100 99   461

 PCSetUp                2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00

3.0e+00  0  0  0  0  0   0  0  0  0  0    45

 PCSetUpOnBlocks        1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00

3.0e+00  0  0  0  0  0   0  0  0  0  0    45

 PCApply              969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00

0.0e+00 12 11  0  0  0  12 11  0  0  0   203

------------------------------------------------------------------------------------------------------------------------

 Memory usage is given in bytes:

 Object Type          Creations   Destructions   Memory  Descendants' Mem.

 --- Event Stage 0: Main Stage

               Matrix     4              4   17264420     0

            Index Set     5              5    1084120     0

                  Vec    41             41   26675992     0

          Vec Scatter     1              1          0     0

        Krylov Solver     2              2      17216     0

       Preconditioner     2              2        256     0

========================================================================================================================

 Average time to get PetscTime(): 1.90735e-07

 Average time for MPI_Barrier(): 6.00815e-06

 Average time for zero size MPI_Send(): 5.42402e-05

 OptionTable: -log_summary

 Compiled without FORTRAN kernels

 Compiled with full precision matrices (default)

 sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8

sizeof(PetscScalar) 8

 Configure run at: Tue Jan  8 22:22:08 2008

 Matthew Knepley wrote:

 The convergence here is just horrendous. Have you tried using LU to check

your implementation? All the time is in the solve right now. I would first

try a direct method (at least on a small problem) and then try to understand

the convergence behavior. MUMPS can actually scale very well for big

problems.

 Matt

    </pre>

  </blockquote>

  <pre wrap=""><!---->

  </pre>

</blockquote>

</body>

</html>