[petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems

Junchao Zhang jczhang at mcs.anl.gov
Wed Jun 13 13:09:31 CDT 2018


Mark,
  Yes, it is a 7-point stencil. I tried your options,
-pc_gamg_square_graph 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition,
and found they increased the time. I did not try hypre since I don't
know how to set its options.
  I also tried a periodic boundary condition and ran it with -mat_view
::load_balance. It gives fewer KSP iterations, but PETSc still reports
load imbalance at the coarse levels.


--Junchao Zhang

On Tue, Jun 12, 2018 at 3:17 PM, Mark Adams <mfadams at lbl.gov> wrote:

> This all looks reasonable to me. The VecScatter times are a little high,
> but these are fast little solves (0.2 seconds each).
>
> The RAP times are very low, suggesting we could optimize parameters a bit
> and reduce the iteration count. These are 7-point stencils as I recall. You
> could try -pc_gamg_square_graph 0 (instead of 1) and you probably want
> '-pc_gamg_threshold 0.0'.  You could also test hypre.
>
> And you should be able to improve coarse grid load imbalance with
> -pc_gamg_repartition.
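>
> For concreteness, the suggestions above combined on one command line would look
> roughly like this (just the options already mentioned assembled in one place,
> not a tuned configuration):
>
>   -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
>   -pc_gamg_square_graph 0 -pc_gamg_threshold 0.0 -pc_gamg_repartition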
>
> Mark
>
> On Tue, Jun 12, 2018 at 12:32 PM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
>
>> Mark,
>>   I tried "-pc_gamg_type agg ..." options you mentioned, and also
>> -ksp_type cg + PETSc's default PC bjacobi. In the latter case, to reduce
>> execution time I called KSPSolve 100 times instead of 1000, and also
>> used -ksp_max_it 100. In the 36x48=1728 ranks case, I also did a test with
>> -log_sync. From there you can see that a lot of time is spent in VecNormBarrier,
>> which implies load imbalance. Note that the VecScatterBarrier time is misleading,
>> since it barriers ALL ranks, whereas in reality VecScatter only synchronizes
>> within a small neighborhood.
>>   Barry suggested trying a periodic boundary condition so that the nonzeros
>> are perfectly balanced across processes. I will try that to see what
>> happens.
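>>   For reference, here is a minimal sketch (my own illustration, not the code
>> used in these tests; error checking omitted) of how fully periodic boundaries
>> are requested when the DMDA is created; with periodic boundaries every rank
>> owns an identical 7-point stencil block, so the nonzero counts are perfectly
>> balanced:
>>
>>     DM da;
>>     /* 180^3 global grid, 1 dof, stencil width 1, star (7-point) stencil */
>>     DMDACreate3d(PETSC_COMM_WORLD,
>>                  DM_BOUNDARY_PERIODIC,DM_BOUNDARY_PERIODIC,DM_BOUNDARY_PERIODIC,
>>                  DMDA_STENCIL_STAR,180,180,180,
>>                  PETSC_DECIDE,PETSC_DECIDE,PETSC_DECIDE,
>>                  1,1,NULL,NULL,NULL,&da);
>>     DMSetUp(da);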
>>
>> --Junchao Zhang
>>
>> On Mon, Jun 11, 2018 at 8:09 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>>>
>>>
>>> On Mon, Jun 11, 2018 at 12:46 AM, Junchao Zhang <jczhang at mcs.anl.gov>
>>> wrote:
>>>
>>>> I used an LCRC machine named Bebop. I tested on its Intel Broadwell
>>>> nodes. Each node has 2 CPUs and 36 cores in total. I collected data using
>>>> 36 cores per node or 18 cores per node.  As you can see, 18 cores/node
>>>> gave much better performance, which is reasonable since routines like MatSOR,
>>>> MatMult, and MatMultAdd are all bandwidth bound.
>>>>
>>>> The code uses a DMDA 3D grid with a 7-point stencil, and defines
>>>> nodes (vertices) at the surface, or in the layer next to it, as boundary nodes.
>>>> Boundary nodes only have a diagonal entry in their matrix row, while interior
>>>> nodes have 7 nonzeros in their row. Processes on the boundary of the processor
>>>> grid therefore have fewer nonzeros; this is one source of load imbalance. Will
>>>> the load imbalance get worse on the coarser grids of the MG hierarchy?
>>>>
>>>
>>> Yes.
>>>
>>> You can use a simple Jacobi solver to see the basic performance of your
>>> operator and machine. Do you see as much time spent in Vec Scatters?
>>> VecAXPY? etc.
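>>>
>>> For such a baseline, something along these lines should be enough (standard
>>> options only; adjust the iteration cap as needed):
>>>
>>>   -ksp_type cg -pc_type jacobi -ksp_max_it 100 -log_view -log_sync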
>>>
>>>
>>>>
>>>> I attach a trace-view figure that shows the activity of each rank along the
>>>> time axis in one KSPSolve. White means MPI wait, and you can see that white
>>>> takes up a large fraction of the time.
>>>>
>>>> I don't have a good explanation for why, at large scale (1728 cores),
>>>> processors wait longer, since the communication pattern is still a 7-point
>>>> stencil on a cubic processor grid.
>>>>
>>>> --Junchao Zhang
>>>>
>>>> On Sat, Jun 9, 2018 at 11:32 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>>   Junchao,
>>>>>
>>>>>       Thanks, the load balance of matrix entries is remarkably similar
>>>>> for the two runs, so worse work-load imbalance in SOR for the larger case
>>>>> cannot be what explains why the SOR takes more time there.
>>>>>
>>>>>       Here is my guess (and I know of no way to confirm it). In the
>>>>> smaller case the overlap of different processes on the same node running
>>>>> SOR at the same time is lower than in the larger case, so the larger case is
>>>>> slower because more SOR processes are fighting over the same memory
>>>>> bandwidth at the same time.   Ahh, here is something you can try: let's
>>>>> undersubscribe the memory bandwidth, i.e. run on, say, 16 processes per node
>>>>> with 8 nodes and 16 processes per node with 64 nodes, and send the two
>>>>> -log_view output files. I assume this is an LCRC machine and NOT a KNL
>>>>> system?
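>>>>>
>>>>> (Depending on the launcher, that undersubscription would be requested with
>>>>> something like "mpirun -ppn 16 -n 128 ./wstest ..." for the 8-node case and
>>>>> "mpirun -ppn 16 -n 1024 ./wstest ..." for the 64-node case, or with Slurm's
>>>>> "srun -N 8 --ntasks-per-node=16"; the exact flags depend on the MPI and
>>>>> scheduler setup, so treat these as illustrative.)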
>>>>>
>>>>>    Thanks
>>>>>
>>>>>
>>>>>    Barry
>>>>>
>>>>>
>>>>> > On Jun 9, 2018, at 8:29 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>> >
>>>>> > -pc_gamg_type classical
>>>>> >
>>>>> > FYI, we only support smoothed aggregation "agg" (the default). (This
>>>>> thread started by saying you were using GAMG.)
>>>>> >
>>>>> > It is not clear how much this will make a difference for you, but
>>>>> you don't want to use classical because we do not support it. It is meant
>>>>> as a reference implementation for developers.
>>>>> >
>>>>> > First, how did you get the idea to use classical? If the
>>>>> documentation led you to believe this was a good thing to do then we need
>>>>> to fix that!
>>>>> >
>>>>> > Anyway, here is a generic input for GAMG:
>>>>> >
>>>>> > -pc_type gamg
>>>>> > -pc_gamg_type agg
>>>>> > -pc_gamg_agg_nsmooths 1
>>>>> > -pc_gamg_coarse_eq_limit 1000
>>>>> > -pc_gamg_reuse_interpolation true
>>>>> > -pc_gamg_square_graph 1
>>>>> > -pc_gamg_threshold 0.05
>>>>> > -pc_gamg_threshold_scale .0
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Thu, Jun 7, 2018 at 6:52 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>> wrote:
>>>>> > OK, I had thought that space was a typo. BTW, this option does not
>>>>> show up in -h.
>>>>> > I changed the number of ranks to use all cores on each node to avoid
>>>>> misleading ratios in -log_view. Since one node has 36 cores, I ran with
>>>>> 6^3=216 ranks and 12^3=1728 ranks. I also found that the call counts of MatSOR
>>>>> etc. in the two tests were different, so they are not strict weak scaling tests.
>>>>> I tried adding -ksp_max_it 6 -pc_mg_levels 6, but still could not make the
>>>>> two have the same MatSOR count. Anyway, I attached the load balance output.
>>>>> >
>>>>> > I find that PCApply_MG calls PCMGMCycle_Private, which is recursive and
>>>>> indirectly calls MatSOR_MPIAIJ. I believe the following code in
>>>>> MatSOR_MPIAIJ effectively synchronizes {MatSOR, MatMultAdd}_SeqAIJ between
>>>>> processes through VecScatter at each MG level. If SOR and MatMultAdd are
>>>>> imbalanced, the cost accumulates across MG levels and shows up as a large
>>>>> VecScatter cost.
>>>>> > 1460:     while (its--) {
>>>>> > 1461:       VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>>> > 1462:       VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
>>>>> >
>>>>> > 1464:       /* update rhs: bb1 = bb - B*x */
>>>>> > 1465:       VecScale(mat->lvec,-1.0);
>>>>> > 1466:       (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);
>>>>> >
>>>>> > 1468:       /* local sweep */
>>>>> > 1469:       (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
>>>>> > 1470:     }
>>>>> >
>>>>> >
>>>>> >
>>>>> > --Junchao Zhang
>>>>> >
>>>>> > On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsmith at mcs.anl.gov>
>>>>> wrote:
>>>>> >
>>>>> >
>>>>> > > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczhang at mcs.anl.gov>
>>>>> wrote:
>>>>> > >
>>>>> > > Searched but could not find this option, -mat_view::load_balance
>>>>> >
>>>>> >    There is a space between the view and the ":".  load_balance is a
>>>>> particular viewer format that causes the printing of load-balance
>>>>> information about the number of nonzeros in the matrix.
>>>>> >
>>>>> >    Barry
>>>>> >
>>>>> > >
>>>>> > > --Junchao Zhang
>>>>> > >
>>>>> > > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <
>>>>> bsmith at mcs.anl.gov> wrote:
>>>>> > >  So the only surprise in the results is the SOR. It is
>>>>> embarrassingly parallel and normally one would not see a jump.
>>>>> > >
>>>>> > >  The load balance for SOR time (1.5) is better at 1000 processes
>>>>> than at 125 processes (2.1), not worse, so this number doesn't easily
>>>>> explain it.
>>>>> > >
>>>>> > >  Could you run the 125 and 1000 with -mat_view ::load_balance and
>>>>> see what you get out?
>>>>> > >
>>>>> > >    Thanks
>>>>> > >
>>>>> > >      Barry
>>>>> > >
>>>>> > >  Notice that the MatSOR time jumps a lot (about 5 secs) when
>>>>> -log_sync is on. My only guess is that MatSOR is sharing memory
>>>>> bandwidth (or some other resource? cores?) with the VecScatter, and for some
>>>>> reason this is worse for 1000 cores, but I don't know why.
>>>>> > >
>>>>> > > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>> wrote:
>>>>> > > >
>>>>> > > > Hi, PETSc developers,
>>>>> > > >  I tested Michael Becker's code. The code calls the same
>>>>> KSPSolve 1000 times in the second stage and needs a cubic number of
>>>>> processes to run. I ran with 125 ranks and 1000 ranks, with and without the
>>>>> -log_sync option. I attach the log_view output files and a scaling-loss
>>>>> Excel file.
>>>>> > > >  I profiled the code with 125 processes. It looks like {MatSOR,
>>>>> MatMult, MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c
>>>>> took ~50% of the time; the other half was spent waiting in MPI.
>>>>> MatSOR_SeqAIJ took 30%, mostly in PetscSparseDenseMinusDot().
>>>>> > > >  I tested it on a 36-cores/node machine. I found 32 ranks/node
>>>>> gave better performance (about 10%) than 36 ranks/node in the 125-rank
>>>>> test.  I guess this is because the processes in the former case had more balanced
>>>>> memory bandwidth. I collected PAPI_DP_OPS (double precision operations) and
>>>>> PAPI_TOT_CYC (total cycles) for the 125-rank case (see the attached files).
>>>>> It looks like ranks at the two ends have fewer DP_OPS and TOT_CYC.
>>>>> > > >  Does anyone familiar with the algorithm have a quick explanation?
>>>>> > > >
>>>>> > > > --Junchao Zhang
>>>>> > > >
>>>>> > > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
>>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>>> > > > Hello again,
>>>>> > > >
>>>>> > > > this took me longer than I anticipated, but here we go.
>>>>> > > > I did reruns of the cases where only half the processes per node
>>>>> were used (without -log_sync):
>>>>> > > >
>>>>> > > >                    125 procs,1st         125 procs,2nd         1000 procs,1st        1000 procs,2nd
>>>>> > > >                    Max        Ratio      Max        Ratio      Max        Ratio      Max        Ratio
>>>>> > > > KSPSolve           1.203E+02   1.0       1.210E+02   1.0       1.399E+02   1.1       1.365E+02   1.0
>>>>> > > > VecTDot            6.376E+00   3.7       6.551E+00   4.0       7.885E+00   2.9       7.175E+00   3.4
>>>>> > > > VecNorm            4.579E+00   7.1       5.803E+00  10.2       8.534E+00   6.9       6.026E+00   4.9
>>>>> > > > VecScale           1.070E-01   2.1       1.129E-01   2.2       1.301E-01   2.5       1.270E-01   2.4
>>>>> > > > VecCopy            1.123E-01   1.3       1.149E-01   1.3       1.301E-01   1.6       1.359E-01   1.6
>>>>> > > > VecSet             7.063E-01   1.7       6.968E-01   1.7       7.432E-01   1.8       7.425E-01   1.8
>>>>> > > > VecAXPY            1.166E+00   1.4       1.167E+00   1.4       1.221E+00   1.5       1.279E+00   1.6
>>>>> > > > VecAYPX            1.317E+00   1.6       1.290E+00   1.6       1.536E+00   1.9       1.499E+00   2.0
>>>>> > > > VecScatterBegin    6.142E+00   3.2       5.974E+00   2.8       6.448E+00   3.0       6.472E+00   2.9
>>>>> > > > VecScatterEnd      3.606E+01   4.2       3.551E+01   4.0       5.244E+01   2.7       4.995E+01   2.7
>>>>> > > > MatMult            3.561E+01   1.6       3.403E+01   1.5       3.435E+01   1.4       3.332E+01   1.4
>>>>> > > > MatMultAdd         1.124E+01   2.0       1.130E+01   2.1       2.093E+01   2.9       1.995E+01   2.7
>>>>> > > > MatMultTranspose   1.372E+01   2.5       1.388E+01   2.6       1.477E+01   2.2       1.381E+01   2.1
>>>>> > > > MatSolve           1.949E-02   0.0       1.653E-02   0.0       4.789E-02   0.0       4.466E-02   0.0
>>>>> > > > MatSOR             6.610E+01   1.3       6.673E+01   1.3       7.111E+01   1.3       7.105E+01   1.3
>>>>> > > > MatResidual        2.647E+01   1.7       2.667E+01   1.7       2.446E+01   1.4       2.467E+01   1.5
>>>>> > > > PCSetUpOnBlocks    5.266E-03   1.4       5.295E-03   1.4       5.427E-03   1.5       5.289E-03   1.4
>>>>> > > > PCApply            1.031E+02   1.0       1.035E+02   1.0       1.180E+02   1.0       1.164E+02   1.0
>>>>> > > >
>>>>> > > > I also slimmed down my code and basically wrote a simple weak
>>>>> scaling test (source files attached) so you can profile it yourself. I
>>>>> appreciate the offer, Junchao, thank you.
>>>>> > > > You can adjust the system size per process at runtime via
>>>>> "-nodes_per_proc 30" and the number of repeated calls to the function
>>>>> containing KSPSolve() via "-iterations 1000". The physical problem is
>>>>> simply calculating the electric potential from a homogeneous charge
>>>>> distribution, done multiple times to accumulate time in KSPSolve().
>>>>> > > > A job would be started using something like
>>>>> > > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size
>>>>> 1E-4 -iterations 1000 \
>>>>> > > > -ksp_rtol 1E-6 \
>>>>> > > > -log_view -log_sync\
>>>>> > > > -pc_type gamg -pc_gamg_type classical\
>>>>> > > > -ksp_type cg \
>>>>> > > > -ksp_norm_type unpreconditioned \
>>>>> > > > -mg_levels_ksp_type richardson \
>>>>> > > > -mg_levels_ksp_norm_type none \
>>>>> > > > -mg_levels_pc_type sor \
>>>>> > > > -mg_levels_ksp_max_it 1 \
>>>>> > > > -mg_levels_pc_sor_its 1 \
>>>>> > > > -mg_levels_esteig_ksp_type cg \
>>>>> > > > -mg_levels_esteig_ksp_max_it 10 \
>>>>> > > > -gamg_est_ksp_type cg
>>>>> > > > , ideally started on a cubic number of processes for a cubical
>>>>> process grid.
>>>>> > > > Using 125 processes and 10,000 iterations I get the output in
>>>>> "log_view_125_new.txt", which shows the same imbalance for me.
>>>>> > > > Michael
>>>>> > > >
>>>>> > > >
>>>>> > > > On 02.06.2018 at 13:40, Mark Adams wrote:
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <
>>>>> jczhang at mcs.anl.gov> wrote:
>>>>> > > >> Hi, Michael,
>>>>> > > >>  You can add -log_sync in addition to -log_view, which adds barriers
>>>>> to certain events but measures the barrier time separately from the events. I
>>>>> find this option makes it easier to interpret log_view output.
>>>>> > > >>
>>>>> > > >> That is great (good to know).
>>>>> > > >>
>>>>> > > >> This should give us a better idea of whether your large VecScatter
>>>>> costs are from slow communication or whether they are catching some sort of load
>>>>> imbalance.
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> --Junchao Zhang
>>>>> > > >>
>>>>> > > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
>>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>>> > > >> Barry: On its way. Could take a couple days again.
>>>>> > > >>
>>>>> > > >> Junchao: I unfortunately don't have access to a cluster with a
>>>>> faster network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking
>>>>> fat-tree network, which I realize causes parallel slowdown if the nodes are
>>>>> not connected to the same switch. Each node has 24 cores (2 sockets x 12 cores)
>>>>> and four NUMA domains (two per socket).
>>>>> > > >> The ranks are usually not distributed perfectly evenly, i.e. for
>>>>> 125 processes, of the six required nodes, five would use 21 cores and one
>>>>> would use 20.
>>>>> > > >> Would using another CPU type make a difference
>>>>> communication-wise? I could switch to faster ones (on the same network),
>>>>> but I always assumed this would only improve performance of the stuff that
>>>>> is unrelated to communication.
>>>>> > > >>
>>>>> > > >> Michael
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>> The log files have something like "Average time for zero size
>>>>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very slow
>>>>> network. A typical machine should give less than 1/10 of the latency you
>>>>> have. An easy thing to try is just running the code on a machine with a
>>>>> faster network and seeing what happens.
>>>>> > > >>>
>>>>> > > >>> Also, how many cores & numa domains does a compute node have?
>>>>> I could not figure out how you distributed the 125 MPI ranks evenly.
>>>>> > > >>>
>>>>> > > >>> --Junchao Zhang
>>>>> > > >>>
>>>>> > > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
>>>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>>> > > >>> Hello again,
>>>>> > > >>>
>>>>> > > >>> here are the updated log_view files for 125 and 1000
>>>>> processors. I ran both problems twice, the first time with all processors
>>>>> per node allocated ("-1.txt"), the second with only half on twice the
>>>>> number of nodes ("-2.txt").
>>>>> > > >>>
>>>>> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>>>>> Michael.Becker at physik.uni-giessen.de>
>>>>> > > >>>>> wrote:
>>>>> > > >>>>>
>>>>> > > >>>>> I noticed that for every individual KSP iteration, six
>>>>> vector objects are created and destroyed (with CG, more with e.g. GMRES).
>>>>> > > >>>>>
>>>>> > > >>>>   Hmm, it is certainly not intended that vectors be created and
>>>>> destroyed within each KSPSolve(); could you please point us to the code that
>>>>> makes you think they are being created and destroyed?   We create all the
>>>>> work vectors in KSPSetUp() and destroy them in KSPReset(), not during the
>>>>> solve. Not that this would be a measurable difference.
>>>>> > > >>>>
>>>>> > > >>>
>>>>> > > >>> I mean this, right in the log_view output:
>>>>> > > >>>
>>>>> > > >>>> Memory usage is given in bytes:
>>>>> > > >>>>
>>>>> > > >>>> Object Type Creations Destructions Memory Descendants' Mem.
>>>>> > > >>>> Reports information only for process 0.
>>>>> > > >>>>
>>>>> > > >>>> --- Event Stage 0: Main Stage
>>>>> > > >>>>
>>>>> > > >>>> ...
>>>>> > > >>>>
>>>>> > > >>>> --- Event Stage 1: First Solve
>>>>> > > >>>>
>>>>> > > >>>> ...
>>>>> > > >>>>
>>>>> > > >>>> --- Event Stage 2: Remaining Solves
>>>>> > > >>>>
>>>>> > > >>>> Vector 23904 23904 1295501184 0.
>>>>> > > >>> I logged the exact number of KSP iterations over the 999
>>>>> timesteps and it's exactly 23904/6 = 3984.
>>>>> > > >>> Michael
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
>>>>> > > >>>>
>>>>> > > >>>>  Please send the log file for 1000 with cg as the solver.
>>>>> > > >>>>
>>>>> > > >>>>   You should make a bar chart of each event for the two cases
>>>>> to see which ones are taking more time and which are taking less (we cannot
>>>>> tell with the two logs you sent us since they are for different solvers.)
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
>>>>> Michael.Becker at physik.uni-giessen.de>
>>>>> > > >>>>> wrote:
>>>>> > > >>>>>
>>>>> > > >>>>> I noticed that for every individual KSP iteration, six
>>>>> vector objects are created and destroyed (with CG, more with e.g. GMRES).
>>>>> > > >>>>>
>>>>> > > >>>>   Hmm, it is certainly not intended that vectors be created and
>>>>> destroyed within each KSPSolve(); could you please point us to the code that
>>>>> makes you think they are being created and destroyed?   We create all the
>>>>> work vectors in KSPSetUp() and destroy them in KSPReset(), not during the
>>>>> solve. Not that this would be a measurable difference.
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>
>>>>> > > >>>>> This seems kind of wasteful; is it supposed to be like
>>>>> this? Is this even the reason for my problems? Apart from that, everything
>>>>> seems quite normal to me (but I'm not the expert here).
>>>>> > > >>>>>
>>>>> > > >>>>>
>>>>> > > >>>>> Thanks in advance.
>>>>> > > >>>>>
>>>>> > > >>>>> Michael
>>>>> > > >>>>>
>>>>> > > >>>>>
>>>>> > > >>>>>
>>>>> > > >>>>> <log_view_125procs.txt><log_vi
>>>>> > > >>>>> ew_1000procs.txt>
>>>>> > > >>>>>
>>>>> > > >>>
>>>>> > > >>>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >
>>>>> > > >
>>>>> > > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-ws
>>>>> test-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png
>>>>> ><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
>>>>> > >
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
using 216 of 216 processes
30^3 unknowns per processor
total system size: 180^3
mesh size: 0.0001

Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 189000  avg 189000  max 189000
Mat Object: 216 MPI processes
  type: mpiaij
  Mat Object: 216 MPI processes
    type: mpiaij
    Load Balance - Nonzeros: Min 189000  avg 189000  max 189000
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 189000  avg 189000  max 189000
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 94500  avg 96739  max 106784
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 94500  avg 95898  max 102716
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 248220  avg 259246  max 273660
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 246056  avg 259040  max 273525
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 111528  avg 117225  max 133209
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 58106  avg 66984  max 69685
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 89708  avg 116072  max 137752
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 87894  avg 115602  max 136661
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 20878  avg 27301  max 37274
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 5809  avg 7404  max 9417
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 3833  avg 5850  max 13508
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 3106  avg 5439  max 11996
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 1079  avg 1677  max 3439
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 232  avg 426  max 977
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 487  max 2786
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 487  max 15513
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 232  avg 426  max 977
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 391  max 12579
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 51  max 1471
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 19  max 574
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 1  max 171
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 1  max 361
Mat Object: 216 MPI processes
  type: mpiaij
  Load Balance - Nonzeros: Min 0  avg 19  max 574
initsolve: 2 iterations
solve 1: 2 iterations
solve 2: 2 iterations
solve 3: 2 iterations
solve 4: 2 iterations
solve 5: 2 iterations
solve 6: 2 iterations
solve 7: 2 iterations
solve 8: 2 iterations
solve 9: 2 iterations
solve 10: 2 iterations
solve 20: 2 iterations
solve 30: 2 iterations
solve 40: 2 iterations
solve 50: 2 iterations
solve 60: 2 iterations
solve 70: 2 iterations
solve 80: 2 iterations
solve 90: 2 iterations
solve 100: 2 iterations
solve 200: 2 iterations
solve 300: 2 iterations
solve 400: 2 iterations
solve 500: 2 iterations
solve 600: 2 iterations
solve 700: 2 iterations
solve 800: 2 iterations
solve 900: 2 iterations
solve 1000: 2 iterations

Time in solve():      35.8368 s
Time in KSPSolve():   35.5561 s (99.2166%)

Number of   KSP iterations (total): 2000
Number of solve iterations (total): 1000 (ratio: 2.00)

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./wstest on a intel-bdw-opt named bdw-0088 with 216 processors, by jczhang Tue Jun 12 23:17:27 2018
Using Petsc Development GIT revision: v3.9.2-570-g68f20b90  GIT Date: 2018-06-04 15:39:16 +0200

                         Max       Max/Min        Avg      Total 
Time (sec):           1.461e+02      1.00000   1.461e+02
Objects:              1.042e+04      1.00010   1.042e+04
Flop:                 1.076e+10      1.12766   1.020e+10  2.203e+12
Flop/sec:            7.364e+07      1.12766   6.981e+07  1.508e+10
MPI Messages:         8.832e+05      1.87220   5.444e+05  1.176e+08
MPI Message Lengths:  7.040e+08      1.10023   1.209e+03  1.422e+11
MPI Reductions:       1.055e+04      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 9.1544e-02   0.1%  0.0000e+00   0.0%  2.592e+03   0.0%  1.802e+03        0.0%  1.700e+01   0.2% 
 1:     First Solve: 1.1015e+02  75.4%  4.6662e+09   0.2%  8.968e+05   0.8%  4.494e+03        2.8%  5.250e+02   5.0% 
 2: Remaining Solves: 3.5853e+01  24.5%  2.1983e+12  99.8%  1.167e+08  99.2%  1.184e+03       97.2%  1.000e+04  94.8% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 2 1.0 8.6784e-05 3.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: First Solve

BuildTwoSided         10 1.0 3.1507e-03 1.3 0.00e+00 0.0 2.4e+04 4.0e+00 0.0e+00  0  0  0  0  0   0  0  3  0  0     0
BuildTwoSidedF        27 1.0 6.3965e+00 2.6 0.00e+00 0.0 2.1e+04 1.1e+04 0.0e+00  3  0  0  0  0   4  0  2  6  0     0
KSPSetUp               8 1.0 3.2792e-03 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.6e+01  0  0  0  0  0   0  0  0  0  3     0
KSPSolve               1 1.0 1.1015e+02 1.0 2.33e+07 1.1 9.0e+05 4.5e+03 5.2e+02 75  0  1  3  5 100100100100100    42
VecTDot                3 1.0 9.1021e-0334.9 1.62e+05 1.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  1  0  0  1  3844
VecNorm                3 1.0 6.0630e-04 2.5 1.62e+05 1.0 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  1  0  0  1 57714
VecScale              10 1.0 1.1373e-04 2.9 2.97e+04 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 52201
VecCopy                1 1.0 1.3185e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                64 1.0 7.6294e-04 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY                2 1.0 1.5998e-04 1.5 1.08e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 145819
VecAYPX               11 1.0 3.8528e-04 1.7 1.12e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  1  0  0  0 61860
VecAssemblyBegin       2 1.0 4.5776e-0548.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         2 1.0 5.2214e-0554.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin       44 1.0 2.4009e-03 1.6 0.00e+00 0.0 1.2e+05 1.2e+03 0.0e+00  0  0  0  0  0   0  0 13  3  0     0
VecScatterEnd         44 1.0 1.5850e-02 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult               12 1.0 9.0702e-03 1.6 3.07e+06 1.1 3.8e+04 1.8e+03 0.0e+00  0  0  0  0  0   0 13  4  2  0 69092
MatMultAdd            10 1.0 3.8619e-03 1.9 6.98e+05 1.0 2.2e+04 5.6e+02 0.0e+00  0  0  0  0  0   0  3  2  0  0 38197
MatMultTranspose      10 1.0 5.2512e-03 2.0 6.98e+05 1.0 2.2e+04 5.6e+02 0.0e+00  0  0  0  0  0   0  3  2  0  0 28092
MatSolve               2 0.0 1.0014e-05 0.0 1.41e+03 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
MatSOR                20 1.0 2.4410e-02 1.6 5.77e+06 1.2 3.5e+04 1.4e+03 4.0e+00  0  0  0  0  0   0 25  4  1  1 47866
MatLUFactorSym         1 1.0 1.7500e-0413.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatLUFactorNum         1 1.0 6.8903e-0518.1 4.40e+03 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    64
MatResidual           10 1.0 7.2005e-03 1.8 2.45e+06 1.1 3.5e+04 1.4e+03 0.0e+00  0  0  0  0  0   0 11  4  1  0 68474
MatAssemblyBegin      82 1.0 6.4096e+00 2.6 0.00e+00 0.0 2.1e+04 1.1e+04 0.0e+00  3  0  0  0  0   4  0  2  6  0     0
MatAssemblyEnd        82 1.0 1.1275e-01 1.2 0.00e+00 0.0 1.8e+05 4.3e+02 2.1e+02  0  0  0  0  2   0  0 20  2 40     0
MatGetRow        3346824 1.1 5.1589e+01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34  0  0  0  0  46  0  0  0  0     0
MatGetRowIJ            1 0.0 8.1062e-06 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCreateSubMats       5 1.0 2.2660e-01 1.8 0.00e+00 0.0 1.6e+05 1.7e+04 1.0e+01  0  0  0  2  0   0  0 18 65  2     0
MatCreateSubMat        5 1.0 3.0296e-01 1.0 0.00e+00 0.0 3.7e+04 1.5e+04 8.4e+01  0  0  0  0  1   0  0  4 14 16     0
MatGetOrdering         1 0.0 5.6982e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatIncreaseOvrlp       5 1.0 3.2210e-02 1.1 0.00e+00 0.0 7.0e+04 9.6e+02 1.0e+01  0  0  0  0  0   0  0  8  2  2     0
MatCoarsen             5 1.0 2.0194e-02 1.1 0.00e+00 0.0 1.5e+05 5.5e+02 2.9e+01  0  0  0  0  0   0  0 16  2  6     0
MatZeroEntries         5 1.0 1.0461e-0210.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView               26 1.0 6.0523e-01 1.0 0.00e+00 0.0 3.3e+04 1.7e+04 5.1e+01  0  0  0  0  0   1  0  4 14 10     0
MatPtAP                5 1.0 2.0556e-01 1.0 1.25e+07 1.1 1.9e+05 2.5e+03 8.2e+01  0  0  0  0  1   0 53 21 12 16 11962
MatPtAPSymbolic        5 1.0 1.1618e-01 1.0 0.00e+00 0.0 9.4e+04 2.5e+03 3.5e+01  0  0  0  0  0   0  0 10  6  7     0
MatPtAPNumeric         5 1.0 8.9715e-02 1.0 1.25e+07 1.1 9.2e+04 2.5e+03 4.5e+01  0  0  0  0  0   0 53 10  6  9 27408
MatGetLocalMat         5 1.0 2.8648e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetBrAoCol          5 1.0 1.3785e-02 3.5 0.00e+00 0.0 5.3e+04 3.3e+03 0.0e+00  0  0  0  0  0   0  0  6  4  0     0
SFSetGraph            10 1.0 1.0729e-04 4.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetUp               10 1.0 6.7554e-03 1.1 0.00e+00 0.0 7.3e+04 5.4e+02 0.0e+00  0  0  0  0  0   0  0  8  1  0     0
SFBcastBegin          39 1.0 1.5974e-03 1.3 0.00e+00 0.0 1.5e+05 6.4e+02 0.0e+00  0  0  0  0  0   0  0 17  2  0     0
SFBcastEnd            39 1.0 1.0992e-02 7.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
GAMG: createProl       5 1.0 1.0922e+02 1.0 0.00e+00 0.0 5.5e+05 5.1e+03 2.5e+02 75  0  0  2  2  99  0 62 71 48     0
GAMG: partLevel        5 1.0 2.1535e-01 1.0 1.25e+07 1.1 1.9e+05 2.5e+03 1.9e+02  0  0  0  0  2   0 53 21 12 36 11418
  repartition          2 1.0 8.9693e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  0   0  0  0  0  2     0
  Invert-Sort          2 1.0 9.0194e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  2     0
  Move A               2 1.0 4.0240e-03 1.1 0.00e+00 0.0 1.3e+03 9.9e+02 3.6e+01  0  0  0  0  0   0  0  0  0  7     0
  Move P               2 1.0 3.3162e-03 1.2 0.00e+00 0.0 2.3e+03 1.5e+01 3.6e+01  0  0  0  0  0   0  0  0  0  7     0
PCSetUp                2 1.0 1.0945e+02 1.0 1.25e+07 1.1 7.4e+05 4.5e+03 4.7e+02 75  0  1  2  4  99 53 83 82 90    22
PCSetUpOnBlocks        2 1.0 4.7374e-04 4.7 4.40e+03 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     9
PCApply                2 1.0 3.8466e-02 1.4 9.62e+06 1.1 1.1e+05 1.0e+03 4.0e+00  0  0  0  0  0   0 42 13  3  1 50864

--- Event Stage 2: Remaining Solves

KSPSolve            1000 1.0 3.5565e+01 1.0 1.07e+10 1.1 1.2e+08 1.2e+03 1.0e+04 24100 99 97 95  99100100100100 61811
VecTDot             3000 1.0 2.1074e+00 1.2 1.62e+08 1.0 0.0e+00 0.0e+00 3.0e+03  1  2  0  0 28   5  2  0  0 30 16604
VecNorm             3000 1.0 1.1686e+00 1.1 1.62e+08 1.0 0.0e+00 0.0e+00 3.0e+03  1  2  0  0 28   3  2  0  0 30 29945
VecScale           10000 1.0 6.4481e-02 1.6 2.97e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 92068
VecCopy             1000 1.0 8.6886e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet             36000 1.0 4.7016e-01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY             2000 1.0 1.6832e-01 1.5 1.08e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 138593
VecAYPX            11000 1.0 2.9297e-01 1.3 1.12e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0  0 81352
VecScatterBegin    42000 1.0 2.1308e+00 1.6 0.00e+00 0.0 1.2e+08 1.2e+03 0.0e+00  1  0 99 97  0   4  0100100  0     0
VecScatterEnd      42000 1.0 9.5539e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0  22  0  0  0  0     0
MatMult            12000 1.0 7.7250e+00 1.2 3.07e+09 1.1 3.8e+07 1.8e+03 0.0e+00  5 28 32 47  0  19 29 32 48  0 81124
MatMultAdd         10000 1.0 4.8386e+00 1.6 6.98e+08 1.0 2.2e+07 5.6e+02 0.0e+00  3  7 19  9  0  11  7 19  9  0 30487
MatMultTranspose   10000 1.0 4.4888e+00 1.5 6.98e+08 1.0 2.2e+07 5.6e+02 0.0e+00  3  7 19  9  0  10  7 19  9  0 32863
MatSolve            2000 0.0 1.1715e-02 0.0 1.41e+06 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   120
MatSOR             20000 1.0 1.7121e+01 1.1 5.72e+09 1.2 3.5e+07 1.4e+03 4.0e+03 11 53 30 33 38  45 53 30 34 40 67719
MatResidual        10000 1.0 6.4821e+00 1.2 2.45e+09 1.1 3.5e+07 1.4e+03 0.0e+00  4 22 30 33  0  16 22 30 34  0 76062
PCSetUpOnBlocks     2000 1.0 2.6277e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             2000 1.0 3.0727e+01 1.0 9.57e+09 1.1 1.1e+08 1.0e+03 4.0e+03 21 88 97 84 38  85 89 98 86 40 63382
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

       Krylov Solver     1              8        10120     0.
     DMKSP interface     1              1          656     0.
              Vector     4             45      2445008     0.
              Matrix     0             59     15632384     0.
    Distributed Mesh     1              1         5248     0.
           Index Set     2             14       254340     0.
   IS L to G Mapping     1              1       131728     0.
   Star Forest Graph     2              2         1728     0.
     Discrete System     1              1          932     0.
         Vec Scatter     1             12       231168     0.
      Preconditioner     1              8         8692     0.
              Viewer     1              2         1680     0.
   Application Order     0              1     46656664     0.

--- Event Stage 1: First Solve

       Krylov Solver     7              0            0     0.
              Vector   112             71      2077080     0.
              Matrix   124             65     40701300     0.
      Matrix Coarsen     5              5         3180     0.
           Index Set   104             92     24404092     0.
   Star Forest Graph    10             10         8640     0.
         Vec Scatter    28             17        21488     0.
      Preconditioner     7              0            0     0.
              Viewer     2              0            0     0.
   Application Order     1              0            0     0.

--- Event Stage 2: Remaining Solves

              Vector 10000          10000    645984000     0.
========================================================================================================================
Average time to get PetscTime(): 6.19888e-07
Average time for MPI_Barrier(): 8.96454e-06
Average time for zero size MPI_Send(): 6.52781e-06
#PETSc Option Table entries:
-gamg_est_ksp_type cg
-iterations 1000
-ksp_norm_type unpreconditioned
-ksp_rtol 1E-6
-ksp_type cg
-log_view
-mat_view ::load_balance
-mesh_size 1E-4
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-nodes_per_proc 30
-pc_gamg_type classical
-pc_type gamg
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=no --COPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --CXXOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --FOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --with-openmp=1 --download-sowing --download-fblaslapack=1 --download-scalapack=1 --download-metis=1 --download-parmetis=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --PETSC_ARCH=intel-bdw-opt --PETSC_DIR=/home/jczhang/petsc
-----------------------------------------
Libraries compiled on 2018-06-05 18:40:55 on beboplogin2 
Machine characteristics: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Using PETSc directory: /home/jczhang/petsc
Using PETSc arch: intel-bdw-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp  
Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O3 -DPETSC_KERNEL_USE_UNROLL_4  -fopenmp   
-----------------------------------------

Using include paths: -I/home/jczhang/petsc/include -I/home/jczhang/petsc/intel-bdw-opt/include
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -lpetsc -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 
-L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib/debug_mt -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib -lscalapack -lflapack -lfblas -lparmetis -lmetis -lm -lX11 -lstdc++ -ldl -lmpifort -lmpi -lmpigi -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
-------------- next part --------------
using 216 of 216 processes
30^3 unknowns per processor
total system size: 180^3
mesh size: 0.0001

initsolve: 9 iterations
solve 1: 9 iterations
solve 2: 9 iterations
solve 3: 9 iterations
solve 4: 9 iterations
solve 5: 9 iterations
solve 6: 9 iterations
solve 7: 9 iterations
solve 8: 9 iterations
solve 9: 9 iterations
solve 10: 9 iterations
solve 20: 9 iterations
solve 30: 9 iterations
solve 40: 9 iterations
solve 50: 9 iterations
solve 60: 9 iterations
solve 70: 9 iterations
solve 80: 9 iterations
solve 90: 9 iterations
solve 100: 9 iterations
solve 200: 9 iterations
solve 300: 9 iterations
solve 400: 9 iterations
solve 500: 9 iterations
solve 600: 9 iterations
solve 700: 9 iterations
solve 800: 9 iterations
solve 900: 9 iterations
solve 1000: 9 iterations

Time in solve():      157.375 s
Time in KSPSolve():   157.136 s (99.8483%)

Number of   KSP iterations (total): 9000
Number of solve iterations (total): 1000 (ratio: 9.00)

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./wstest on a intel-bdw-opt named bdwd-0016 with 216 processors, by jczhang Tue Jun 12 15:33:46 2018
Using Petsc Development GIT revision: v3.9.2-570-g68f20b90  GIT Date: 2018-06-04 15:39:16 +0200

                         Max       Max/Min        Avg      Total 
Time (sec):           1.636e+02      1.00001   1.636e+02
Objects:              3.650e+04      1.00003   3.650e+04
Flop:                 5.496e+10      1.22075   5.177e+10  1.118e+13
Flop/sec:            3.359e+08      1.22075   3.164e+08  6.835e+10
MPI Messages:         2.993e+06      6.33817   1.352e+06  2.921e+08
MPI Message Lengths:  4.193e+09      2.87396   2.166e+03  6.326e+11
MPI Reductions:       4.771e+04      1.00006

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 1.3931e-01   0.1%  0.0000e+00   0.0%  2.160e+03   0.0%  1.802e+03        0.0%  1.700e+01   0.0% 
 1:     First Solve: 6.0653e+00   3.7%  2.1068e+10   0.2%  7.694e+05   0.3%  3.264e+03        0.4%  6.818e+02   1.4% 
 2: Remaining Solves: 1.5739e+02  96.2%  1.1160e+13  99.8%  2.913e+08  99.7%  2.163e+03       99.6%  4.700e+04  98.5% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                 2 1.0 5.7220e-05 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: First Solve

BuildTwoSided          4 1.0 1.1373e-02 9.3 0.00e+00 0.0 4.6e+03 4.0e+00 0.0e+00  0  0  0  0  0   0  0  1  0  0     0
BuildTwoSidedF        38 1.0 1.9472e-01 3.9 0.00e+00 0.0 2.0e+04 2.3e+04 0.0e+00  0  0  0  0  0   2  0  3 18  0     0
KSPSetUp              11 1.0 1.0682e-02 7.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.4e+01  0  0  0  0  0   0  0  0  0  2     0
KSPSolve               1 1.0 6.0648e+00 1.0 1.06e+08 1.3 7.7e+05 3.3e+03 6.8e+02  4  0  0  0  1 100100100100100  3474
VecTDot              102 1.0 2.2696e-02 2.4 2.48e+06 1.0 0.0e+00 0.0e+00 1.0e+02  0  0  0  0  0   0  3  0  0 15 23428
VecNorm               11 1.0 3.3295e-03 1.4 5.94e+05 1.0 0.0e+00 0.0e+00 1.1e+01  0  0  0  0  0   0  1  0  0  2 38535
VecScale              36 1.0 3.5572e-04 3.8 1.36e+05 2.3 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 61994
VecCopy                9 1.0 3.7885e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               184 1.0 2.0907e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               98 1.0 3.5741e-03 1.2 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  2  0  0  0 144490
VecAYPX               81 1.0 3.3052e-03 1.6 1.43e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  1  0  0  0 92586
VecAssemblyBegin      12 1.0 1.3266e-03 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd        12 1.0 2.3293e-04 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecPointwiseMult      44 1.0 1.3185e-03 1.3 3.95e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 63923
VecScatterBegin      210 1.0 1.0943e-02 2.5 0.00e+00 0.0 4.1e+05 2.2e+03 0.0e+00  0  0  0  0  0   0  0 54 35  0     0
VecScatterEnd        210 1.0 6.8774e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecSetRandom           4 1.0 1.3020e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult               86 1.0 6.3114e-02 1.4 2.88e+07 1.3 1.8e+05 2.4e+03 0.0e+00  0  0  0  0  0   1 27 24 18  0 91663
MatMultAdd            36 1.0 4.4785e-02 4.5 4.10e+06 1.3 5.8e+04 1.7e+03 0.0e+00  0  0  0  0  0   0  4  8  4  0 18261
MatMultTranspose      36 1.0 2.4255e-02 2.4 4.10e+06 1.3 5.8e+04 1.7e+03 0.0e+00  0  0  0  0  0   0  4  8  4  0 33718
MatSolve               9 0.0 1.1253e-04 0.0 8.89e+04 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   790
MatSOR                72 1.0 1.2917e-01 1.3 2.79e+07 1.2 8.2e+04 2.1e+03 1.8e+01  0  0  0  0  0   2 27 11  7  3 43962
MatLUFactorSym         1 1.0 1.4400e-0412.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatLUFactorNum         1 1.0 2.4796e-0461.2 2.28e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   919
MatConvert             8 1.0 5.9213e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  0  10  0  0  0  2     0
MatScale              12 1.0 5.9452e-03 1.6 1.82e+06 1.3 9.1e+03 2.1e+03 0.0e+00  0  0  0  0  0   0  2  1  1  0 61111
MatResidual           36 1.0 3.5977e-02 2.1 1.23e+07 1.3 8.2e+04 2.1e+03 0.0e+00  0  0  0  0  0   0 12 11  7  0 68156
MatAssemblyBegin      91 1.0 2.0844e-01 2.8 0.00e+00 0.0 2.0e+04 2.3e+04 0.0e+00  0  0  0  0  0   2  0  3 18  0     0
MatAssemblyEnd        91 1.0 2.1315e-01 2.0 0.00e+00 0.0 1.2e+05 3.0e+02 2.0e+02  0  0  0  0  0   2  0 15  1 29     0
MatGetRow         125604 1.1 2.0258e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  32  0  0  0  0     0
MatGetRowIJ            1 0.0 2.6941e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCreateSubMat        8 1.0 2.6344e-01 1.0 0.00e+00 0.0 4.8e+04 1.2e+04 1.3e+02  0  0  0  0  0   4  0  6 23 19     0
MatGetOrdering         1 0.0 7.4863e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatPartitioning        4 1.0 7.2333e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.8e+00  0  0  0  0  0  12  0  0  0  1     0
MatCoarsen             4 1.0 1.9829e-02 2.0 0.00e+00 0.0 9.8e+04 7.1e+02 2.7e+01  0  0  0  0  0   0  0 13  3  4     0
MatZeroEntries         4 1.0 2.1191e-03 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY                4 1.0 1.2023e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  20  0  0  0  0     0
MatMatMult             4 1.0 1.2128e-01 1.0 1.37e+06 1.3 5.7e+04 1.1e+03 5.0e+01  0  0  0  0  0   2  1  7  2  7  2206
MatMatMultSym          4 1.0 1.0835e-01 1.0 0.00e+00 0.0 4.8e+04 8.7e+02 4.8e+01  0  0  0  0  0   2  0  6  2  7     0
MatMatMultNum          4 1.0 9.3360e-03 1.0 1.37e+06 1.3 9.1e+03 2.1e+03 0.0e+00  0  0  0  0  0   0  1  1  1  0 28654
MatPtAP                4 1.0 3.1024e-01 1.0 3.11e+07 1.5 1.1e+05 7.7e+03 6.2e+01  0  0  0  0  0   5 27 15 35  9 18602
MatPtAPSymbolic        4 1.0 1.7289e-01 1.0 0.00e+00 0.0 5.8e+04 7.4e+03 2.8e+01  0  0  0  0  0   3  0  7 17  4     0
MatPtAPNumeric         4 1.0 1.3619e-01 1.0 3.11e+07 1.5 5.7e+04 8.1e+03 3.2e+01  0  0  0  0  0   2 27  7 18  5 42374
MatGetLocalMat        12 1.0 4.7843e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetBrAoCol         12 1.0 1.0638e-02 1.6 0.00e+00 0.0 6.4e+04 5.7e+03 0.0e+00  0  0  0  0  0   0  0  8 14  0     0
SFSetGraph             4 1.0 3.6716e-0538.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetUp                4 1.0 1.1905e-02 6.4 0.00e+00 0.0 1.4e+04 7.0e+02 0.0e+00  0  0  0  0  0   0  0  2  0  0     0
SFBcastBegin          35 1.0 1.3816e-03 2.8 0.00e+00 0.0 8.4e+04 7.1e+02 0.0e+00  0  0  0  0  0   0  0 11  2  0     0
SFBcastEnd            35 1.0 2.0089e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCGAMGGraph_AGG        4 1.0 2.4202e+00 1.0 1.37e+06 1.3 2.7e+04 1.1e+03 4.8e+01  1  0  0  0  0  40  1  4  1  7   113
PCGAMGCoarse_AGG       4 1.0 2.1739e-02 1.1 0.00e+00 0.0 9.8e+04 7.1e+02 2.7e+01  0  0  0  0  0   0  0 13  3  4     0
PCGAMGProl_AGG         4 1.0 3.5299e-02 1.0 0.00e+00 0.0 2.9e+04 1.5e+03 6.4e+01  0  0  0  0  0   1  0  4  2  9     0
PCGAMGPOpt_AGG         4 1.0 1.3753e+00 1.0 1.91e+07 1.2 1.5e+05 1.7e+03 1.7e+02  1  0  0  0  0  23 18 19 10 24  2804
GAMG: createProl       4 1.0 3.8539e+00 1.0 2.05e+07 1.2 3.0e+05 1.3e+03 3.0e+02  2  0  0  0  1  64 20 39 16 45  1071
  Graph                8 1.0 2.4189e+00 1.0 1.37e+06 1.3 2.7e+04 1.1e+03 4.8e+01  1  0  0  0  0  40  1  4  1  7   113
  MIS/Agg              4 1.0 1.9926e-02 2.0 0.00e+00 0.0 9.8e+04 7.1e+02 2.7e+01  0  0  0  0  0   0  0 13  3  4     0
  SA: col data         4 1.0 1.4641e-02 1.0 0.00e+00 0.0 1.8e+04 2.1e+03 1.6e+01  0  0  0  0  0   0  0  2  2  2     0
  SA: frmProl0         4 1.0 1.9329e-02 1.0 0.00e+00 0.0 1.1e+04 4.1e+02 3.2e+01  0  0  0  0  0   0  0  1  0  5     0
  SA: smooth           4 1.0 1.3270e+00 1.0 1.82e+06 1.3 5.7e+04 1.1e+03 5.8e+01  1  0  0  0  0  22  2  7  2  9   270
GAMG: partLevel        4 1.0 1.9283e+00 1.0 3.11e+07 1.5 1.7e+05 8.5e+03 2.9e+02  1  0  0  0  1  32 27 22 59 43  2993
  repartition          4 1.0 1.3092e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.9e+01  1  0  0  0  0  22  0  0  0  7     0
  Invert-Sort          4 1.0 3.9294e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.6e+01  0  0  0  0  0   1  0  0  0  2     0
  Move A               4 1.0 1.5547e-01 1.1 0.00e+00 0.0 3.5e+04 1.7e+04 6.8e+01  0  0  0  0  0   2  0  5 23 10     0
  Move P               4 1.0 1.1733e-01 1.0 0.00e+00 0.0 1.3e+04 4.3e+02 6.8e+01  0  0  0  0  0   2  0  2  0 10     0
PCSetUp                2 1.0 5.7918e+00 1.0 5.15e+07 1.4 4.8e+05 3.9e+03 6.2e+02  4  0  0  0  1  95 47 62 75 91  1709
PCSetUpOnBlocks        9 1.0 5.4765e-04 5.4 2.28e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   416
PCApply                9 1.0 1.9558e-01 1.1 4.84e+07 1.3 2.8e+05 2.0e+03 1.8e+01  0  0  0  0  0   3 46 36 22  3 49938

--- Event Stage 2: Remaining Solves

KSPSolve            1000 1.0 1.5715e+02 1.0 5.49e+10 1.2 2.9e+08 2.2e+03 4.7e+04 96 100 100 100 99  100 100 100 100 100 71017
VecTDot            18000 1.0 9.7654e+00 1.4 9.72e+08 1.0 0.0e+00 0.0e+00 1.8e+04  5  2  0  0 38   5  2  0  0 38 21499
VecNorm            11000 1.0 3.2415e+00 1.1 5.94e+08 1.0 0.0e+00 0.0e+00 1.1e+04  2  1  0  0 23   2  1  0  0 23 39582
VecScale           36000 1.0 1.9780e-01 1.9 1.36e+08 2.3 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 111486
VecCopy             1000 1.0 3.3195e-01 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet            135000 1.0 1.5111e+00 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPY            18000 1.0 1.2813e+00 1.3 9.72e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0 163865
VecAYPX            45000 1.0 1.5992e+00 1.3 7.82e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0 105117
VecScatterBegin   154000 1.0 7.6855e+00 2.7 0.00e+00 0.0 2.9e+08 2.2e+03 0.0e+00  3  0 100 100  0   3  0 100 100  0     0
VecScatterEnd     154000 1.0 4.1504e+01 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 19  0  0  0  0  20  0  0  0  0     0
MatMult            46000 1.0 3.4541e+01 1.4 1.55e+10 1.2 9.3e+07 2.7e+03 0.0e+00 17 28 32 40  0  18 28 32 40  0 90831
MatMultAdd         36000 1.0 2.6609e+01 2.3 4.10e+09 1.3 5.8e+07 1.7e+03 0.0e+00 12  7 20 16  0  12  7 20 16  0 30735
MatMultTranspose   36000 1.0 2.0506e+01 1.8 4.10e+09 1.3 5.8e+07 1.7e+03 0.0e+00 10  7 20 16  0  10  7 20 16  0 39881
MatSolve            9000 0.0 1.1517e-01 0.0 8.89e+07 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   772
MatSOR             72000 1.0 7.8436e+01 1.2 2.79e+10 1.2 8.2e+07 2.1e+03 1.8e+04 45 51 28 28 38  47 51 28 28 38 72300
MatResidual        36000 1.0 2.9426e+01 1.5 1.23e+10 1.3 8.2e+07 2.1e+03 0.0e+00 14 22 28 28  0  15 22 28 28  0 83328
PCSetUpOnBlocks     9000 1.0 1.2507e-01 11.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             9000 1.0 1.3756e+02 1.0 4.83e+10 1.3 2.8e+08 2.0e+03 1.8e+04 84 87 96 87 38  87 87 96 88 38 70939
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

       Krylov Solver     1              7         8816     0.
     DMKSP interface     1              1          656     0.
              Vector     4             38      2203656     0.
              Matrix     0             36     21544876     0.
    Distributed Mesh     1              1         5248     0.
           Index Set     2             14     14769980     0.
   IS L to G Mapping     1              1       131728     0.
   Star Forest Graph     2              2         1728     0.
     Discrete System     1              1          932     0.
         Vec Scatter     1             10       228640     0.
      Preconditioner     1              7         7448     0.
              Viewer     1              0            0     0.

--- Event Stage 1: First Solve

       Krylov Solver    10              4         6400     0.
              Vector   172            138      6448200     0.
              Matrix   141            105     37914116     0.
 Matrix Partitioning     4              4         2624     0.
      Matrix Coarsen     4              4         2544     0.
           Index Set   102             90     14955164     0.
   Star Forest Graph     4              4         3456     0.
         Vec Scatter    33             24        77312     0.
      Preconditioner    10              4         3424     0.
         PetscRandom     8              8         5168     0.

--- Event Stage 2: Remaining Solves

              Vector 36000          36000   2602584000     0.
========================================================================================================================
Average time to get PetscTime(): 6.19888e-07
Average time for MPI_Barrier(): 9.20296e-06
Average time for zero size MPI_Send(): 2.39677e-05
#PETSc Option Table entries:
-gamg_est_ksp_type cg
-iterations 1000
-ksp_norm_type unpreconditioned
-ksp_rtol 1E-6
-ksp_type cg
-log_view
-mesh_size 1E-4
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-nodes_per_proc 30
-pc_gamg_agg_nsmooths 1
-pc_gamg_coarse_eq_limit 1000
-pc_gamg_repartition
-pc_gamg_reuse_interpolation true
-pc_gamg_square_graph 0
-pc_gamg_threshold 0.0
-pc_gamg_threshold_scale .0
-pc_gamg_type agg
-pc_type gamg
#End of PETSc Option Table entries
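
For reference, all of the solver options in the table above are picked up at run time through KSPSetFromOptions(); nothing is hard-coded in the driver. The sketch below is not the actual benchmark (which assembles a DMDA 3D 7-point stencil operator with the boundary treatment described earlier in the thread); it is only a minimal toy driver, with a 1D Laplacian stand-in and error checking omitted, showing how such an option table is consumed:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 1000, Istart, Iend;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Toy 1D Laplacian as a stand-in for the real DMDA 3D 7-point operator */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);   /* consumes -ksp_type cg, -pc_type gamg, -mg_levels_*, ... */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

Options such as -iterations, -mesh_size and -nodes_per_proc are presumably read by the benchmark itself rather than by PETSc.
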
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=no --COPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --CXXOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --FOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --with-openmp=1 --download-sowing --download-fblaslapack=1 --download-scalapack=1 --download-metis=1 --download-parmetis=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --PETSC_ARCH=intel-bdw-opt --PETSC_DIR=/home/jczhang/petsc
-----------------------------------------
Libraries compiled on 2018-06-05 18:40:55 on beboplogin2 
Machine characteristics: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Using PETSc directory: /home/jczhang/petsc
Using PETSc arch: intel-bdw-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp  
Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O3 -DPETSC_KERNEL_USE_UNROLL_4  -fopenmp   
-----------------------------------------

Using include paths: -I/home/jczhang/petsc/include -I/home/jczhang/petsc/intel-bdw-opt/include
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -lpetsc -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 
-L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib/debug_mt -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib -lscalapack -lflapack -lfblas -lparmetis -lmetis -lm -lX11 -lstdc++ -ldl -lmpifort -lmpi -lmpigi -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
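
A note on how the three sections of the event log are produced: "Main Stage" is PETSc's default logging stage, while "First Solve" and "Remaining Solves" are user-registered stages, which is what separates the one-time GAMG setup cost from the 1000 repeated solves. Continuing the toy driver sketched above (the stage names match the log, but the exact code in the benchmark is assumed):

PetscLogStage stage_first, stage_rest;
PetscInt      it;

PetscLogStageRegister("First Solve", &stage_first);
PetscLogStageRegister("Remaining Solves", &stage_rest);

PetscLogStagePush(stage_first);
KSPSolve(ksp, b, x);             /* PCSetUp, GAMG: createProl, etc. are logged in this stage */
PetscLogStagePop();

PetscLogStagePush(stage_rest);
for (it = 0; it < 1000; it++) {  /* -iterations 1000 */
  KSPSolve(ksp, b, x);           /* no PCSetUp appears here: the GAMG hierarchy is reused */
}
PetscLogStagePop();

Running with -log_view then prints a separate event table per stage, as above.
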

