[petsc-dev] [petsc-users] Poor weak scaling when solving successive linear systems
Junchao Zhang
jczhang at mcs.anl.gov
Thu Jun 7 17:52:57 CDT 2018
OK, I had thought that space was a typo. By the way, this option does not show
up in -h.
I changed the number of ranks to use all cores on each node, to avoid
misleading ratios in -log_view. Since one node has 36 cores, I ran with
6^3 = 216 ranks and 12^3 = 1728 ranks. I also found that the call counts of
MatSOR etc. in the two tests were different, so they are not strict weak
scaling tests. I tried adding -ksp_max_it 6 -pc_mg_levels 6, but still could
not make the two runs have the same MatSOR count. Anyway, I attached the load
balance output.
I find that PCApply_MG calls PCMGMCycle_Private, which is recursive and
indirectly calls MatSOR_MPIAIJ. I believe the following code in
MatSOR_MPIAIJ (around line 1460 of mpiaij.c) effectively synchronizes
{MatSOR, MatMultAdd}_SeqAIJ across processes through a VecScatter at each MG
level. If SOR and MatMultAdd are imbalanced, the cost accumulates across the
MG levels and shows up as a large VecScatter cost.
  while (its--) {
    VecScatterBegin(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);
    VecScatterEnd(mat->Mvctx,xx,mat->lvec,INSERT_VALUES,SCATTER_FORWARD);

    /* update rhs: bb1 = bb - B*x */
    VecScale(mat->lvec,-1.0);
    (*mat->B->ops->multadd)(mat->B,mat->lvec,bb,bb1);

    /* local sweep */
    (*mat->A->ops->sor)(mat->A,bb1,omega,SOR_SYMMETRIC_SWEEP,fshift,lits,1,xx);
  }
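To make the accumulation concrete, here is a toy sketch (plain MPI, not PETSc
code; usleep() stands in for imbalanced local smoothing work). Every "level"
ends in an exchange that all ranks must reach, so the slowest rank at each
level adds to everyone else's wait, and those waits pile up across the levels:

  /* toy model: imbalanced per-level work followed by a level-wide sync */
  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
    int    rank, nlevels = 6;
    double twork = 0.0, twait = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int lev = 0; lev < nlevels; lev++) {
      double t0 = MPI_Wtime();
      usleep(1000 * (1 + rank % 4));   /* stand-in for imbalanced MatSOR/MatMultAdd work */
      double t1 = MPI_Wtime();
      MPI_Barrier(MPI_COMM_WORLD);     /* stand-in for the VecScatter all ranks must reach */
      double t2 = MPI_Wtime();
      twork += t1 - t0;
      twait += t2 - t1;
    }
    printf("rank %d: work %.3f s, accumulated wait %.3f s\n", rank, twork, twait);
    MPI_Finalize();
    return 0;
  }

Ranks that finish their local work early accumulate the difference as wait
time at every level, which in the real code is logged under VecScatter rather
than under MatSOR.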
--Junchao Zhang
On Thu, Jun 7, 2018 at 3:11 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
>
> > On Jun 7, 2018, at 12:27 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
> >
> > Searched but could not find this option, -mat_view::load_balance
>
> There is a space between the "-mat_view" and the "::"; load_balance is a
> particular viewer format that causes the printing of load-balance
> information about the number of nonzeros in the matrix.
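> For example, appended to the existing run line, the option would look
> something like this (illustrative):
>
>    mpirun -n 125 ./ws_test -nodes_per_proc 30 -iterations 1000 \
>           -pc_type gamg -mat_view ::load_balance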
>
> Barry
>
> >
> > --Junchao Zhang
> >
> > On Thu, Jun 7, 2018 at 10:46 AM, Smith, Barry F. <bsmith at mcs.anl.gov>
> wrote:
> > So the only surprise in the results is the SOR. It is embarrassingly
> > parallel and normally one would not see a jump.
> >
> > The load balance for SOR time, 1.5, is better at 1000 processes than the
> > 2.1 at 125 processes, not worse, so this number doesn't easily explain it.
> >
> > Could you run the 125 and 1000 with -mat_view ::load_balance and see
> what you get out?
> >
> > Thanks
> >
> > Barry
> >
> > Notice that the MatSOR time jumps a lot, about 5 seconds, when -log_sync
> > is on. My only guess is that the MatSOR is sharing memory bandwidth (or
> > some other resource? cores?) with the VecScatter, and for some reason this
> > is worse for 1000 cores, but I don't know why.
> >
> > > On Jun 6, 2018, at 9:13 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> > >
> > > Hi, PETSc developers,
> > > I tested Michael Becker's code. The code calls the same KSPSolve 1000
> > > times in the second stage and needs a cubic number of processes to run. I
> > > ran with 125 ranks and 1000 ranks, with and without the -log_sync option.
> > > I attach the log_view output files and a scaling-loss Excel file.
> > > I profiled the code with 125 processors. It looks like {MatSOR, MatMult,
> > > MatMultAdd, MatMultTranspose, MatMultTransposeAdd}_SeqAIJ in aij.c took
> > > ~50% of the time; the other half was spent waiting in MPI. MatSOR_SeqAIJ
> > > took 30%, mostly in PetscSparseDenseMinusDot().
> > > I tested it on a 36-core/node machine. I found that 32 ranks/node gave
> > > better performance (about 10%) than 36 ranks/node in the 125-rank test.
> > > I guess this is because the processors in the former case had more
> > > balanced memory bandwidth. I collected PAPI_DP_OPS (double precision
> > > operations) and PAPI_TOT_CYC (total cycles) for the 125-rank case (see
> > > the attached files). It looks like ranks at the two ends have fewer
> > > DP_OPS and TOT_CYC.
> > > Does anyone familiar with the algorithm have a quick explanation?
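> > > For reference, a minimal sketch of how such counters can be read per
> > > rank with the PAPI low-level API (error checking omitted; the actual
> > > instrumentation may have been set up differently):
> > >
> > >   #include <papi.h>
> > >   #include <stdio.h>
> > >
> > >   int main(void)
> > >   {
> > >     int       EventSet = PAPI_NULL;
> > >     long long values[2];
> > >
> > >     PAPI_library_init(PAPI_VER_CURRENT);
> > >     PAPI_create_eventset(&EventSet);
> > >     PAPI_add_event(EventSet, PAPI_DP_OPS);   /* double precision operations */
> > >     PAPI_add_event(EventSet, PAPI_TOT_CYC);  /* total cycles */
> > >
> > >     PAPI_start(EventSet);
> > >     /* ... region of interest, e.g. the repeated KSPSolve() calls ... */
> > >     PAPI_stop(EventSet, values);
> > >
> > >     printf("DP_OPS = %lld, TOT_CYC = %lld\n", values[0], values[1]);
> > >     return 0;
> > >   }
> > >
> > > In an MPI run each rank executes this and writes its own pair of counts,
> > > which is the kind of per-rank data plotted in the attached files.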
> > >
> > > --Junchao Zhang
> > >
> > > On Mon, Jun 4, 2018 at 11:59 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> > > Hello again,
> > >
> > > this took me longer than I anticipated, but here we go.
> > > I did reruns of the cases where only half the processes per node were
> used (without -log_sync):
> > >
> > >                    125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
> > >                    Max        Ratio   Max        Ratio   Max        Ratio   Max        Ratio
> > > KSPSolve           1.203E+02  1.0     1.210E+02  1.0     1.399E+02  1.1     1.365E+02  1.0
> > > VecTDot            6.376E+00  3.7     6.551E+00  4.0     7.885E+00  2.9     7.175E+00  3.4
> > > VecNorm            4.579E+00  7.1     5.803E+00  10.2    8.534E+00  6.9     6.026E+00  4.9
> > > VecScale           1.070E-01  2.1     1.129E-01  2.2     1.301E-01  2.5     1.270E-01  2.4
> > > VecCopy            1.123E-01  1.3     1.149E-01  1.3     1.301E-01  1.6     1.359E-01  1.6
> > > VecSet             7.063E-01  1.7     6.968E-01  1.7     7.432E-01  1.8     7.425E-01  1.8
> > > VecAXPY            1.166E+00  1.4     1.167E+00  1.4     1.221E+00  1.5     1.279E+00  1.6
> > > VecAYPX            1.317E+00  1.6     1.290E+00  1.6     1.536E+00  1.9     1.499E+00  2.0
> > > VecScatterBegin    6.142E+00  3.2     5.974E+00  2.8     6.448E+00  3.0     6.472E+00  2.9
> > > VecScatterEnd      3.606E+01  4.2     3.551E+01  4.0     5.244E+01  2.7     4.995E+01  2.7
> > > MatMult            3.561E+01  1.6     3.403E+01  1.5     3.435E+01  1.4     3.332E+01  1.4
> > > MatMultAdd         1.124E+01  2.0     1.130E+01  2.1     2.093E+01  2.9     1.995E+01  2.7
> > > MatMultTranspose   1.372E+01  2.5     1.388E+01  2.6     1.477E+01  2.2     1.381E+01  2.1
> > > MatSolve           1.949E-02  0.0     1.653E-02  0.0     4.789E-02  0.0     4.466E-02  0.0
> > > MatSOR             6.610E+01  1.3     6.673E+01  1.3     7.111E+01  1.3     7.105E+01  1.3
> > > MatResidual        2.647E+01  1.7     2.667E+01  1.7     2.446E+01  1.4     2.467E+01  1.5
> > > PCSetUpOnBlocks    5.266E-03  1.4     5.295E-03  1.4     5.427E-03  1.5     5.289E-03  1.4
> > > PCApply            1.031E+02  1.0     1.035E+02  1.0     1.180E+02  1.0     1.164E+02  1.0
> > >
> > > I also slimmed down my code and basically wrote a simple weak scaling
> > > test (source files attached) so you can profile it yourself. I appreciate
> > > the offer, Junchao, thank you.
> > > You can adjust the system size per processor at runtime via
> > > "-nodes_per_proc 30" and the number of repeated calls to the function
> > > containing KSPSolve() via "-iterations 1000". The physical problem is
> > > simply calculating the electric potential from a homogeneous charge
> > > distribution, done multiple times to accumulate time in KSPSolve().
> > > A job would be started using something like
> > > mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30 -mesh_size 1E-4
> -iterations 1000 \
> > > -ksp_rtol 1E-6 \
> > > -log_view -log_sync\
> > > -pc_type gamg -pc_gamg_type classical\
> > > -ksp_type cg \
> > > -ksp_norm_type unpreconditioned \
> > > -mg_levels_ksp_type richardson \
> > > -mg_levels_ksp_norm_type none \
> > > -mg_levels_pc_type sor \
> > > -mg_levels_ksp_max_it 1 \
> > > -mg_levels_pc_sor_its 1 \
> > > -mg_levels_esteig_ksp_type cg \
> > > -mg_levels_esteig_ksp_max_it 10 \
> > > -gamg_est_ksp_type cg
> > > (ideally started on a cube number of processes for a cubical process grid).
> > > Using 125 processes and 10,000 iterations I get the output in
> > > "log_view_125_new.txt", which shows the same imbalance for me.
> > > Michael
> > >
> > >
> > > On 02.06.2018 at 13:40, Mark Adams wrote:
> > >>
> > >>
> > >> On Fri, Jun 1, 2018 at 11:20 PM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
> > >> Hi,Michael,
> > >> You can add -log_sync besides -log_view; it adds barriers to certain
> > >> events but measures the barrier time separately from the events. I find
> > >> this option makes it easier to interpret -log_view output.
> > >>
> > >> That is great (good to know).
> > >>
> > >> This should give us a better idea whether your large VecScatter costs
> > >> come from slow communication or whether it is catching some sort of load
> > >> imbalance.
> > >>
> > >>
> > >> --Junchao Zhang
> > >>
> > >> On Wed, May 30, 2018 at 3:27 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> > >> Barry: On its way. Could take a couple days again.
> > >>
> > >> Junchao: I unfortunately don't have access to a cluster with a faster
> network. This one has a mixed 4X QDR-FDR InfiniBand 2:1 blocking fat-tree
> network, which I realize causes parallel slowdown if the nodes are not
> connected to the same switch. Each node has 24 processors (2x12/socket) and
> four NUMA domains (two for each socket).
> > >> The ranks are usually not distributed perfectly evenly, i.e. for 125
> > >> processes on the six required nodes, five nodes would use 21 cores and
> > >> one would use 20.
> > >> Would using another CPU type make a difference communication-wise? I
> could switch to faster ones (on the same network), but I always assumed
> this would only improve performance of the stuff that is unrelated to
> communication.
> > >>
> > >> Michael
> > >>
> > >>
> > >>
> > >>> The log files have something like "Average time for zero size
> > >>> MPI_Send(): 1.84231e-05". It looks like you ran on a cluster with a very
> > >>> slow network. A typical machine should give less than 1/10 of the latency
> > >>> you have. An easy thing to try is to run the code on a machine with a
> > >>> faster network and see what happens.
> > >>>
> > >>> Also, how many cores & numa domains does a compute node have? I
> could not figure out how you distributed the 125 MPI ranks evenly.
> > >>>
> > >>> --Junchao Zhang
> > >>>
> > >>> On Tue, May 29, 2018 at 6:18 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
> > >>> Hello again,
> > >>>
> > >>> here are the updated log_view files for 125 and 1000 processors. I
> ran both problems twice, the first time with all processors per node
> allocated ("-1.txt"), the second with only half on twice the number of
> nodes ("-2.txt").
> > >>>
> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de>
> > >>>>> wrote:
> > >>>>>
> > >>>>> I noticed that for every individual KSP iteration, six vector
> objects are created and destroyed (with CG, more with e.g. GMRES).
> > >>>>>
> > >>>> Hmm, it is certainly not intended that vectors be created and
> > >>>> destroyed within each KSPSolve(); could you please point us to the code
> > >>>> that makes you think they are being created and destroyed? We create all
> > >>>> the work vectors at KSPSetUp() and destroy them in KSPReset(), not during
> > >>>> the solve. Not that this would be a measurable difference.
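> > >>>> A minimal sketch of the intended pattern (a fragment, assuming A, b, x
> > >>>> are already created and assembled; error checking omitted):
> > >>>>
> > >>>>   KSP ksp;
> > >>>>   KSPCreate(PETSC_COMM_WORLD,&ksp);
> > >>>>   KSPSetOperators(ksp,A,A);
> > >>>>   KSPSetFromOptions(ksp);
> > >>>>   KSPSetUp(ksp);            /* work vectors are allocated here, once */
> > >>>>   for (PetscInt i = 0; i < 1000; i++) {
> > >>>>     KSPSolve(ksp,b,x);      /* the same work vectors are reused in every call */
> > >>>>   }
> > >>>>   KSPDestroy(&ksp);         /* (or KSPReset()) frees them */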
> > >>>>
> > >>>
> > >>> I mean this, right in the log_view output:
> > >>>
> > >>>> Memory usage is given in bytes:
> > >>>>
> > >>>> Object Type Creations Destructions Memory Descendants' Mem.
> > >>>> Reports information only for process 0.
> > >>>>
> > >>>> --- Event Stage 0: Main Stage
> > >>>>
> > >>>> ...
> > >>>>
> > >>>> --- Event Stage 1: First Solve
> > >>>>
> > >>>> ...
> > >>>>
> > >>>> --- Event Stage 2: Remaining Solves
> > >>>>
> > >>>> Vector 23904 23904 1295501184 0.
> > >>> I logged the exact number of KSP iterations over the 999 timesteps,
> > >>> and it's exactly 23904/6 = 3984.
> > >>> Michael
> > >>>
> > >>>
> > >>> On 24.05.2018 at 19:50, Smith, Barry F. wrote:
> > >>>>
> > >>>> Please send the log file for 1000 with cg as the solver.
> > >>>>
> > >>>> You should make a bar chart of each event for the two cases to
> see which ones are taking more time and which are taking less (we cannot
> tell with the two logs you sent us since they are for different solvers.)
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>> On May 24, 2018, at 12:24 AM, Michael Becker <
> Michael.Becker at physik.uni-giessen.de>
> > >>>>> wrote:
> > >>>>>
> > >>>>> I noticed that for every individual KSP iteration, six vector
> objects are created and destroyed (with CG, more with e.g. GMRES).
> > >>>>>
> > >>>> Hmm, it is certainly not intended that vectors be created and
> > >>>> destroyed within each KSPSolve(); could you please point us to the code
> > >>>> that makes you think they are being created and destroyed? We create all
> > >>>> the work vectors at KSPSetUp() and destroy them in KSPReset(), not during
> > >>>> the solve. Not that this would be a measurable difference.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>> This seems kind of wasteful; is it supposed to be like this? Is
> > >>>>> this even the reason for my problems? Apart from that, everything seems
> > >>>>> quite normal to me (but I'm not the expert here).
> > >>>>>
> > >>>>>
> > >>>>> Thanks in advance.
> > >>>>>
> > >>>>> Michael
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> <log_view_125procs.txt><log_view_1000procs.txt>
> > >>>>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> > > <o-wstest-125.txt><Scaling-loss.png><o-wstest-1000.txt><o-wstest-sync-125.txt><o-wstest-sync-1000.txt><MatSOR_SeqAIJ.png><PAPI_TOT_CYC.png><PAPI_DP_OPS.png>
> >
> >
>
>
-------------- next part --------------
using 216 of 216 processes
30^3 unknowns per processor
total system size: 180^3
mesh size: 0.0001
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 186300 avg 188100 max 189000
Mat Object: 216 MPI processes
type: mpiaij
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 161520 avg 188100 max 188520
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 156360 avg 177577 max 189000
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 75656 avg 87908 max 94500
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 75656 avg 87908 max 94500
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 201530 avg 237200 max 256500
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 201530 avg 237200 max 256500
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 85956 avg 102829 max 111569
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 54571 avg 64151 max 69123
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 84688 avg 107835 max 117713
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 83920 avg 107459 max 117667
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 20241 avg 25363 max 27748
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 6042 avg 7152 max 7637
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 3423 avg 5291 max 5994
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 3047 avg 4938 max 5691
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 1105 avg 1767 max 2171
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 284 avg 475 max 584
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 137 avg 484 max 972
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 484 max 7633
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 284 avg 475 max 584
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 413 max 6197
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 139 max 2244
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 34 max 614
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 24 max 752
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 24 max 5282
Mat Object: 216 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 34 max 614
initsolve: 7 iterations
solve 1: 6 iterations
solve 2: 6 iterations
solve 3: 6 iterations
solve 4: 6 iterations
solve 5: 6 iterations
solve 6: 6 iterations
solve 7: 6 iterations
solve 8: 6 iterations
solve 9: 6 iterations
solve 10: 6 iterations
solve 20: 6 iterations
solve 30: 6 iterations
solve 40: 6 iterations
solve 50: 6 iterations
solve 60: 6 iterations
solve 70: 6 iterations
solve 80: 6 iterations
solve 90: 6 iterations
solve 100: 6 iterations
solve 200: 6 iterations
solve 300: 6 iterations
solve 400: 6 iterations
solve 500: 6 iterations
solve 600: 6 iterations
solve 700: 6 iterations
solve 800: 6 iterations
solve 900: 6 iterations
solve 1000: 6 iterations
Time in solve(): 89.4284 s
Time in KSPSolve(): 89.1823 s (99.7248%)
Number of KSP iterations (total): 6000
Number of solve iterations (total): 1000 (ratio: 6.00)
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./wstest on a intel-bdw-opt named bdw-0140 with 216 processors, by jczhang Thu Jun 7 17:04:25 2018
Using Petsc Development GIT revision: v3.9.2-570-g68f20b90 GIT Date: 2018-06-04 15:39:16 +0200
Max Max/Min Avg Total
Time (sec): 1.916e+02 1.00001 1.916e+02
Objects: 3.044e+04 1.00003 3.044e+04
Flop: 3.177e+10 1.15810 3.035e+10 6.557e+12
Flop/sec: 1.658e+08 1.15810 1.584e+08 3.422e+10
MPI Messages: 1.594e+06 3.50605 1.083e+06 2.339e+08
MPI Message Lengths: 1.961e+09 2.19940 1.466e+03 3.428e+11
MPI Reductions: 3.258e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flop
and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 1.0241e-01 0.1% 0.0000e+00 0.0% 2.160e+03 0.0% 1.802e+03 0.0% 1.700e+01 0.1%
1: First Solve: 1.0204e+02 53.3% 9.8679e+09 0.2% 7.808e+05 0.3% 4.093e+03 0.9% 5.530e+02 1.7%
2: Remaining Solves: 8.9446e+01 46.7% 6.5467e+12 99.8% 2.331e+08 99.7% 1.457e+03 99.1% 3.200e+04 98.2%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecSet 2 1.0 6.4135e-05 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
--- Event Stage 1: First Solve
BuildTwoSided 10 1.0 3.3987e-03 1.7 0.00e+00 0.0 1.6e+04 4.0e+00 0.0e+00 0 0 0 0 0 0 0 2 0 0 0
BuildTwoSidedF 27 1.0 7.8870e+00 3.1 0.00e+00 0.0 1.2e+04 1.1e+04 0.0e+00 2 0 0 0 0 4 0 2 4 0 0
KSPSetUp 8 1.0 2.9860e-03 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.6e+01 0 0 0 0 0 0 0 0 0 3 0
KSPSolve 1 1.0 1.0204e+02 1.0 4.82e+07 1.2 7.8e+05 4.1e+03 5.5e+02 53 0 0 1 2 100100100100100 97
VecTDot 14 1.0 2.9919e-03 2.2 7.56e+05 1.0 0.0e+00 0.0e+00 1.4e+01 0 0 0 0 0 0 2 0 0 3 54578
VecNorm 9 1.0 1.2019e-03 1.8 4.86e+05 1.0 0.0e+00 0.0e+00 9.0e+00 0 0 0 0 0 0 1 0 0 2 87344
VecScale 35 1.0 3.3951e-04 2.7 9.47e+04 2.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 48655
VecCopy 1 1.0 1.0705e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 154 1.0 1.9858e-03 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 14 1.0 9.7609e-04 1.2 7.56e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 2 0 0 0 167297
VecAYPX 42 1.0 1.5566e-03 1.5 6.46e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 0 0 88739
VecAssemblyBegin 2 1.0 4.7922e-0522.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 2.9087e-0530.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 150 1.0 5.4379e-03 2.0 0.00e+00 0.0 2.7e+05 1.5e+03 0.0e+00 0 0 0 0 0 0 0 35 12 0 0
VecScatterEnd 150 1.0 1.9689e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 43 1.0 2.1787e-02 1.2 1.05e+07 1.1 9.2e+04 2.1e+03 0.0e+00 0 0 0 0 0 0 22 12 6 0 99634
MatMultAdd 35 1.0 9.9871e-03 1.5 2.40e+06 1.3 4.8e+04 7.1e+02 0.0e+00 0 0 0 0 0 0 5 6 1 0 48362
MatMultTranspose 35 1.0 1.1008e-02 1.4 2.40e+06 1.3 4.8e+04 7.1e+02 0.0e+00 0 0 0 0 0 0 5 6 1 0 43876
MatSolve 7 0.0 2.2888e-04 0.0 8.72e+04 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 381
MatSOR 70 1.0 5.0331e-02 1.1 1.90e+07 1.2 8.3e+04 1.6e+03 1.4e+01 0 0 0 0 0 0 40 11 4 3 77978
MatLUFactorSym 1 1.0 3.8791e-0428.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 1 1.0 3.1900e-0478.7 3.10e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 973
MatResidual 35 1.0 1.7441e-02 1.3 7.97e+06 1.2 8.3e+04 1.6e+03 0.0e+00 0 0 0 0 0 0 17 11 4 0 93440
MatAssemblyBegin 82 1.0 7.8904e+00 3.1 0.00e+00 0.0 1.2e+04 1.1e+04 0.0e+00 2 0 0 0 0 4 0 2 4 0 0
MatAssemblyEnd 82 1.0 7.4100e-02 1.0 0.00e+00 0.0 1.1e+05 6.2e+02 2.1e+02 0 0 0 0 1 0 0 15 2 38 0
MatGetRow 3100265 1.2 4.7804e+01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 24 0 0 0 0 45 0 0 0 0 0
MatGetRowIJ 1 0.0 3.3140e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCreateSubMats 5 1.0 1.8501e-01 2.3 0.00e+00 0.0 1.0e+05 1.8e+04 1.0e+01 0 0 0 1 0 0 0 13 55 2 0
MatCreateSubMat 5 1.0 2.7853e-01 1.0 0.00e+00 0.0 3.6e+04 1.6e+04 8.4e+01 0 0 0 0 0 0 0 5 18 15 0
MatGetOrdering 1 0.0 1.4496e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatIncreaseOvrlp 5 1.0 3.0473e-02 1.2 0.00e+00 0.0 4.8e+04 1.0e+03 1.0e+01 0 0 0 0 0 0 0 6 2 2 0
MatCoarsen 5 1.0 9.5112e-03 1.1 0.00e+00 0.0 9.2e+04 6.3e+02 3.0e+01 0 0 0 0 0 0 0 12 2 5 0
MatZeroEntries 5 1.0 1.7691e-03 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatView 26 1.0 5.6732e-01 1.0 0.00e+00 0.0 3.3e+04 1.7e+04 5.1e+01 0 0 0 0 0 1 0 4 18 9 0
MatPtAP 5 1.0 1.3221e-01 1.0 1.13e+07 1.3 1.2e+05 2.7e+03 8.2e+01 0 0 0 0 0 0 23 15 10 15 16915
MatPtAPSymbolic 5 1.0 8.2783e-02 1.0 0.00e+00 0.0 6.1e+04 2.8e+03 3.5e+01 0 0 0 0 0 0 0 8 5 6 0
MatPtAPNumeric 5 1.0 4.9810e-02 1.0 1.13e+07 1.3 5.5e+04 2.6e+03 4.5e+01 0 0 0 0 0 0 23 7 4 8 44898
MatGetLocalMat 5 1.0 2.6979e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 5 1.0 4.0371e-03 1.5 0.00e+00 0.0 3.6e+04 3.7e+03 0.0e+00 0 0 0 0 0 0 0 5 4 0 0
SFSetGraph 10 1.0 9.2030e-05 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetUp 10 1.0 5.9166e-03 1.1 0.00e+00 0.0 4.8e+04 6.4e+02 0.0e+00 0 0 0 0 0 0 0 6 1 0 0
SFBcastBegin 40 1.0 1.4107e-03 1.8 0.00e+00 0.0 9.4e+04 7.4e+02 0.0e+00 0 0 0 0 0 0 0 12 2 0 0
SFBcastEnd 40 1.0 2.5785e-03 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
GAMG: createProl 5 1.0 1.0119e+02 1.0 0.00e+00 0.0 3.6e+05 5.4e+03 2.6e+02 53 0 0 1 1 99 0 45 60 46 0
GAMG: partLevel 5 1.0 1.4521e-01 1.0 1.13e+07 1.3 1.2e+05 2.6e+03 1.9e+02 0 0 0 0 1 0 23 15 10 34 15401
repartition 2 1.0 9.1791e-04 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 2 0
Invert-Sort 2 1.0 6.6185e-04 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 1 0
Move A 2 1.0 3.4759e-03 1.1 0.00e+00 0.0 1.5e+03 9.0e+02 3.6e+01 0 0 0 0 0 0 0 0 0 7 0
Move P 2 1.0 7.2892e-03 1.0 0.00e+00 0.0 1.7e+03 1.7e+01 3.6e+01 0 0 0 0 0 0 0 0 0 7 0
PCSetUp 2 1.0 1.0135e+02 1.0 1.13e+07 1.3 4.7e+05 4.7e+03 4.7e+02 53 0 0 1 1 99 23 61 70 85 22
PCSetUpOnBlocks 7 1.0 1.0257e-03 5.0 3.10e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 302
PCApply 7 1.0 8.5200e-02 1.0 3.18e+07 1.2 2.6e+05 1.3e+03 1.4e+01 0 0 0 0 0 0 66 34 10 3 76535
--- Event Stage 2: Remaining Solves
KSPSolve 1000 1.0 8.9193e+01 1.0 3.17e+10 1.2 2.3e+08 1.5e+03 3.2e+04 47100100 99 98 100100100100100 73399
VecTDot 12000 1.0 5.0107e+00 1.3 6.48e+08 1.0 0.0e+00 0.0e+00 1.2e+04 2 2 0 0 37 5 2 0 0 38 27933
VecNorm 8000 1.0 2.0433e+00 1.1 4.32e+08 1.0 0.0e+00 0.0e+00 8.0e+03 1 1 0 0 25 2 1 0 0 25 45667
VecScale 30000 1.0 1.7645e-01 1.7 8.12e+07 2.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 80243
VecCopy 1000 1.0 8.1942e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 108000 1.0 1.3471e+00 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 12000 1.0 8.1873e-01 1.2 6.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 1 2 0 0 0 170957
VecAYPX 36000 1.0 1.1726e+00 1.3 5.50e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 100259
VecScatterBegin 127000 1.0 4.3927e+00 2.1 0.00e+00 0.0 2.3e+08 1.5e+03 0.0e+00 2 0100 99 0 4 0100100 0 0
VecScatterEnd 127000 1.0 2.1218e+01 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 8 0 0 0 0 17 0 0 0 0 0
MatMult 37000 1.0 1.9416e+01 1.2 9.03e+09 1.1 7.9e+07 2.1e+03 0.0e+00 9 29 34 49 0 19 29 34 49 0 96389
MatMultAdd 30000 1.0 1.1328e+01 1.7 2.06e+09 1.3 4.1e+07 7.1e+02 0.0e+00 4 6 18 9 0 10 6 18 9 0 36548
MatMultTranspose 30000 1.0 1.0679e+01 1.6 2.06e+09 1.3 4.1e+07 7.1e+02 0.0e+00 4 6 18 9 0 9 6 18 9 0 38767
MatSolve 6000 0.0 1.0994e-01 0.0 7.48e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 680
MatSOR 60000 1.0 4.4873e+01 1.1 1.63e+10 1.2 7.1e+07 1.6e+03 1.2e+04 22 51 31 33 37 48 51 31 33 38 74798
MatResidual 30000 1.0 1.5853e+01 1.2 6.83e+09 1.2 7.1e+07 1.6e+03 0.0e+00 7 21 31 33 0 16 21 31 33 0 88112
PCSetUpOnBlocks 6000 1.0 9.1378e-02 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
PCApply 6000 1.0 7.7131e+01 1.0 2.72e+10 1.2 2.3e+08 1.3e+03 1.2e+04 40 85 96 83 37 86 85 97 84 38 72361
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 1 8 10120 0.
DMKSP interface 1 1 656 0.
Vector 4 45 2361256 0.
Matrix 0 59 14313348 0.
Distributed Mesh 1 1 5248 0.
Index Set 2 14 247728 0.
IS L to G Mapping 1 1 131728 0.
Star Forest Graph 2 2 1728 0.
Discrete System 1 1 932 0.
Vec Scatter 1 12 231168 0.
Preconditioner 1 8 8692 0.
Viewer 1 2 1680 0.
Application Order 0 1 46656664 0.
--- Event Stage 1: First Solve
Krylov Solver 7 0 0 0.
Vector 137 96 3375264 0.
Matrix 124 65 27659940 0.
Matrix Coarsen 5 5 3180 0.
Index Set 102 90 24085864 0.
Star Forest Graph 10 10 8640 0.
Vec Scatter 28 17 21488 0.
Preconditioner 7 0 0 0.
Viewer 2 0 0 0.
Application Order 1 0 0 0.
--- Event Stage 2: Remaining Solves
Vector 30000 30000 1940160000 0.
========================================================================================================================
Average time to get PetscTime(): 6.19888e-07
Average time for MPI_Barrier(): 1.00136e-05
Average time for zero size MPI_Send(): 6.69007e-06
#PETSc Option Table entries:
-gamg_est_ksp_type cg
-iterations 1000
-ksp_norm_type unpreconditioned
-ksp_rtol 1E-6
-ksp_type cg
-log_view
-mat_view ::load_balance
-mesh_size 1E-4
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-nodes_per_proc 30
-pc_gamg_type classical
-pc_mg_levels 6
-pc_type gamg
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=no --COPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --CXXOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --FOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --with-openmp=1 --download-sowing --download-fblaslapack=1 --download-scalapack=1 --download-metis=1 --download-parmetis=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --PETSC_ARCH=intel-bdw-opt --PETSC_DIR=/home/jczhang/petsc
-----------------------------------------
Libraries compiled on 2018-06-05 18:40:55 on beboplogin2
Machine characteristics: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Using PETSc directory: /home/jczhang/petsc
Using PETSc arch: intel-bdw-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp
Using Fortran compiler: mpif90 -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp
-----------------------------------------
Using include paths: -I/home/jczhang/petsc/include -I/home/jczhang/petsc/intel-bdw-opt/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -lpetsc -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 
-L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib/debug_mt -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib -lscalapack -lflapack -lfblas -lparmetis -lmetis -lm -lX11 -lstdc++ -ldl -lmpifort -lmpi -lmpigi -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
-------------- next part --------------
srun: Warning: can't honor --ntasks-per-node set to 36 which doesn't match the requested tasks 48 with the number of requested nodes 48. Ignoring --ntasks-per-node.
using 1728 of 1728 processes
30^3 unknowns per processor
total system size: 360^3
mesh size: 0.0001
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 186300 avg 188550 max 189000
Mat Object: 1728 MPI processes
type: mpiaij
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 161490 avg 188550 max 188850
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 156360 avg 183219 max 189000
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 75656 avg 91164 max 94500
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 75656 avg 91164 max 94500
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 201530 avg 246725 max 256500
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 201530 avg 246725 max 256500
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 85956 avg 107132 max 111569
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 54571 avg 66550 max 69123
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 84688 avg 112657 max 117713
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 83920 avg 112441 max 117667
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 20241 avg 26366 max 27748
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 6042 avg 7328 max 7637
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 3423 avg 5508 max 5994
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 3047 avg 5197 max 5691
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 1105 avg 1934 max 2180
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 284 avg 479 max 584
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 137 avg 542 max 972
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 542 max 8392
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 284 avg 479 max 584
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 493 max 7084
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 145 max 2349
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 31 max 670
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 24 max 1100
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 24 max 42986
Mat Object: 1728 MPI processes
type: mpiaij
Load Balance - Nonzeros: Min 0 avg 31 max 670
initsolve: 8 iterations
solve 1: 6 iterations
solve 2: 6 iterations
solve 3: 6 iterations
solve 4: 6 iterations
solve 5: 6 iterations
solve 6: 6 iterations
solve 7: 6 iterations
solve 8: 6 iterations
solve 9: 6 iterations
solve 10: 6 iterations
solve 20: 6 iterations
solve 30: 6 iterations
solve 40: 6 iterations
solve 50: 6 iterations
solve 60: 6 iterations
solve 70: 6 iterations
solve 80: 6 iterations
solve 90: 6 iterations
solve 100: 6 iterations
solve 200: 6 iterations
solve 300: 6 iterations
solve 400: 6 iterations
solve 500: 6 iterations
solve 600: 6 iterations
solve 700: 6 iterations
solve 800: 6 iterations
solve 900: 6 iterations
solve 1000: 6 iterations
Time in solve(): 120.025 s
Time in KSPSolve(): 119.738 s (99.7606%)
Number of KSP iterations (total): 6000
Number of solve iterations (total): 1000 (ratio: 6.00)
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./wstest on a intel-bdw-opt named bdw-0545 with 1728 processors, by jczhang Thu Jun 7 17:05:39 2018
Using Petsc Development GIT revision: v3.9.2-570-g68f20b90 GIT Date: 2018-06-04 15:39:16 +0200
Max Max/Min Avg Total
Time (sec): 2.315e+02 1.00001 2.315e+02
Objects: 3.544e+04 1.00003 3.544e+04
Flop: 3.637e+10 1.16136 3.554e+10 6.141e+13
Flop/sec: 1.571e+08 1.16136 1.535e+08 2.653e+11
MPI Messages: 2.226e+06 4.17170 1.509e+06 2.608e+09
MPI Message Lengths: 2.235e+09 2.20450 1.340e+03 3.494e+12
MPI Reductions: 3.560e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flop
and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total Avg %Total counts %Total
0: Main Stage: 8.5928e-02 0.0% 0.0000e+00 0.0% 1.901e+04 0.0% 1.802e+03 0.0% 1.700e+01 0.0%
1: First Solve: 1.1133e+02 48.1% 8.9706e+10 0.1% 8.086e+06 0.3% 3.671e+03 0.8% 5.810e+02 1.6%
2: Remaining Solves: 1.2004e+02 51.9% 6.1318e+13 99.9% 2.600e+09 99.7% 1.332e+03 99.1% 3.500e+04 98.3%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecSet 2 1.0 1.2875e-04 4.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
--- Event Stage 1: First Solve
BuildTwoSided 10 1.0 4.9443e-03 1.6 0.00e+00 0.0 1.6e+05 4.0e+00 0.0e+00 0 0 0 0 0 0 0 2 0 0 0
BuildTwoSidedF 27 1.0 1.1099e+01 4.1 0.00e+00 0.0 1.2e+05 1.1e+04 0.0e+00 2 0 0 0 0 4 0 1 4 0 0
KSPSetUp 8 1.0 1.9672e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.6e+01 0 0 0 0 0 0 0 0 0 3 0
KSPSolve 1 1.0 1.1133e+02 1.0 6.83e+07 1.5 8.1e+06 3.7e+03 5.8e+02 48 0 0 1 2 100100100100100 806
VecTDot 16 1.0 9.3598e-03 1.7 8.64e+05 1.0 0.0e+00 0.0e+00 1.6e+01 0 0 0 0 0 0 2 0 0 3 159508
VecNorm 10 1.0 3.8018e-03 2.8 5.40e+05 1.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 1 0 0 2 245440
VecScale 40 1.0 2.3422e-0320.4 1.08e+05 2.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 72283
VecCopy 1 1.0 1.5903e-04 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 172 1.0 3.5458e-03 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 16 1.0 1.1342e-03 1.3 8.64e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 2 0 0 0 1316389
VecAYPX 48 1.0 1.8997e-03 1.7 7.42e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 1 0 0 0 671749
VecAssemblyBegin 2 1.0 5.9843e-0562.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 2 1.0 7.6056e-0579.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecScatterBegin 171 1.0 6.8316e-03 2.3 0.00e+00 0.0 3.0e+06 1.4e+03 0.0e+00 0 0 0 0 0 0 0 37 14 0 0
VecScatterEnd 171 1.0 6.3600e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 49 1.0 3.5715e-02 1.7 1.19e+07 1.1 1.0e+06 2.0e+03 0.0e+00 0 0 0 0 0 0 23 12 7 0 565174
MatMultAdd 40 1.0 4.9321e-02 4.7 2.75e+06 1.3 5.3e+05 6.6e+02 0.0e+00 0 0 0 0 0 0 5 7 1 0 92805
MatMultTranspose 40 1.0 2.4180e-02 2.9 2.75e+06 1.3 5.3e+05 6.6e+02 0.0e+00 0 0 0 0 0 0 5 7 1 0 189301
MatSolve 8 0.0 1.4651e-03 0.0 1.89e+06 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1293
MatSOR 80 1.0 7.9000e-02 1.3 2.18e+07 1.2 9.2e+05 1.5e+03 1.6e+01 0 0 0 0 0 0 41 11 5 3 464960
MatLUFactorSym 1 1.0 4.4470e-03373.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatLUFactorNum 1 1.0 1.3872e-024848.7 2.12e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1532
MatResidual 40 1.0 3.0815e-02 2.0 9.11e+06 1.2 9.2e+05 1.5e+03 0.0e+00 0 0 0 0 0 0 17 11 5 0 497042
MatAssemblyBegin 82 1.0 1.1102e+01 4.1 0.00e+00 0.0 1.2e+05 1.1e+04 0.0e+00 2 0 0 0 0 4 0 1 4 0 0
MatAssemblyEnd 82 1.0 1.2929e-01 1.1 0.00e+00 0.0 1.1e+06 5.2e+02 2.1e+02 0 0 0 0 1 0 0 14 2 36 0
MatGetRow 3100266 1.2 5.0643e+01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 21 0 0 0 0 43 0 0 0 0 0
MatGetRowIJ 1 0.0 1.6308e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatCreateSubMats 5 1.0 1.9433e-01 2.2 0.00e+00 0.0 1.0e+06 1.6e+04 1.0e+01 0 0 0 0 0 0 0 13 56 2 0
MatCreateSubMat 5 1.0 1.8586e+00 1.0 0.00e+00 0.0 3.7e+05 1.3e+04 8.4e+01 1 0 0 0 0 2 0 5 16 14 0
MatGetOrdering 1 0.0 4.2415e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatIncreaseOvrlp 5 1.0 8.5395e-02 1.1 0.00e+00 0.0 4.6e+05 9.9e+02 1.0e+01 0 0 0 0 0 0 0 6 2 2 0
MatCoarsen 5 1.0 2.5278e-02 1.2 0.00e+00 0.0 9.7e+05 5.5e+02 5.2e+01 0 0 0 0 0 0 0 12 2 9 0
MatZeroEntries 5 1.0 1.6418e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatView 26 1.0 3.8725e+00 1.0 0.00e+00 0.0 3.3e+05 1.4e+04 5.1e+01 2 0 0 0 0 3 0 4 16 9 0
MatPtAP 5 1.0 2.0472e-01 1.0 1.11e+07 1.3 1.1e+06 2.5e+03 8.3e+01 0 0 0 0 0 0 21 14 9 14 89957
MatPtAPSymbolic 5 1.0 1.2353e-01 1.0 0.00e+00 0.0 5.8e+05 2.7e+03 3.5e+01 0 0 0 0 0 0 0 7 5 6 0
MatPtAPNumeric 5 1.0 8.0794e-02 1.0 1.11e+07 1.3 5.5e+05 2.3e+03 4.5e+01 0 0 0 0 0 0 21 7 4 8 227941
MatGetLocalMat 5 1.0 2.8760e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetBrAoCol 5 1.0 4.8778e-03 1.8 0.00e+00 0.0 3.4e+05 3.4e+03 0.0e+00 0 0 0 0 0 0 0 4 4 0 0
SFSetGraph 10 1.0 1.1182e-0419.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetUp 10 1.0 8.0597e-03 1.2 0.00e+00 0.0 4.8e+05 5.8e+02 0.0e+00 0 0 0 0 0 0 0 6 1 0 0
SFBcastBegin 62 1.0 2.1942e-03 2.3 0.00e+00 0.0 1.0e+06 6.4e+02 0.0e+00 0 0 0 0 0 0 0 12 2 0 0
SFBcastEnd 62 1.0 6.9718e-03 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
GAMG: createProl 5 1.0 1.0694e+02 1.0 0.00e+00 0.0 3.6e+06 5.1e+03 2.8e+02 46 0 0 1 1 96 0 44 61 48 0
GAMG: partLevel 5 1.0 2.7904e-01 1.0 1.11e+07 1.3 1.2e+06 2.4e+03 1.9e+02 0 0 0 0 1 0 21 14 10 33 65998
repartition 2 1.0 1.8520e-03 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 2 0
Invert-Sort 2 1.0 4.2000e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00 0 0 0 0 0 0 0 0 0 1 0
Move A 2 1.0 4.0763e-02 1.0 0.00e+00 0.0 1.6e+04 7.9e+02 3.6e+01 0 0 0 0 0 0 0 0 0 6 0
Move P 2 1.0 2.8355e-02 1.1 0.00e+00 0.0 2.2e+04 1.3e+01 3.6e+01 0 0 0 0 0 0 0 0 0 6 0
PCSetUp 2 1.0 1.0727e+02 1.0 2.98e+07 3.5 4.8e+06 4.4e+03 4.9e+02 46 0 0 1 1 96 21 59 71 85 172
PCSetUpOnBlocks 8 1.0 1.8798e-02202.7 2.12e+07 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1130
PCApply 8 1.0 1.5085e-01 1.0 5.39e+07 1.8 2.9e+06 1.2e+03 1.6e+01 0 0 0 0 0 0 68 36 11 3 405880
--- Event Stage 2: Remaining Solves
KSPSolve 1000 1.0 1.1975e+02 1.0 3.63e+10 1.2 2.6e+09 1.3e+03 3.5e+04 52100100 99 98 100100100100100 512039
VecTDot 13000 1.0 9.7158e+00 1.3 7.02e+08 1.0 0.0e+00 0.0e+00 1.3e+04 4 2 0 0 37 7 2 0 0 37 124852
VecNorm 8000 1.0 2.9320e+00 1.1 4.32e+08 1.0 0.0e+00 0.0e+00 8.0e+03 1 1 0 0 22 2 1 0 0 23 254601
VecScale 35000 1.0 2.7666e-01 2.7 9.47e+07 2.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 535462
VecCopy 1000 1.0 8.3770e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 126000 1.0 1.5955e+00 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecAXPY 12000 1.0 8.3211e-01 1.2 6.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 1 2 0 0 0 1345675
VecAYPX 41000 1.0 1.3790e+00 1.5 5.92e+08 1.0 0.0e+00 0.0e+00 0.0e+00 1 2 0 0 0 1 2 0 0 0 737825
VecScatterBegin 147000 1.0 5.5527e+00 2.3 0.00e+00 0.0 2.6e+09 1.3e+03 0.0e+00 2 0100 99 0 4 0100100 0 0
VecScatterEnd 147000 1.0 3.3839e+01 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 20 0 0 0 0 0
MatMult 42000 1.0 2.4343e+01 1.4 1.01e+10 1.1 8.7e+08 1.9e+03 0.0e+00 9 28 33 48 0 16 28 33 48 0 703788
MatMultAdd 35000 1.0 2.0074e+01 2.3 2.40e+09 1.3 4.6e+08 6.6e+02 0.0e+00 7 7 18 9 0 14 7 18 9 0 199518
MatMultTranspose 35000 1.0 1.7168e+01 2.3 2.40e+09 1.3 4.6e+08 6.6e+02 0.0e+00 4 7 18 9 0 8 7 18 9 0 233286
MatSolve 7000 0.0 1.3333e+00 0.0 1.66e+09 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1244
MatSOR 70000 1.0 5.9088e+01 1.1 1.90e+10 1.2 8.0e+08 1.5e+03 1.4e+04 24 52 31 34 39 46 52 31 34 40 542874
MatResidual 35000 1.0 2.1124e+01 1.5 7.97e+09 1.2 8.0e+08 1.5e+03 0.0e+00 7 22 31 34 0 14 22 31 34 0 634453
PCSetUpOnBlocks 7000 1.0 1.1204e-0119.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
PCApply 7000 1.0 1.0287e+02 1.0 3.18e+10 1.2 2.5e+09 1.2e+03 1.4e+04 44 87 97 85 39 85 87 97 86 40 519965
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 1 8 10120 0.
DMKSP interface 1 1 656 0.
Vector 4 45 2366712 0.
Matrix 0 59 16548712 0.
Distributed Mesh 1 1 5248 0.
Index Set 2 14 305000 0.
IS L to G Mapping 1 1 131728 0.
Star Forest Graph 2 2 1728 0.
Discrete System 1 1 932 0.
Vec Scatter 1 12 231168 0.
Preconditioner 1 8 8692 0.
Viewer 1 2 1680 0.
Application Order 0 1 373248664 0.
--- Event Stage 1: First Solve
Krylov Solver 7 0 0 0.
Vector 142 101 3702616 0.
Matrix 124 65 27964988 0.
Matrix Coarsen 5 5 3180 0.
Index Set 102 90 187439200 0.
Star Forest Graph 10 10 8640 0.
Vec Scatter 28 17 21488 0.
Preconditioner 7 0 0 0.
Viewer 2 0 0 0.
Application Order 1 0 0 0.
--- Event Stage 2: Remaining Solves
Vector 35000 35000 2262792000 0.
========================================================================================================================
Average time to get PetscTime(): 6.19888e-07
Average time for MPI_Barrier(): 1.27792e-05
Average time for zero size MPI_Send(): 6.85591e-06
#PETSc Option Table entries:
-gamg_est_ksp_type cg
-iterations 1000
-ksp_norm_type unpreconditioned
-ksp_rtol 1E-6
-ksp_type cg
-log_view
-mat_view ::load_balance
-mesh_size 1E-4
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-nodes_per_proc 30
-pc_gamg_type classical
-pc_mg_levels 6
-pc_type gamg
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=no --COPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --CXXOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --FOPTFLAGS="-g -O3 -DPETSC_KERNEL_USE_UNROLL_4" --with-openmp=1 --download-sowing --download-fblaslapack=1 --download-scalapack=1 --download-metis=1 --download-parmetis=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --PETSC_ARCH=intel-bdw-opt --PETSC_DIR=/home/jczhang/petsc
-----------------------------------------
Libraries compiled on 2018-06-05 18:40:55 on beboplogin2
Machine characteristics: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Using PETSc directory: /home/jczhang/petsc
Using PETSc arch: intel-bdw-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp
Using Fortran compiler: mpif90 -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O3 -DPETSC_KERNEL_USE_UNROLL_4 -fopenmp
-----------------------------------------
Using include paths: -I/home/jczhang/petsc/include -I/home/jczhang/petsc/intel-bdw-opt/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -lpetsc -Wl,-rpath,/home/jczhang/petsc/intel-bdw-opt/lib -L/home/jczhang/petsc/intel-bdw-opt/lib -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib/debug_mt -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-mpi-2018.0.128-afy57nutkjquvasoogql4bmgwdjdhtbi/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc/x86_64-suse-linux/4.9.1 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib/gcc -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib64 -Wl,-rpath,/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -L/blues/gpfs/home/jczhang/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/hpctoolkit-2017.06-557cxm5zivsflxdq5sqgcx3j6z7ybn6n/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64_lin/gcc4.7 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/intel-17.0.4/intel-mkl-2017.3.196-v7uuj6zmthzln35n2hb7i5u5ybncv5ev/lib -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 
-L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/tbb/lib/intel64/gcc4.4 -Wl,-rpath,/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -L/blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-17.0.4-74uvhjiulyqgvsmywifbbuo46v5n42xc/lib/intel64 -Wl,-rpath,/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -L/blues/gpfs/home/software/bebop/craype-17.02-1-knl/opt/gcc/4.9.1/snos/lib -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib/debug_mt -Wl,-rpath,/opt/intel/mpi-rt/2017.0.0/intel64/lib -lscalapack -lflapack -lfblas -lparmetis -lmetis -lm -lX11 -lstdc++ -ldl -lmpifort -lmpi -lmpigi -lrt -lpthread -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------