[sajid@beboplogin3 matter_repeat]$ aps --report=/blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_result_20190301
| Summary information
|--------------------------------------------------------------------
  Application                 : ex_modify
  Report creation date        : 2019-03-01 12:50:05
  Number of ranks             : 64
  Ranks per node              : 64
  HW Platform                 : Intel(R) Processor code named Knights Landing
  Logical core count per node : 272
  Collector type              : Driverless Perf system-wide counting
  Used statistics             : /blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_result_20190301
|
| Your application is backend bound.
| Use memory access analysis tools like Intel(R) VTune(TM) Amplifier for a detailed
| metric breakdown by memory hierarchy, memory bandwidth, and correlation by
| memory objects.
|
  Elapsed time: 207.93 sec
  CPI Rate: 2.48
 | The CPI value may be too high.
 | This could be caused by such issues as memory stalls, instruction starvation,
 | branch misprediction, or long latency instructions.
 | Use Intel(R) VTune(TM) Amplifier General Exploration analysis to specify
 | particular reasons of high CPI.
  MPI Time: 34.35 sec            16.52%
 | Your application is MPI bound. This may be caused by high busy wait time
 | inside the library (imbalance), non-optimal communication schema or MPI
 | library settings. Explore the MPI Imbalance metric if it is available or use
 | MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
 | possible performance bottlenecks.
    MPI Imbalance: 15.72 sec            7.56%
   | The application workload is not well balanced between MPI ranks. For more
   | details about the MPI communication scheme use Intel(R) Trace Analyzer and
   | Collector available as part of Intel(R) Parallel Studio Cluster Edition.
    Top 5 MPI functions (avg time):
      Init_thread    7.80 sec ( 3.75 %)
      Allreduce      6.84 sec ( 3.29 %)
      Iprobe         5.55 sec ( 2.67 %)
      Test           4.54 sec ( 2.18 %)
      Allgather      3.01 sec ( 1.45 %)
  Back-End Stalls: 68.60%
 | A significant proportion of pipeline slots remain empty. When operations take
 | too long in the back-end, they introduce bubbles in the pipeline that
 | ultimately cause fewer pipeline slots containing useful work to be retired per
 | cycle than the machine is capable of supporting. This opportunity cost results
 | in slower execution. Long-latency operations like division and memory
 | operations can cause this, as can too many operations being directed to a
 | single execution port (for example, more multiplication operations arriving in
 | the back-end per cycle than the execution unit can support). Explore second
 | level metrics or use Intel(R) VTune(TM) Amplifier Memory Access analysis to
 | learn more.
    L2 Hit Bound:  5.80% of cycles
    L2 Miss Bound: 8.40% of cycles
  Average DRAM Bandwidth:   49.79 GB/s
  Average MCDRAM Bandwidth:  0.08 GB/s
  SIMD Instructions per Cycle: 0.09
 | The metric value indicates that FPU might be underutilized. This can be a
 | result of significant fraction of non-floating point instructions, inefficient
 | vectorization because of legacy vector instruction set or memory access
 | pattern issues, or different kinds of stalls in the code execution. Explore
 | second level metrics to identify the next steps in FPU usage improvements.
    % of Packed SIMD Instr.: 99.20%
    % of Scalar SIMD Instr.:  0.80%
  Disk I/O Bound: 0.00 sec ( 0.00 %)
    Data read:    0.0 KB
    Data written: 0.0 KB
  Memory Footprint:
  Resident:
    Per node:
      Peak resident set size    : 45492.11 MB (node apsxrmd-0001)
      Average resident set size : 45492.11 MB
    Per rank:
      Peak resident set size    : 722.28 MB (rank 44)
      Average resident set size : 710.81 MB
  Virtual:
    Per node:
      Peak memory consumption    : 144002.84 MB (node apsxrmd-0001)
      Average memory consumption : 144002.84 MB
    Per rank:
      Peak memory consumption    : 2261.47 MB (rank 13)
      Average memory consumption : 2250.04 MB

Graphical representation of this data is available in the HTML report: /blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_report_20190301_125403.html
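
The report flags the run as back-end bound (68.60% back-end stalls, CPI 2.48) and points to VTune Amplifier's Memory Access and General Exploration analyses for the detailed breakdown. A minimal sketch of that follow-up, assuming Intel VTune Amplifier 2019 (amplxe-cl) is available in this environment and the job is launched the same way (64 ranks on one KNL node); result-directory names and application arguments are illustrative, not taken from the original run:

    # Memory Access analysis: breakdown by memory hierarchy and DRAM/MCDRAM bandwidth
    # (under MPI, VTune typically writes one result directory per node, suffixed with the hostname)
    mpirun -n 64 amplxe-cl -collect memory-access -result-dir vtune_ma -- ./ex_modify

    # General Exploration analysis: microarchitectural reasons behind the high CPI
    mpirun -n 64 amplxe-cl -collect general-exploration -result-dir vtune_ge -- ./ex_modify

    # Text summary of a collected result, viewable on the login node
    amplxe-cl -report summary -result-dir vtune_ma.<nodename>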
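
For the MPI side (16.52% MPI time, 7.56% imbalance, with Allreduce/Iprobe/Test near the top), the report recommends Intel Trace Analyzer and Collector. A sketch of collecting and inspecting a trace, assuming Intel MPI and ITAC from the same Parallel Studio installation with their environment scripts already sourced (paths are site-specific):

    # With Intel MPI, -trace preloads the ITAC collector and writes a trace file,
    # by default named after the executable (e.g. ex_modify.stf), in the working directory
    mpirun -trace -n 64 ./ex_modify

    # Open the trace in the GUI to examine imbalance and the communication pattern
    traceanalyzer ex_modify.stf &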
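
One number worth noting: average DRAM bandwidth is 49.79 GB/s while MCDRAM bandwidth is essentially zero, so the working set appears to live entirely in DDR4. Assuming the KNL node is configured in flat memory mode (in cache mode MCDRAM traffic is reported differently and this experiment does not apply), binding allocations toward MCDRAM is a quick check; node 1 is the usual MCDRAM NUMA node on flat-mode KNL, but numactl -H confirms it:

    # List NUMA nodes; on flat-mode KNL the ~16 GB MCDRAM typically shows up as node 1
    numactl -H

    # Prefer MCDRAM for allocations, falling back to DDR4 once MCDRAM is full
    mpirun -n 64 numactl --preferred=1 ./ex_modify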