[sajid@beboplogin3 matter_repeat]$ aps --report=/blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_result_20190301
| Summary information
|--------------------------------------------------------------------
  Application                 : ex_modify
  Report creation date        : 2019-03-01 12:50:05
  Number of ranks             : 64
  Ranks per node              : 64
  HW Platform                 : Intel(R) Processor code named Knights Landing
  Logical core count per node : 272
  Collector type              : Driverless Perf system-wide counting
  Used statistics             : /blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_result_20190301
|
| Your application is backend bound.
| Use memory access analysis tools like Intel(R) VTune(TM) Amplifier for a detailed
| metric breakdown by memory hierarchy, memory bandwidth, and correlation by
| memory objects.
|
  Elapsed time: 207.93 sec
  CPI Rate: 2.48
 | The CPI value may be too high.
 | This could be caused by such issues as memory stalls, instruction starvation,
 | branch misprediction, or long latency instructions.
 | Use Intel(R) VTune(TM) Amplifier General Exploration analysis to specify
 | particular reasons of high CPI.
  MPI Time: 34.35 sec            16.52%
 | Your application is MPI bound. This may be caused by high busy wait time
 | inside the library (imbalance), non-optimal communication schema or MPI
 | library settings. Explore the MPI Imbalance metric if it is available or use
 | MPI profiling tools like Intel(R) Trace Analyzer and Collector to explore
 | possible performance bottlenecks.
    MPI Imbalance: 15.72 sec            7.56%
   | The application workload is not well balanced between MPI ranks. For more
   | details about the MPI communication scheme use Intel(R) Trace Analyzer and
   | Collector available as part of Intel(R) Parallel Studio Cluster Edition.
    Top 5 MPI functions (avg time):
      Init_thread    7.80 sec ( 3.75 %)
      Allreduce      6.84 sec ( 3.29 %)
      Iprobe         5.55 sec ( 2.67 %)
      Test           4.54 sec ( 2.18 %)
      Allgather      3.01 sec ( 1.45 %)
  Back-End Stalls: 68.60%
 | A significant proportion of pipeline slots remain empty. When operations take
 | too long in the back-end, they introduce bubbles in the pipeline that
 | ultimately cause fewer pipeline slots containing useful work to be retired per
 | cycle than the machine is capable of supporting. This opportunity cost results
 | in slower execution. Long-latency operations like division and memory
 | operations can cause this, as can too many operations being directed to a
 | single execution port (for example, more multiplication operations arriving in
 | the back-end per cycle than the execution unit can support). Explore second
 | level metrics or use Intel(R) VTune(TM) Amplifier Memory Access analysis to
 | learn more.
    L2 Hit Bound:  5.80% of cycles
    L2 Miss Bound: 8.40% of cycles
  Average DRAM Bandwidth:   49.79 GB/s
  Average MCDRAM Bandwidth:  0.08 GB/s
  SIMD Instructions per Cycle: 0.09
 | The metric value indicates that FPU might be underutilized. This can be a
 | result of significant fraction of non-floating point instructions, inefficient
 | vectorization because of legacy vector instruction set or memory access
 | pattern issues, or different kinds of stalls in the code execution. Explore
 | second level metrics to identify the next steps in FPU usage improvements.
    % of Packed SIMD Instr.: 99.20%
    % of Scalar SIMD Instr.:  0.80%
  Disk I/O Bound: 0.00 sec ( 0.00 %)
    Data read:    0.0 KB
    Data written: 0.0 KB
  Memory Footprint:
  Resident:
    Per node:
      Peak resident set size    : 45492.11 MB (node apsxrmd-0001)
      Average resident set size : 45492.11 MB
    Per rank:
      Peak resident set size    : 722.28 MB (rank 44)
      Average resident set size : 710.81 MB
  Virtual:
    Per node:
      Peak memory consumption    : 144002.84 MB (node apsxrmd-0001)
      Average memory consumption : 144002.84 MB
    Per rank:
      Peak memory consumption    : 2261.47 MB (rank 13)
      Average memory consumption : 2250.04 MB

Graphical representation of this data is available in the HTML report: /blues/gpfs/home/sajid/packages/xwp_petsc/2d/matter_repeat/aps_report_20190301_125403.html
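
The report flags the run as back-end bound (68.60% back-end stalls, CPI 2.48) and points to VTune Amplifier's Memory Access and General Exploration analyses for the detailed breakdown. A minimal sketch of that follow-up, assuming Intel VTune Amplifier 2019 (amplxe-cl) is available in this environment and the job is launched the same way (64 ranks on one KNL node); result-directory names and application arguments are illustrative, not taken from the original run:

    # Memory Access analysis: breakdown by memory hierarchy and DRAM/MCDRAM bandwidth
    # (under MPI, VTune typically writes one result directory per node, suffixed with the hostname)
    mpirun -n 64 amplxe-cl -collect memory-access -result-dir vtune_ma -- ./ex_modify

    # General Exploration analysis: microarchitectural reasons behind the high CPI
    mpirun -n 64 amplxe-cl -collect general-exploration -result-dir vtune_ge -- ./ex_modify

    # Text summary of a collected result, viewable on the login node
    amplxe-cl -report summary -result-dir vtune_ma.<nodename>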
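
For the MPI side (16.52% MPI time, 7.56% imbalance, with Allreduce/Iprobe/Test near the top), the report recommends Intel Trace Analyzer and Collector. A sketch of collecting and inspecting a trace, assuming Intel MPI and ITAC from the same Parallel Studio installation with their environment scripts already sourced (paths are site-specific):

    # With Intel MPI, -trace preloads the ITAC collector and writes a trace file,
    # by default named after the executable (e.g. ex_modify.stf), in the working directory
    mpirun -trace -n 64 ./ex_modify

    # Open the trace in the GUI to examine imbalance and the communication pattern
    traceanalyzer ex_modify.stf &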
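
One number worth noting: average DRAM bandwidth is 49.79 GB/s while MCDRAM bandwidth is essentially zero, so the working set appears to live entirely in DDR4. Assuming the KNL node is configured in flat memory mode (in cache mode MCDRAM traffic is reported differently and this experiment does not apply), binding allocations toward MCDRAM is a quick check; node 1 is the usual MCDRAM NUMA node on flat-mode KNL, but numactl -H confirms it:

    # List NUMA nodes; on flat-mode KNL the ~16 GB MCDRAM typically shows up as node 1
    numactl -H

    # Prefer MCDRAM for allocations, falling back to DDR4 once MCDRAM is full
    mpirun -n 64 numactl --preferred=1 ./ex_modify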