[petsc-users] Configuring PETSc for KNL

Tue Apr 4 10:57:59 CDT 2017

Thanks everyone for the helpful advice. So I tried all the suggestions
including using libsci. The performance did not improve for my particular
runs, which I think suggests the problem parameters chosen for my tests
(SNES ex48) are not optimal for KNL. Does anyone have example test runs I
could reproduce that compare the performance between KNL and
Haswell/Ivybridge/etc?

On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <richardtmills at gmail.com>
wrote:

> Yes, one should rely on MKL (or Cray LibSci, if using the Cray toolchain)
> on Cori.  But I'm guessing that this will make no noticeable difference for
> what Justin is doing.
>
> --Richard
>
> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <keceli at gmail.com> wrote:
>
> How about replacing --download-fblaslapack with vendor-specific
> BLAS/LAPACK?
>
> Murat
>
> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <richardtmills at gmail.com>
> wrote:
>
> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
>
>
> On Apr 3, 2017, at 1:44 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
> Richard,
>
> This is what my job script looks like:
>
> #!/bin/bash
> #SBATCH -N 16
> #SBATCH -C knl,quad,flat
> #SBATCH -p regular
> #SBATCH -J knlflat1024
> #SBATCH -L SCRATCH
> #SBATCH -o knlflat1024.o%j
> #SBATCH --mail-type=ALL
> #SBATCH --mail-user=jychang48 at gmail.com
> #SBATCH -t 00:20:00
>
> #run the application:
> cd $SCRATCH/Icesheet
> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N 128 \
>     -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1
>
>
> Maybe it is a typo. It should be numactl -m 1.
>
>
> "-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the MCDRAM),
> whereas "-m" means to use only NUMA node 1.  In the former case, MCDRAM
> will be used for allocations until the available memory there has been
> exhausted, and then things will spill over into the DRAM.  One would think
> that "-m" would be better for doing performance studies, but on systems
> where the nodes have swap space enabled, you can get terrible performance
> if your code's working set exceeds the size of the MCDRAM, as the system
> will obediently honor your wish not to use the DRAM and go straight to the
> swap disk!  I assume the Cori nodes don't have swap space, though I could
> be wrong.
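
The trade-off between "-p" and "-m" described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name pick_numactl_policy is hypothetical, MCDRAM is assumed to be NUMA node 1 (as in the job script above), and the amount of configured swap is assumed to be readable from the SwapTotal line of /proc/meminfo on Linux.

```shell
#!/bin/sh
# Hypothetical helper: choose a numactl memory policy for flat-mode KNL runs.
# Argument: configured swap space in kB.
pick_numactl_policy() {
    swap_kb=$1
    if [ "$swap_kb" -gt 0 ]; then
        # Swap enabled: with "-m 1" an oversized working set could end up on
        # the swap disk, so prefer MCDRAM and let allocations spill into DRAM.
        echo "-p 1"
    else
        # No swap: bind strictly to NUMA node 1 (assumed to be the MCDRAM).
        echo "-m 1"
    fi
}

# Read SwapTotal on Linux; fall back to 0 if it cannot be read.
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
swap_kb=${swap_kb:-0}

# Dry run: only print the resulting command prefix.
echo "numactl $(pick_numactl_policy "$swap_kb") /tmp/ex48cori"
```

Running "numactl -H" on a compute node should show which NUMA node actually holds the MCDRAM, which is worth checking before relying on node 1.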
>
>
> The NERSC info pages say to add "numactl" when using flat mode. Previously
> I tried cache mode, but the performance seemed to be unaffected.
>
>
> Using cache mode should give similar performance to using flat mode with
> the numactl option. But both approaches should be significantly faster than
> using flat mode without the numactl option. I usually see over 3X speedup.
> You can also run this comparison to check whether the high-bandwidth memory
> is working properly.
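
A minimal sketch of that sanity check, assuming a flat-mode allocation and the same launch line as in the job script above. The helper name speedup is hypothetical, the application options are elided, and no timings are supplied here; they have to come from your own runs.

```shell
#!/bin/sh
# Hypothetical helper: ratio of two wall-clock times, printed to two decimals.
# usage: speedup <time_without_numactl> <time_with_numactl>
speedup() {
    awk -v t_ref="$1" -v t_new="$2" 'BEGIN { printf "%.2f\n", t_ref / t_new }'
}

# Dry run: print the two launch lines to compare; nothing is submitted.
launch="srun -n 1024 -c 4 --cpu_bind=cores"
app="/tmp/ex48cori"
echo "flat, no numactl : $launch $app <options>"
echo "flat, MCDRAM     : $launch numactl -p 1 $app <options>"
```

Feeding the two measured solve times to speedup gives the ratio to compare against the "over 3X" figure quoted above. A ratio close to 1 would suggest the MCDRAM is not actually being used.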
>
> I also comparerd 256 haswell nodes vs 256 KNL nodes and haswell is nearly
> 4-5x faster. Though I suspect this drastic change has much to do with the
> initial coarse grid size now being extremely small.
>
> I think you may be right about why you see such a big difference.  The KNL
> nodes need enough work to be able to use the SIMD lanes effectively.  Also,
> if your problem gets small enough, then it's going to be able to fit in the
> Haswell's L3 cache.  Although KNL has MCDRAM and this delivers *a lot* more
> memory bandwidth than the DDR4 memory, it will deliver a lot less bandwidth
> than the Haswell's L3.
>
> I'll give the COPTFLAGS a try and see what happens
>
>
> Make sure to use --with-memalign=64 for data alignment when configuring
> PETSc.
>
>
> Ah, yes, I forgot that.  Thanks for mentioning it, Hong!
>
>
> The option -xMIC-AVX512 would improve the vectorization performance. But
> it may cause problems for the MPIBAIJ format for some unknown reason.
> MPIAIJ should work fine with this option.
>
>
> Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us
> know and I'll try to figure this out.
>
> --Richard
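
One way to "try both" is to run the same case with each matrix type. The sketch below is a dry run that only prints the two launch lines. It assumes that passing "aij" to the -thi_mat_type option from the job script above selects the AIJ format on this example; the function name mat_type_runs is hypothetical.

```shell
#!/bin/sh
# Dry run: print one launch line per matrix type; nothing is executed.
mat_type_runs() {
    for mat in baij aij; do
        echo "srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori" \
             "-M 128 -N 128 -P 16 -thi_mat_type $mat -pc_type mg" \
             "-mg_coarse_pc_type gamg -da_refine 1"
    done
}

mat_type_runs
```

Comparing the two runs for builds with and without -xMIC-AVX512 would show whether the slowdown is specific to the BAIJ format.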
>
>
>
> Hong (Mr.)
>
> Thanks,
> Justin
>
> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <richardtmills at gmail.com>
> wrote:
>
> Hi Justin,
>
> How is the MCDRAM (on-package "high-bandwidth memory") configured for your
> KNL runs?  And if it is in "flat" mode, what are you doing to ensure that
> you use the MCDRAM?  Doing this wrong seems to be one of the most common
> reasons for unexpectedly poor performance on KNL.
>
> I'm not that familiar with the environment on Cori, but I think that if
> you are building for KNL, you should add "-xMIC-AVX512" to your compiler
> flags to explicitly instruct the compiler to use the AVX512 instruction
> set.  I usually use something along the lines of
>
>   'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
>
> (The "-g" just adds symbols, which make the output from performance
> profiling tools much more useful.)
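
Putting the suggestions in this thread together, a KNL configure line might look roughly like the sketch below. It is a dry run that only prints the command. It assumes the Intel compilers sit behind the cc/CC/ftn wrappers, since -fp-model and -xMIC-AVX512 are Intel compiler options. The PETSC_ARCH name and the function name are hypothetical. The BLAS/LAPACK option is kept as in the original configure line because the thread does not give the exact option for MKL or LibSci, and only COPTFLAGS is changed because that is the only flag set given.

```shell
#!/bin/sh
# Dry run: assemble and print a configure line; nothing is configured here.
KNL_COPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512"

knl_configure_line() {
    printf '%s ' ./configure \
        --download-fblaslapack \
        --with-cc=cc --with-clib-autodetect=0 \
        --with-cxx=CC --with-cxxlib-autodetect=0 \
        --with-fc=ftn --with-fortranlib-autodetect=0 \
        --with-debugging=0 --with-mpiexec=srun \
        --with-64-bit-indices=1 \
        --with-memalign=64 \
        "COPTFLAGS='$KNL_COPTFLAGS'" \
        CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 \
        PETSC_ARCH=arch-cori-knl-opt   # hypothetical arch name
    printf '\n'
}

knl_configure_line
```

Using a separate PETSC_ARCH keeps the KNL build apart from the existing arch-cori-opt build, so the two can be compared side by side.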
>
> That said, if you are comparing 1024 Haswell cores vs. 1024 KNL cores (so
> double the number of Haswell nodes), I'm not surprised that the simulations
> are almost twice as fast using the Haswell nodes.  Keep in mind that an
> individual KNL core is much less powerful than an individual Haswell
> core.  You are also using roughly twice the power footprint (a dual-socket
> Haswell node should be roughly equivalent to a KNL node, I believe).  How
> do things look when you compare equal numbers of nodes?
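
To make the node arithmetic concrete: the runs in this thread put 1024 ranks on 32 Haswell nodes or on 16 KNL nodes, i.e. 32 and 64 ranks per node. A small sketch of the rank counts for an equal-node comparison follows. The function name ranks_for_nodes is hypothetical, and the choice of 16 nodes is just an example.

```shell
#!/bin/sh
# Ranks per node as used in this thread: 1024 ranks on 32 Haswell nodes
# and 1024 ranks on 16 KNL nodes.
HASWELL_RANKS_PER_NODE=$((1024 / 32))
KNL_RANKS_PER_NODE=$((1024 / 16))

# usage: ranks_for_nodes <nodes> <ranks_per_node>
ranks_for_nodes() {
    echo $(( $1 * $2 ))
}

# Dry run: print the node and rank counts for an equal-node comparison.
nodes=16
echo "Haswell: srun -N $nodes -n $(ranks_for_nodes "$nodes" "$HASWELL_RANKS_PER_NODE")"
echo "KNL:     srun -N $nodes -n $(ranks_for_nodes "$nodes" "$KNL_RANKS_PER_NODE")"
```

With equal node counts the KNL run uses twice as many ranks as the Haswell run, so the grid decomposition differs between the two and should be kept in mind when reading the timings.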
>
> Cheers,
> Richard
>
> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <jychang48 at gmail.com> wrote:
>
> Hi all,
>
> On NERSC's Cori I have the following configure options for PETSc:
>
> ./configure --download-fblaslapack --with-cc=cc --with-clib-autodetect=0
> --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
> --with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1
> COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
>
> Here I swapped out the default Intel programming environment for Cray's
> (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3'). I want
> to document the performance difference between Cori's Haswell and KNL
> processors.
>
> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell and 16
> KNL nodes), the simulations are almost twice as fast on Haswell nodes,
> which leads me to suspect that I am not doing something right for KNL. Does
> anyone know of some "optimal" configure options for running PETSc on
> KNL?
>
> Thanks,
> Justin