[petsc-users] Configuring PETSc for KNL

Tue Apr 4 15:34:00 CDT 2017

Hey,

Here is some data on what you should see with STREAM when comparing
against conventional Xeons:
https://www.karlrupp.net/2016/07/knights-landing-vs-knights-corner-haswell-ivy-bridge-and-sandy-bridge-stream-benchmark-results/

Note that MCDRAM only pays off if you can keep enough cores busy. Thus, 
anything below 16 processes is unlikely to give you any benefit. Also, 
your working set must be large enough not to stay in L3 on Haswell (I 
think this was already mentioned earlier in this thread).
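
As a rough way to check the working-set condition, here is a minimal
sketch that compares the size of a few double-precision vectors with an
assumed L3 capacity. The function names and the 40 MiB cache figure are
assumptions for illustration only; substitute the actual L3 size of your
Haswell socket and the vector lengths your run places on it.

```shell
#!/bin/sh
# Rough check: does a set of double-precision vectors outgrow the L3 cache?
# All sizes below are illustrative assumptions, not measurements.

# working_set_bytes N_ENTRIES N_VECTORS
# Prints the bytes occupied by N_VECTORS vectors of N_ENTRIES 8-byte doubles.
working_set_bytes() {
    echo $(( $1 * $2 * 8 ))
}

# exceeds_cache N_ENTRIES N_VECTORS CACHE_MIB
# Prints "yes" if the working set is larger than CACHE_MIB MiB, else "no".
exceeds_cache() {
    ws=$(working_set_bytes "$1" "$2")
    cache=$(( $3 * 1024 * 1024 ))
    if [ "$ws" -gt "$cache" ]; then echo yes; else echo no; fi
}

# Example: three vectors (as in a STREAM triad), 40 MiB of L3 assumed.
exceeds_cache 1000000 3 40     # 24,000,000-byte working set
exceeds_cache 10000000 3 40    # 240,000,000-byte working set
```

If the answer is "no", the Haswell run is probably being served largely
from cache, and the comparison says little about MCDRAM versus DDR4
bandwidth.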

Best regards,
Karli

On 04/04/2017 06:05 PM, Matthew Knepley wrote:
> On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <jychang48 at gmail.com
> <mailto:jychang48 at gmail.com>> wrote:
>
>     Thanks everyone for the helpful advice. So I tried all the
>     suggestions including using libsci. The performance did not improve
>     for my particular runs, which I think suggests the problem
>     parameters chosen for my tests (SNES ex48) are not optimal for KNL.
>     Does anyone have example test runs I could reproduce that compare
>     the performance between KNL and Haswell/Ivybridge/etc?
>
>
> Let's try to see what is going on with your existing data first.
>
> First, I think the main thing is to make sure we are using MCDRAM.
> Everything else in KNL
> is window dressing (IMHO). All we have to look at is something like
> MAXPY. You can get the
> bandwidth estimate from the flop rate and problem size (I think), and we
> can at least get
> bandwidth ratios between Haswell and KNL with that number.
>
>    Matt
>
>
>     On Mon, Apr 3, 2017 at 3:06 PM Richard Mills
>     <richardtmills at gmail.com <mailto:richardtmills at gmail.com>> wrote:
>
>         Yes, one should rely on MKL (or Cray LibSci, if using the Cray
>         toolchain) on Cori.  But I'm guessing that this will make no
>         noticeable difference for what Justin is doing.
>
>         --Richard
>
>         On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <keceli at gmail.com
>         <mailto:keceli at gmail.com>> wrote:
>
>             How about replacing --download-fblaslapack with a
>             vendor-specific BLAS/LAPACK?
>
>             Murat
>
>             On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills
>             <richardtmills at gmail.com <mailto:richardtmills at gmail.com>>
>             wrote:
>
>                 On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong
>                 <hongzhang at anl.gov <mailto:hongzhang at anl.gov>> wrote:
>
>
>>                     On Apr 3, 2017, at 1:44 PM, Justin Chang
>>                     <jychang48 at gmail.com <mailto:jychang48 at gmail.com>>
>>                     wrote:
>>
>>                     Richard,
>>
>>                     This is what my job script looks like:
>>
>>                     #!/bin/bash
>>                     #SBATCH -N 16
>>                     #SBATCH -C knl,quad,flat
>>                     #SBATCH -p regular
>>                     #SBATCH -J knlflat1024
>>                     #SBATCH -L SCRATCH
>>                     #SBATCH -o knlflat1024.o%j
>>                     #SBATCH --mail-type=ALL
>>                     #SBATCH --mail-user=jychang48 at gmail.com
>>                     #SBATCH -t 00:20:00
>>
>>                     #run the application:
>>                     cd $SCRATCH/Icesheet
>>                     sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
>>                     srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 \
>>                       /tmp/ex48cori -M 128 -N 128 -P 16 -thi_mat_type \
>>                       baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1
>>
>
>                     Maybe it is a typo. It should be numactl -m 1.
>
>
>                 "-p 1" will also work.  "-p" means to "prefer" NUMA node
>                 1 (the MCDRAM), whereas "-m" means to use only NUMA node
>                 1.  In the former case, MCDRAM will be used for
>                 allocations until the available memory there has been
>                 exhausted, and then things will spill over into the
>                 DRAM.  One would think that "-m" would be better for
>                 doing performance studies, but on systems where the
>                 nodes have swap space enabled, you can get terrible
>                 performance if your code's working set exceeds the size
>                 of the MCDRAM, as the system will obey your request not
>                 to use the DRAM and go straight to the swap
>                 disk!  I assume the Cori nodes don't have swap space,
>                 though I could be wrong.
>
>
>>                     According to the NERSC info pages, they say to add
>>                     the "numactl" if using flat mode. Previously I
>>                     tried cache mode but the performance seems to be
>>                     unaffected.
>
>                     Using cache mode should give performance similar to
>                     flat mode with the numactl option. Both approaches
>                     should be significantly faster than flat mode
>                     without the numactl option; I usually see over a
>                     3X speedup. You can run the same comparison to
>                     check that the high-bandwidth memory is working properly.
>
>>                     I also compared 256 Haswell nodes vs 256 KNL
>>                     nodes, and Haswell is roughly 4-5x faster. Though I
>>                     suspect this drastic change has much to do with
>>                     the initial coarse grid size now being extremely
>>                     small.
>
>                 I think you may be right about why you see such a big
>                 difference.  The KNL nodes need enough work to be able
>                 to use the SIMD lanes effectively.  Also, if your
>                 problem gets small enough, then it's going to be able to
>                 fit in the Haswell's L3 cache.  Although KNL has MCDRAM
>                 and this delivers *a lot* more memory bandwidth than the
>                 DDR4 memory, it will deliver a lot less bandwidth than
>                 the Haswell's L3.
>
>>                     I'll give the COPTFLAGS a try and see what happens
>
>                     Make sure to use --with-memalign=64 for data
>                     alignment when configuring PETSc.
>
>
>                 Ah, yes, I forgot that.  Thanks for mentioning it, Hong!
>
>
>                     The option -xMIC-AVX512 would improve the
>                     vectorization performance. But it may cause problems
>                     for the MPIBAIJ format for some unknown reason.
>                     MPIAIJ should work fine with this option.
>
>
>                 Hmm.  Try both, and, if you see worse performance with
>                 MPIBAIJ, let us know and I'll try to figure this out.
>
>                 --Richard
>
>
>
>                     Hong (Mr.)
>
>>                     Thanks,
>>                     Justin
>>
>>                     On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills
>>                     <richardtmills at gmail.com
>>                     <mailto:richardtmills at gmail.com>> wrote:
>>
>>                         Hi Justin,
>>
>>                         How is the MCDRAM (on-package "high-bandwidth
>>                         memory") configured for your KNL runs?  And if
>>                         it is in "flat" mode, what are you doing to
>>                         ensure that you use the MCDRAM?  Doing this
>>                         wrong seems to be one of the most common
>>                         reasons for unexpectedly poor performance on KNL.
>>
>>                         I'm not that familiar with the environment on
>>                         Cori, but I think that if you are building for
>>                         KNL, you should add "-xMIC-AVX512" to your
>>                         compiler flags to explicitly instruct the
>>                         compiler to use the AVX512 instruction set.  I
>>                         usually use something along the lines of
>>
>>                           'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
>>
>>                         (The "-g" just adds symbols, which make the
>>                         output from performance profiling tools much
>>                         more useful.)
>>
>>                         That said, I think that if you are comparing
>>                         1024 Haswell cores vs. 1024 KNL cores (so
>>                         double the number of Haswell nodes), I'm not
>>                         surprised that the simulations are almost
>>                         twice as fast using the Haswell nodes.  Keep
>>                         in mind that an individual KNL core is much
>>                         less powerful than an individual Haswell
>>                         core.  You are also using roughly twice the
>>                         power footprint (a dual-socket Haswell node
>>                         should be roughly equivalent to a KNL node, I
>>                         believe).  How do things look when you
>>                         compare equal numbers of nodes?
>>
>>                         Cheers,
>>                         Richard
>>
>>                         On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang
>>                         <jychang48 at gmail.com
>>                         <mailto:jychang48 at gmail.com>> wrote:
>>
>>                             Hi all,
>>
>>                             On NERSC's Cori I have the following
>>                             configure options for PETSc:
>>
>>                             ./configure --download-fblaslapack
>>                             --with-cc=cc --with-clib-autodetect=0
>>                             --with-cxx=CC --with-cxxlib-autodetect=0
>>                             --with-debugging=0 --with-fc=ftn
>>                             --with-fortranlib-autodetect=0
>>                             --with-mpiexec=srun
>>                             --with-64-bit-indices=1 COPTFLAGS=-O3
>>                             CXXOPTFLAGS=-O3 FOPTFLAGS=-O3
>>                             PETSC_ARCH=arch-cori-opt
>>
>>                             Where I swapped out the default Intel
>>                             programming environment with that of Cray
>>                             (e.g., 'module switch PrgEnv-intel/6.0.3
>>                             PrgEnv-cray/6.0.3'). I want to document
>>                             the performance difference between Cori's
>>                             Haswell and KNL processors.
>>
>>                             When I run a PETSc example like SNES ex48
>>                             on 1024 cores (32 Haswell and 16 KNL
>>                             nodes), the simulations are almost twice
>>                             as fast on the Haswell nodes, which leads me
>>                             to suspect that I am not doing something
>>                             right for KNL. Does anyone know what some
>>                             "optimal" configure options are for
>>                             running PETSc on KNL?
>>
>>                             Thanks,
>>                             Justin
>>
>>
>>
>
>
>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener