[petsc-users] Configuring PETSc for KNL

Justin Chang jychang48 at gmail.com
Tue Apr 4 14:44:39 CDT 2017


Attached are the job output files (which include -log_view) for SNES ex48
run on a single Haswell node and a single KNL node (32 and 64 cores,
respectively). I started from a coarse grid of size 40x40x5 and ran three
tests with -da_refine 1/2/3 and -pc_type mg.
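
For reference, the runs were along these lines (a sketch of the command
lines; the exact flags are in the attached -log_view output):

# 1 Haswell node, 32 MPI ranks
srun -n 32 ./ex48cori -M 40 -N 40 -P 5 -thi_mat_type baij -pc_type mg -da_refine 1

# 1 KNL node (quad,flat), 64 MPI ranks, preferring MCDRAM
srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./ex48cori -M 40 -N 40 -P 5 -thi_mat_type baij -pc_type mg -da_refine 1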

What's interesting/strange is that if I try -da_refine 4 on KNL, I get a
Slurm error: "slurmstepd: error: Step 4408401.0 exceeded memory limit
(96737652 > 94371840), being killed", but the same run works perfectly
fine on Haswell. Adding -pc_mg_levels 7 lets KNL run with -da_refine 4,
but the performance still does not beat Haswell.
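
For example, the case that did run on KNL was roughly (a sketch inferred
from the description above, not copied from the job script):

srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./ex48cori -M 40 -N 40 -P 5 -thi_mat_type baij -pc_type mg -da_refine 4 -pc_mg_levels 7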

The performance spectrum (dofs/sec) for 1-3 levels of refinement looks like
this:

Haswell:
2.416e+03
1.490e+04
5.188e+04

KNL:
9.308e+02
7.257e+03
3.838e+04

This might suggest to me that KNL performs relatively better at larger
problem sizes.

On Tue, Apr 4, 2017 at 11:05 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> Thanks everyone for the helpful advice. So I tried all the suggestions
>> including using libsci. The performance did not improve for my particular
>> runs, which I think suggests the problem parameters chosen for my tests
>> (SNES ex48) are not optimal for KNL. Does anyone have example test runs I
>> could reproduce that compare the performance between KNL and
>> Haswell/Ivybridge/etc?
>>
>
> Let's try to see what is going on with your existing data first.
>
> First, I think the main thing is to make sure we are using MCDRAM.
> Everything else in KNL is window dressing (IMHO). All we have to look at
> is something like MAXPY. You can get a bandwidth estimate from the flop
> rate and problem size (I think), and we can at least get bandwidth ratios
> between Haswell and KNL from that number.
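>
> As a rough sketch of that arithmetic (the standard streaming model for
> VecMAXPY, not numbers taken from your logs): y <- y + sum_i a_i x_i over
> n vectors of length N costs about 2*n*N flops while moving roughly
> (n+2)*N*8 bytes, i.e. about 4 bytes per flop for moderate n. So a
> VecMAXPY rate of, say, 25 GF/s in -log_view would correspond to roughly
> 100 GB/s of sustained bandwidth, and the Haswell/KNL ratio of those rates
> gives the bandwidth ratio.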
>
>    Matt
>
>
>> On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <richardtmills at gmail.com>
>> wrote:
>>
>>> Yes, one should rely on MKL (or Cray LibSci, if using the Cray
>>> toolchain) on Cori.  But I'm guessing that this will make no noticeable
>>> difference for what Justin is doing.
>>>
>>> --Richard
>>>
>>> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <keceli at gmail.com> wrote:
>>>
>>> How about replacing --download-fblaslapack with vendor specific
>>> BLAS/LAPACK?
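>>>
>>> For instance, with PrgEnv-intel something along these lines (a sketch;
>>> the MKL path depends on the loaded modules):
>>>
>>> ./configure --with-blaslapack-dir=$MKLROOT --with-cc=cc --with-cxx=CC \
>>>   --with-fc=ftn --with-debugging=0 ...
>>>
>>> With PrgEnv-cray, the cc/CC/ftn wrappers already link Cray LibSci, so
>>> simply dropping --download-fblaslapack should pick it up.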
>>>
>>> Murat
>>>
>>> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <richardtmills at gmail.com>
>>> wrote:
>>>
>>> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
>>>
>>>
>>> On Apr 3, 2017, at 1:44 PM, Justin Chang <jychang48 at gmail.com> wrote:
>>>
>>> Richard,
>>>
>>> This is what my job script looks like:
>>>
>>> #!/bin/bash
>>> #SBATCH -N 16
>>> #SBATCH -C knl,quad,flat
>>> #SBATCH -p regular
>>> #SBATCH -J knlflat1024
>>> #SBATCH -L SCRATCH
>>> #SBATCH -o knlflat1024.o%j
>>> #SBATCH --mail-type=ALL
>>> #SBATCH --mail-user=jychang48 at gmail.com
>>> #SBATCH -t 00:20:00
>>>
>>> #run the application:
>>> cd $SCRATCH/Icesheet
>>> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N
>>> 128 -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine
>>> 1
>>>
>>>
>>> Maybe it is a typo. It should be numactl -m 1.
>>>
>>>
>>> "-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the MCDRAM),
>>> whereas "-m" means to use only NUMA node 1.  In the former case, MCDRAM
>>> will be used for allocations until the available memory there has been
>>> exhausted, and then things will spill over into the DRAM.  One would think
>>> that "-m" would be better for doing performance studies, but on systems
>>> where the nodes have swap space enabled, you can get terrible performance
>>> if your code's working set exceeds the size of the MCDRAM, as the system
>>> will obediently obey your wishes to not use the DRAM and go straight to the
>>> swap disk!  I assume the Cori nodes don't have swap space, though I could
>>> be wrong.
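>>>
>>> Concretely, the two variants would look something like this (a sketch
>>> based on the srun line above):
>>>
>>> # prefer MCDRAM (NUMA node 1), spill to DDR4 if it fills up
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori ...
>>> # bind strictly to MCDRAM
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -m 1 /tmp/ex48cori ...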
>>>
>>>
>>> According to the NERSC info pages, one should add the "numactl" call
>>> when using flat mode. I previously tried cache mode, but the performance
>>> seemed to be unaffected.
>>>
>>>
>>> Using cache mode should give performance similar to using flat mode with
>>> the numactl option. But both approaches should be significantly faster
>>> than using flat mode without the numactl option. I usually see over a 3X
>>> speedup. You can also do such a comparison to check whether the
>>> high-bandwidth memory is working properly.
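>>>
>>> For example, a quick three-way check (a sketch using the constraints
>>> from the job script above):
>>>
>>> #SBATCH -C knl,quad,cache  plus  srun ... /tmp/ex48cori ...               (cache mode)
>>> #SBATCH -C knl,quad,flat   plus  srun ... numactl -p 1 /tmp/ex48cori ...  (flat, MCDRAM)
>>> #SBATCH -C knl,quad,flat   plus  srun ... /tmp/ex48cori ...               (flat, DDR4 only)
>>>
>>> If MCDRAM is working, the first two should be close and the third
>>> noticeably slower.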
>>>
>>> I also compared 256 Haswell nodes vs. 256 KNL nodes, and Haswell is
>>> nearly 4-5x faster, though I suspect this drastic difference has much to
>>> do with the initial coarse grid size now being extremely small.
>>>
>>> I think you may be right about why you see such a big difference.  The
>>> KNL nodes need enough work to be able to use the SIMD lanes effectively.
>>> Also, if your problem gets small enough, then it's going to be able to fit
>>> in the Haswell's L3 cache.  Although KNL has MCDRAM and this delivers *a
>>> lot* more memory bandwidth than the DDR4 memory, it will deliver a lot less
>>> bandwidth than the Haswell's L3.
>>>
>>> I'll give the COPTFLAGS a try and see what happens.
>>>
>>>
>>> Make sure to use --with-memalign=64 for data alignment when configuring
>>> PETSc.
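>>>
>>> That is, something along these lines for the KNL build (a sketch
>>> combining the suggestions in this thread):
>>>
>>> ./configure --with-cc=cc --with-cxx=CC --with-fc=ftn --with-debugging=0 \
>>>   --with-memalign=64 --with-64-bit-indices=1 --with-mpiexec=srun \
>>>   COPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512" \
>>>   CXXOPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512" \
>>>   FOPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512"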
>>>
>>>
>>> Ah, yes, I forgot that.  Thanks for mentioning it, Hong!
>>>
>>>
>>> The option -xMIC-AVX512 would improve the vectorization performance. But
>>> it may cause problems for the MPIBAIJ format for some unknown reason.
>>> MPIAIJ should work fine with this option.
>>>
>>>
>>> Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us
>>> know and I'll try to figure this out.
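>>>
>>> i.e., the same run line with the two matrix types (sketch):
>>>
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori ... -thi_mat_type baij
>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori ... -thi_mat_type aij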
>>>
>>> --Richard
>>>
>>>
>>>
>>> Hong (Mr.)
>>>
>>> Thanks,
>>> Justin
>>>
>>> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <richardtmills at gmail.com>
>>> wrote:
>>>
>>> Hi Justin,
>>>
>>> How is the MCDRAM (on-package "high-bandwidth memory") configured for
>>> your KNL runs?  And if it is in "flat" mode, what are you doing to ensure
>>> that you use the MCDRAM?  Doing this wrong seems to be one of the most
>>> common reasons for unexpected poor performance on KNL.
>>>
>>> I'm not that familiar with the environment on Cori, but I think that if
>>> you are building for KNL, you should add "-xMIC-AVX512" to your compiler
>>> flags to explicitly instruct the compiler to use the AVX512 instruction
>>> set.  I usually use something along the lines of
>>>
>>>   'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
>>>
>>> (The "-g" just adds symbols, which make the output from performance
>>> profiling tools much more useful.)
>>>
>>> That said, I think that if you are comparing 1024 Haswell cores vs. 1024
>>> KNL cores (so double the number of Haswell nodes), I'm not surprised that
>>> the simulations are almost twice as fast using the Haswell nodes. Keep in
>>> mind that an individual KNL core is much less powerful than an individual
>>> Haswell core. You are also using roughly twice the power footprint (a
>>> dual-socket Haswell node should be roughly equivalent to a KNL node, I
>>> believe). How do things look when you compare equal numbers of nodes?
>>>
>>> Cheers,
>>> Richard
>>>
>>> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <jychang48 at gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> On NERSC's Cori I have the following configure options for PETSc:
>>>
>>> ./configure --download-fblaslapack --with-cc=cc --with-clib-autodetect=0
>>> --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
>>> --with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1
>>> COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
>>>
>>> I swapped out the default Intel programming environment for the Cray one
>>> (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3'). I want to
>>> document the performance difference between Cori's Haswell and KNL
>>> processors.
>>>
>>> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell
>>> nodes vs. 16 KNL nodes), the simulations are almost twice as fast on the
>>> Haswell nodes, which leads me to suspect that I am not doing something
>>> right for KNL. Does anyone know some "optimal" configure options for
>>> running PETSc on KNL?
>>>
>>> Thanks,
>>> Justin
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
-------------- next part --------------
Attachments (job output files):
testhas_flat_1node.o4407087 (38661 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20170404/eada1bcc/attachment-0002.obj>
testknl_flat_1node.o4407080 (38551 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20170404/eada1bcc/attachment-0003.obj>

