[petsc-users] Configuring PETSc for KNL

Justin Chang jychang48 at gmail.com
Tue Apr 4 22:45:52 CDT 2017


So I tried the following options:

-M 40
-N 40
-P 5
-da_refine 1/2/3/4
-log_view
-mg_coarse_pc_type gamg
-mg_levels_0_pc_type gamg
-mg_levels_1_sub_pc_type cholesky
-pc_type mg
-thi_mat_type baij
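
For reference, a sketch of the full invocation these options correspond to
on one KNL node (the srun/numactl setup is an assumption carried over from
the job script quoted below; -da_refine took one value of 1-4 per run):

  srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./ex48cori \
    -M 40 -N 40 -P 5 -da_refine 4 -log_view \
    -thi_mat_type baij -pc_type mg \
    -mg_coarse_pc_type gamg -mg_levels_0_pc_type gamg \
    -mg_levels_1_sub_pc_type cholesky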

Performance improved dramatically. However, Haswell still beats out KNL but
only by a little. Now it seems like MatSOR is taking some time (though I
can't really judge whether it's significant or not). Attached are the log
files.

If ex48 has SSE2 intrinsics, does that mean Haswell would almost always be
better?

On Tue, Apr 4, 2017 at 4:19 PM, Jed Brown <jed at jedbrown.org> wrote:

> Justin Chang <jychang48 at gmail.com> writes:
>
> > Attached are the job output files (which include -log_view) for SNES ex48
> > run on a single Haswell and a single KNL node (32 and 64 cores,
> > respectively). Started off with a coarse grid of size 40x40x5 and ran
> > three different tests with -da_refine 1/2/3 and -pc_type mg.
> >
> > What's interesting/strange is that if I try to do -da_refine 4 on KNL, I
> > get a slurm error that says: "slurmstepd: error: Step 4408401.0 exceeded
> > memory limit (96737652 > 94371840), being killed", but it runs perfectly
> > fine on Haswell. Adding -pc_mg_levels 7 enables KNL to run with
> > -da_refine 4, but the performance still does not beat Haswell.
> >
> > The performance spectrum (dofs/sec) for 1-3 levels of refinement looks
> > like this:
> >
> > Haswell:
> > 2.416e+03
> > 1.490e+04
> > 5.188e+04
> >
> > KNL:
> > 9.308e+02
> > 7.257e+03
> > 3.838e+04
> >
> > Which might suggest to me that KNL performs better with larger problem
> > sizes.
>
> Look at the events.  The (redundant) coarse LU factorization takes most
> of the run time on KNL.  The PETSc sparse LU is not vectorized and
> doesn't exploit dense blocks in the way that the optimized direct
> solvers do.  You'll note that the paper was more aggressive about
> minimizing the coarse grid size and used BoomerAMG instead of redundant
> direct solves to avoid this scaling problem.
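>
> (A sketch of that alternative, assuming a PETSc build configured with
> --download-hypre: keep -pc_type mg but replace the redundant coarse LU
> with
>
>   -mg_coarse_pc_type hypre -mg_coarse_pc_hypre_type boomeramg
>
> so the coarse problem is handled by BoomerAMG instead of an unvectorized
> sparse LU.)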
>
> > On Tue, Apr 4, 2017 at 11:05 AM, Matthew Knepley <knepley at gmail.com> wrote:
> >
> >> On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <jychang48 at gmail.com> wrote:
> >>
> >>> Thanks everyone for the helpful advice. So I tried all the suggestions,
> >>> including using libsci. The performance did not improve for my
> >>> particular runs, which I think suggests the problem parameters chosen
> >>> for my tests (SNES ex48) are not optimal for KNL. Does anyone have
> >>> example test runs I could reproduce that compare the performance
> >>> between KNL and Haswell/Ivybridge/etc.?
> >>>
> >>
> >> Let's try to see what is going on with your existing data first.
> >>
> >> First, I think the main thing is to make sure we are using MCDRAM.
> >> Everything else in KNL is window dressing (IMHO). All we have to look at
> >> is something like MAXPY. You can get the bandwidth estimate from the
> >> flop rate and problem size (I think), and we can at least get bandwidth
> >> ratios between Haswell and KNL with that number.
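> >>
> >> (A sketch of that estimate: VecMAXPY with n input vectors of length N
> >> does about 2*n*N flops while moving roughly 8*(n+2)*N bytes, so
> >>
> >>   bandwidth ~ flop rate * 8*(n+2) / (2*n),
> >>
> >> i.e. about 4 bytes/flop for large n. Comparing that number between the
> >> Haswell and KNL logs gives the bandwidth ratio.)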
> >>
> >>    Matt
> >>
> >>
> >>> On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <richardtmills at gmail.com>
> >>> wrote:
> >>>
> >>>> Yes, one should rely on MKL (or Cray LibSci, if using the Cray
> >>>> toolchain) on Cori.  But I'm guessing that this will make no
> >>>> noticeable difference for what Justin is doing.
> >>>>
> >>>> --Richard
> >>>>
> >>>> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <keceli at gmail.com> wrote:
> >>>>
> >>>> How about replacing --download-fblaslapack with vendor-specific
> >>>> BLAS/LAPACK?
> >>>>
> >>>> Murat
> >>>>
> >>>> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <richardtmills at gmail.com> wrote:
> >>>>
> >>>> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
> >>>>
> >>>>
> >>>> On Apr 3, 2017, at 1:44 PM, Justin Chang <jychang48 at gmail.com> wrote:
> >>>>
> >>>> Richard,
> >>>>
> >>>> This is what my job script looks like:
> >>>>
> >>>> #!/bin/bash
> >>>> #SBATCH -N 16
> >>>> #SBATCH -C knl,quad,flat
> >>>> #SBATCH -p regular
> >>>> #SBATCH -J knlflat1024
> >>>> #SBATCH -L SCRATCH
> >>>> #SBATCH -o knlflat1024.o%j
> >>>> #SBATCH --mail-type=ALL
> >>>> #SBATCH --mail-user=jychang48 at gmail.com
> >>>> #SBATCH -t 00:20:00
> >>>>
> >>>> #run the application:
> >>>> cd $SCRATCH/Icesheet
> >>>> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
> >>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori \
> >>>>   -M 128 -N 128 -P 16 -thi_mat_type baij -pc_type mg \
> >>>>   -mg_coarse_pc_type gamg -da_refine 1
> >>>>
> >>>>
> >>>> Maybe it is a typo. It should be numactl -m 1.
> >>>>
> >>>>
> >>>> "-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the
> MCDRAM),
> >>>> whereas "-m" means to use only NUMA node 1.  In the former case,
> MCDRAM
> >>>> will be used for allocations until the available memory there has been
> >>>> exhausted, and then things will spill over into the DRAM.  One would
> think
> >>>> that "-m" would be better for doing performance studies, but on
> systems
> >>>> where the nodes have swap space enabled, you can get terrible
> performance
> >>>> if your code's working set exceeds the size of the MCDRAM, as the
> system
> >>>> will obediently obey your wishes to not use the DRAM and go straight
> to the
> >>>> swap disk!  I assume the Cori nodes don't have swap space, though I
> could
> >>>> be wrong.
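> >>>>
> >>>> For concreteness, a sketch of the two policies on the run above, with
> >>>> [options] standing in for the solver options:
> >>>>
> >>>>   numactl -p 1 /tmp/ex48cori [options]  # prefer MCDRAM, spill to DRAM
> >>>>   numactl -m 1 /tmp/ex48cori [options]  # MCDRAM only; overflow swaps
> >>>>                                         # or is killed instead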
> >>>>
> >>>>
> >>>> According to the NERSC info pages, they say to add the "numactl" if
> >>>> using flat mode. Previously I tried cache mode but the performance
> >>>> seems to be unaffected.
> >>>>
> >>>>
> >>>> Using cache mode should give similar performance to using flat mode
> >>>> with the numactl option. But both approaches should be significantly
> >>>> faster than using flat mode without the numactl option. I usually see
> >>>> over 3X speedup. You can also do such a comparison to see if the
> >>>> high-bandwidth memory is working properly.
> >>>>
> >>>> I also compared 256 Haswell nodes vs 256 KNL nodes and Haswell is
> >>>> nearly 4-5x faster. Though I suspect this drastic change has much to
> >>>> do with the initial coarse grid size now being extremely small.
> >>>>
> >>>> I think you may be right about why you see such a big difference.  The
> >>>> KNL nodes need enough work to be able to use the SIMD lanes
> >>>> effectively.  Also, if your problem gets small enough, then it's going
> >>>> to be able to fit in the Haswell's L3 cache.  Although KNL has MCDRAM
> >>>> and this delivers *a lot* more memory bandwidth than the DDR4 memory,
> >>>> it will deliver a lot less bandwidth than the Haswell's L3.
> >>>>
> >>>> I'll give the COPTFLAGS a try and see what happens.
> >>>>
> >>>>
> >>>> Make sure to use --with-memalign=64 for data alignment when
> >>>> configuring PETSc.
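> >>>>
> >>>> (Putting this thread's suggestions together, an untested sketch of a
> >>>> KNL-oriented configure line -- assuming the Intel programming
> >>>> environment, since -xMIC-AVX512 is an Intel compiler flag, and an
> >>>> arbitrary PETSC_ARCH name:
> >>>>
> >>>>   ./configure --with-cc=cc --with-cxx=CC --with-fc=ftn \
> >>>>     --with-debugging=0 --with-mpiexec=srun --with-64-bit-indices=1 \
> >>>>     --with-memalign=64 \
> >>>>     COPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512" \
> >>>>     CXXOPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512" \
> >>>>     FOPTFLAGS="-g -O3 -xMIC-AVX512" PETSC_ARCH=arch-cori-knl-opt
> >>>>
> >>>> plus the -autodetect=0 flags from the original configure line.)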
> >>>>
> >>>>
> >>>> Ah, yes, I forgot that.  Thanks for mentioning it, Hong!
> >>>>
> >>>>
> >>>> The option -xMIC-AVX512 would improve the vectorization performance.
> >>>> But it may cause problems for the MPIBAIJ format for some unknown
> >>>> reason.  MPIAIJ should work fine with this option.
> >>>>
> >>>>
> >>>> Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us
> >>>> know and I'll try to figure this out.
> >>>>
> >>>> --Richard
> >>>>
> >>>>
> >>>>
> >>>> Hong (Mr.)
> >>>>
> >>>> Thanks,
> >>>> Justin
> >>>>
> >>>> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <richardtmills at gmail.com> wrote:
> >>>>
> >>>> Hi Justin,
> >>>>
> >>>> How is the MCDRAM (on-package "high-bandwidth memory") configured for
> >>>> your KNL runs?  And if it is in "flat" mode, what are you doing to
> >>>> ensure that you use the MCDRAM?  Doing this wrong seems to be one of
> >>>> the most common reasons for unexpected poor performance on KNL.
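> >>>>
> >>>> (One quick way to check, as a sketch:
> >>>>
> >>>>   srun -n 1 numactl -H
> >>>>
> >>>> on a compute node. In quad/flat mode the MCDRAM should show up as a
> >>>> CPU-less NUMA node 1; in cache mode it does not appear as a separate
> >>>> node at all.)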
> >>>>
> >>>> I'm not that familiar with the environment on Cori, but I think that
> >>>> if you are building for KNL, you should add "-xMIC-AVX512" to your
> >>>> compiler flags to explicitly instruct the compiler to use the AVX512
> >>>> instruction set.  I usually use something along the lines of
> >>>>
> >>>>   'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
> >>>>
> >>>> (The "-g" just adds symbols, which make the output from performance
> >>>> profiling tools much more useful.)
> >>>>
> >>>> That said, I think that if you are comparing 1024 Haswell cores vs.
> >>>> 1024 KNL cores (so double the number of Haswell nodes), I'm not
> >>>> surprised that the simulations are almost twice as fast using the
> >>>> Haswell nodes.  Keep in mind that individual KNL cores are much less
> >>>> powerful than an individual Haswell core.  You are also using roughly
> >>>> twice the power footprint (a dual-socket Haswell node should be
> >>>> roughly equivalent to a KNL node, I believe).  How do things look
> >>>> when you compare equal numbers of nodes?
> >>>>
> >>>> Cheers,
> >>>> Richard
> >>>>
> >>>> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <jychang48 at gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> On NERSC's Cori I have the following configure options for PETSc:
> >>>>
> >>>> ./configure --download-fblaslapack --with-cc=cc \
> >>>>   --with-clib-autodetect=0 --with-cxx=CC --with-cxxlib-autodetect=0 \
> >>>>   --with-debugging=0 --with-fc=ftn --with-fortranlib-autodetect=0 \
> >>>>   --with-mpiexec=srun --with-64-bit-indices=1 \
> >>>>   COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
> >>>>
> >>>> Where I swapped out the default Intel programming environment with
> >>>> that of Cray (e.g., 'module switch PrgEnv-intel/6.0.3
> >>>> PrgEnv-cray/6.0.3').  I want to document the performance difference
> >>>> between Cori's Haswell and KNL processors.
> >>>>
> >>>> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell
> >>>> and 16 KNL nodes), the simulations are almost twice as fast on the
> >>>> Haswell nodes, which leads me to suspect that I am not doing
> >>>> something right for KNL. Does anyone know what some "optimal"
> >>>> configure options are for running PETSc on KNL?
> >>>>
> >>>> Thanks,
> >>>> Justin
> >>>>
> >>>>
> >>
> >>
> >> --
> >> What most experimenters take for granted before they begin their
> >> experiments is infinitely more interesting than any results to which
> >> their experiments lead.
> >> -- Norbert Wiener
> >>
>
(Attachments: testhas_sor_1node.o4410779 and testknl_sor_1node.o4410753,
the Haswell and KNL log files referenced above.)

