<div dir="ltr"><div><div>So I tried the following options:<br><br>-M 40<br>-N 40<br>-P 5<br>-da_refine 1/2/3/4<br>-log_view<br>-mg_coarse_pc_type gamg<br>-mg_levels_0_pc_type gamg<br>-mg_levels_1_sub_pc_type cholesky<br>-pc_type mg<br>-thi_mat_type baij<br><br></div>Performance improved dramatically. However, Haswell still beats out KNL but only by a little. Now it seems like MatSOR is taking some time (though I can't really judge whether it's significant or not). Attached are the log files.<br><br></div>If ex48 has SSE2 intrinsics, does that mean Haswell would almost always be better?<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 4, 2017 at 4:19 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

<br>

> Attached are the job output files (which include -log_view) for SNES ex48<br>

> run on a single haswell and knl node (32 and 64 cores respectively).<br>

> Started off with a coarse grid of size 40x40x5 and ran three different<br>

> tests with -da_refine 1/2/3 and -pc_type mg<br>

><br>

> What's interesting/strange is that if I try to do -da_refine 4 on KNL, I<br>

> get a slurm error that says: "slurmstepd: error: Step 4408401.0 exceeded<br>

> memory limit (96737652 > 94371840), being killed" but it runs perfectly<br>

> fine on Haswell. Adding -pc_mg_levels 7 enables KNL to run on -da_refine 4<br>

> but the performance still does not beat out haswell.<br>

><br>

> The performance spectrum (dofs/sec) for 1-3 levels of refinement looks like<br>

> this:<br>

><br>

> Haswell:<br>

> 2.416e+03<br>

> 1.490e+04<br>

> 5.188e+04<br>

><br>

> KNL:<br>

> 9.308e+02<br>

> 7.257e+03<br>

> 3.838e+04<br>

><br>

> This suggests to me that KNL performs better with larger problem<br>

> sizes.<br>

<br>

</span>Look at the events.  The (redundant) coarse LU factorization takes most<br>

of the run time on KNL.  The PETSc sparse LU is not vectorized and<br>

doesn't exploit dense blocks in the way that the optimized direct<br>

solvers do.  You'll note that the paper was more aggressive about<br>

minimizing the coarse grid size and used BoomerAMG instead of redundant<br>

direct solves to avoid this scaling problem.<br>
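<br>
As a minimal sketch of how to read the dofs/sec figures quoted above, the Python below computes the Haswell/KNL ratio at each refinement level. The function name speed_ratios is illustrative only; the numbers are copied from the quoted message.<br>
<pre>
```python
# dofs/sec figures copied from the quoted message, for -da_refine 1, 2, 3
haswell = [2.416e+03, 1.490e+04, 5.188e+04]
knl = [9.308e+02, 7.257e+03, 3.838e+04]


def speed_ratios(fast, slow):
    """Return the element-wise fast/slow ratios (illustrative helper)."""
    return [f / s for f, s in zip(fast, slow)]


ratios = speed_ratios(haswell, knl)
for level, r in enumerate(ratios, start=1):
    # A ratio above 1 means Haswell processed more dofs/sec than KNL
    print(f"-da_refine {level}: Haswell/KNL = {r:.2f}")
```
</pre>
The ratio shrinks from about 2.6 to about 1.35 as the problem is refined, which fits the observation that KNL does relatively better on larger problems.<br>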

<div class="HOEnZb"><div class="h5"><br>

> On Tue, Apr 4, 2017 at 11:05 AM, Matthew Knepley <<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

><br>

>> On Tue, Apr 4, 2017 at 10:57 AM, Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> wrote:<br>

>><br>

>>> Thanks everyone for the helpful advice. So I tried all the suggestions<br>

>>> including using libsci. The performance did not improve for my particular<br>

>>> runs, which I think suggests the problem parameters chosen for my tests<br>

>>> (SNES ex48) are not optimal for KNL. Does anyone have example test runs I<br>

>>> could reproduce that compare the performance between KNL and<br>

>>> Haswell/Ivybridge/etc?<br>

>>><br>

>><br>

>> Let's try to see what is going on with your existing data first.<br>

>><br>

>> First, I think the main thing is to make sure we are using MCDRAM.<br>

>> Everything else in KNL<br>

>> is window dressing (IMHO). All we have to look at is something like MAXPY.<br>

>> You can get the<br>

>> bandwidth estimate from the flop rate and problem size (I think), and we<br>

>> can at least get<br>

>> bandwidth ratios between Haswell and KNL with that number.<br>
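<br>
A rough sketch of the bandwidth estimate described above, assuming an idealized traffic model for MAXPY (y += sum of alpha_i * x_i): each of the nvecs input vectors is read once, y is read once and written once, and each entry costs 2 flops per vector. The function names and the example flop rates are hypothetical placeholders, not measurements from these runs, and the real kernel may move more data than this model assumes.<br>
<pre>
```python
def maxpy_bytes_per_flop(nvecs, bytes_per_scalar=8):
    """Ideal memory traffic per flop for y += sum(alpha_i * x_i).

    Assumed model: nvecs vectors read once, y read once and written once,
    2 flops (multiply and add) per entry per vector.
    """
    bytes_moved = (nvecs + 2) * bytes_per_scalar
    flops = 2 * nvecs
    return bytes_moved / flops


def bandwidth_gb_per_s(mflops, nvecs):
    """Convert a flop rate in Mflop/s into an estimated bandwidth in GB/s."""
    return mflops * 1e6 * maxpy_bytes_per_flop(nvecs) / 1e9


# Hypothetical flop rates in Mflop/s, NOT taken from the attached logs
haswell_mflops = 2000.0
knl_mflops = 6000.0

for name, rate in [("Haswell", haswell_mflops), ("KNL", knl_mflops)]:
    print(f"{name}: about {bandwidth_gb_per_s(rate, nvecs=1):.1f} GB/s")
```
</pre>
With the same number of vectors on both machines, the bytes-per-flop factor cancels, so the bandwidth ratio between Haswell and KNL is simply the ratio of their MAXPY flop rates.<br>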

>><br>

>>    Matt<br>

>><br>

>><br>

>>> On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <<a href="mailto:richardtmills@gmail.com">richardtmills@gmail.com</a>><br>

>>> wrote:<br>

>>><br>

>>>> Yes, one should rely on MKL (or Cray LibSci, if using the Cray<br>

>>>> toolchain) on Cori.  But I'm guessing that this will make no noticeable<br>

>>>> difference for what Justin is doing.<br>

>>>><br>

>>>> --Richard<br>

>>>><br>

>>>> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <<a href="mailto:keceli@gmail.com">keceli@gmail.com</a>> wrote:<br>

>>>><br>

>>>> How about replacing --download-fblaslapack with vendor specific<br>

>>>> BLAS/LAPACK?<br>

>>>><br>

>>>> Murat<br>

>>>><br>

>>>> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <<a href="mailto:richardtmills@gmail.com">richardtmills@gmail.com</a>><br>

>>>> wrote:<br>

>>>><br>

>>>> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <<a href="mailto:hongzhang@anl.gov">hongzhang@anl.gov</a>> wrote:<br>

>>>><br>

>>>><br>

>>>> On Apr 3, 2017, at 1:44 PM, Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> wrote:<br>

>>>><br>

>>>> Richard,<br>

>>>><br>

>>>> This is what my job script looks like:<br>

>>>><br>

>>>> #!/bin/bash<br>

>>>> #SBATCH -N 16<br>

>>>> #SBATCH -C knl,quad,flat<br>

>>>> #SBATCH -p regular<br>

>>>> #SBATCH -J knlflat1024<br>

>>>> #SBATCH -L SCRATCH<br>

>>>> #SBATCH -o knlflat1024.o%j<br>

>>>> #SBATCH --mail-type=ALL<br>

>>>> #SBATCH --mail-user=<a href="mailto:jychang48@gmail.com">jychang48@gmail.<wbr>com</a><br>

>>>> #SBATCH -t 00:20:00<br>

>>>><br>

>>>> #run the application:<br>

>>>> cd $SCRATCH/Icesheet<br>

>>>> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori<br>

>>>> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N<br>

>>>> 128 -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine<br>

>>>> 1<br>

>>>><br>

>>>><br>

>>>> Maybe it is a typo. It should be numactl -m 1.<br>

>>>><br>

>>>><br>

>>>> "-p 1" will also work.  "-p" means to "prefer" NUMA node 1 (the MCDRAM),<br>

>>>> whereas "-m" means to use only NUMA node 1.  In the former case, MCDRAM<br>

>>>> will be used for allocations until the available memory there has been<br>

>>>> exhausted, and then things will spill over into the DRAM.  One would think<br>

>>>> that "-m" would be better for doing performance studies, but on systems<br>

>>>> where the nodes have swap space enabled, you can get terrible performance<br>

>>>> if your code's working set exceeds the size of the MCDRAM, as the system<br>

>>>> will obediently honor your request not to use the DRAM and go straight to the<br>

>>>> swap disk!  I assume the Cori nodes don't have swap space, though I could<br>

>>>> be wrong.<br>

>>>><br>

>>>><br>

>>>> According to the NERSC info pages, they say to add the "numactl" if<br>

>>>> using flat mode. Previously I tried cache mode but the performance seems to<br>

>>>> be unaffected.<br>

>>>><br>

>>>><br>

>>>> Using cache mode should give similar performance to using flat mode with<br>

>>>> the numactl option. But both approaches should be significantly faster than<br>

>>>> using flat mode without the numactl option. I usually see over 3X speedup.<br>

>>>> You can also do such a comparison to see if the high-bandwidth memory is<br>

>>>> working properly.<br>

>>>><br>

>>>> I also compared 256 Haswell nodes vs 256 KNL nodes and Haswell is<br>

>>>> nearly 4-5x faster. Though I suspect this drastic change has much to do<br>

>>>> with the initial coarse grid size now being extremely small.<br>

>>>><br>

>>>> I think you may be right about why you see such a big difference.  The<br>

>>>> KNL nodes need enough work to be able to use the SIMD lanes effectively.<br>

>>>> Also, if your problem gets small enough, then it's going to be able to fit<br>

>>>> in the Haswell's L3 cache.  Although KNL has MCDRAM and this delivers *a<br>

>>>> lot* more memory bandwidth than the DDR4 memory, it will deliver a lot less<br>

>>>> bandwidth than the Haswell's L3.<br>

>>>><br>

>>>> I'll give the COPTFLAGS a try and see what happens<br>

>>>><br>

>>>><br>

>>>> Make sure to use --with-memalign=64 for data alignment when configuring<br>

>>>> PETSc.<br>

>>>><br>

>>>><br>

>>>> Ah, yes, I forgot that.  Thanks for mentioning it, Hong!<br>

>>>><br>

>>>><br>

>>>> The option -xMIC-AVX512 would improve the vectorization performance. But<br>

>>>> it may cause problems for the MPIBAIJ format for some unknown reason.<br>

>>>> MPIAIJ should work fine with this option.<br>

>>>><br>

>>>><br>

>>>> Hmm.  Try both, and, if you see worse performance with MPIBAIJ, let us<br>

>>>> know and I'll try to figure this out.<br>

>>>><br>

>>>> --Richard<br>

>>>><br>

>>>><br>

>>>><br>

>>>> Hong (Mr.)<br>

>>>><br>

>>>> Thanks,<br>

>>>> Justin<br>

>>>><br>

>>>> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <<a href="mailto:richardtmills@gmail.com">richardtmills@gmail.com</a>><br>

>>>> wrote:<br>

>>>><br>

>>>> Hi Justin,<br>

>>>><br>

>>>> How is the MCDRAM (on-package "high-bandwidth memory") configured for<br>

>>>> your KNL runs?  And if it is in "flat" mode, what are you doing to ensure<br>

>>>> that you use the MCDRAM?  Doing this wrong seems to be one of the most<br>

>>>> common reasons for unexpected poor performance on KNL.<br>

>>>><br>

>>>> I'm not that familiar with the environment on Cori, but I think that if<br>

>>>> you are building for KNL, you should add "-xMIC-AVX512" to your compiler<br>

>>>> flags to explicitly instruct the compiler to use the AVX512 instruction<br>

>>>> set.  I usually use something along the lines of<br>

>>>><br>

>>>>   'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'<br>

>>>><br>

>>>> (The "-g" just adds symbols, which make the output from performance<br>

>>>> profiling tools much more useful.)<br>

>>>><br>

>>>> That said, I think that if you are comparing 1024 Haswell cores vs. 1024<br>

>>>> KNL cores (so double the number of Haswell nodes), I'm not surprised that<br>

>>>> the simulations are almost twice as fast using the Haswell nodes.  Keep in<br>

>>>> mind that individual KNL cores are much less powerful than an individual<br>

>>>> Haswell core.  You are also using roughly twice the power footprint (dual<br>

>>>> socket Haswell node should be roughly equivalent to a KNL node, I<br>

>>>> believe).  How do things look when you compare equal numbers of nodes?<br>

>>>><br>

>>>> Cheers,<br>

>>>> Richard<br>

>>>><br>

>>>> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>><br>

>>>> wrote:<br>

>>>><br>

>>>> Hi all,<br>

>>>><br>

>>>> On NERSC's Cori I have the following configure options for PETSc:<br>

>>>><br>

>>>> ./configure --download-fblaslapack --with-cc=cc --with-clib-autodetect=0<br>

>>>> --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn<br>

>>>> --with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1<br>

>>>> COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt<br>

>>>><br>

>>>> Where I swapped out the default Intel programming environment with that<br>

>>>> of Cray (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3'). I<br>

>>>> want to document the performance difference between Cori's Haswell and KNL<br>

>>>> processors.<br>

>>>><br>

>>>> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell and<br>

>>>> 16 KNL nodes), the simulations are almost twice as fast on Haswell nodes,<br>

>>>> which leads me to suspect that I am not doing something right for KNL. Does<br>

>>>> anyone know of some "optimal" configure options for running PETSc on<br>

>>>> KNL?<br>

>>>><br>

>>>> Thanks,<br>

>>>> Justin<br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>><br>

>><br>

>> --<br>

>> What most experimenters take for granted before they begin their<br>

>> experiments is infinitely more interesting than any results to which their<br>

>> experiments lead.<br>

>> -- Norbert Wiener<br>

>><br>

</div></div></blockquote></div><br></div>