[petsc-dev] Cache blocking ex2.c

Nystrom, William D wdn at lanl.gov
Mon Mar 11 13:35:03 CDT 2013


This weekend, I spent some time making a modified version of src/ksp/ksp/examples/tutorials/ex2.c
in which I added the capability to divide the m x n mesh into blocks of size mblk x nblk.  I
believe I have this debugged and working properly.  In building the matrix, I traverse the blocks
in the same way the original scheme traverses the m x n mesh.  That is, within each block the
nblk direction varies fastest, and the blocks themselves are processed with the n direction
varying fastest.
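
For illustration, the traversal order described above might look like the sketch below.  It
assumes the natural row numbering Ii = i*n + j used by the stock ex2.c; the function name
blocked_order and its arguments are hypothetical and are not taken from the modified example.
It only records the order in which mesh points would be visited; it does not call PETSc.

```c
#include <assert.h>

/* Hypothetical helper (not part of ex2.c): fill order[] with the global row
   indices Ii = i*n + j of an m x n mesh, visited block by block.
   Blocks are mblk x nblk; blocks on the far edges are clipped when mblk or
   nblk does not divide m or n evenly.  order[] must hold at least m*n ints.
   Returns the number of entries written, which is always m*n. */
int blocked_order(int m, int n, int mblk, int nblk, int *order)
{
  int count = 0;

  assert(m > 0 && n > 0 && mblk > 0 && nblk > 0);

  for (int ib = 0; ib < m; ib += mblk) {        /* block rows: slowest */
    int iend = (ib + mblk < m) ? ib + mblk : m;
    for (int jb = 0; jb < n; jb += nblk) {      /* blocks along n: vary fastest */
      int jend = (jb + nblk < n) ? jb + nblk : n;
      for (int i = ib; i < iend; i++) {         /* rows inside one block */
        for (int j = jb; j < jend; j++) {       /* nblk direction: fastest */
          order[count++] = i * n + j;
        }
      }
    }
  }
  return count;
}
```

For a 4 x 4 mesh with 2 x 2 blocks this visits rows 0, 1, 4, 5, then 2, 3, 6, 7, and so on.
Note that a loop like this changes only the order in which rows are visited during assembly;
the numbering of the unknowns stays the same unless they are renumbered separately.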

I wanted to see whether better cache utilization would give better performance, both with and
without threads.  However, when I build the matrix the new way and vary mblk and nblk, I do not
get a meaningful speedup, either with or without threads.  Here is the sort of command I am
running with my hacked version of ex2:

ex2 -m 1000 -n 1000 -compute_matrix_flag 3 -mblk 50 -nblk 50 -ksp_type cg -pc_type jacobi \
       -log_summary -ksp_rtol 1.0e-10 -ksp_converged_reason -threadcomm_type pthread \
       -threadcomm_nthreads 12

I have varied mblk and nblk from 10 to 100.  I have not tried non-square aspect ratios.  I'm running
on my local workstation, which is a dual socket Xeon with 6 cores per socket.  Using 12 threads and
the original matrix build procedure of ex2, I get a speedup of about 3x over a single thread.  This
experiment was motivated by an email exchange with Jed several months back in which he suggested
that the organization of the matrix and mesh for the ex2.c example was poor.  Another motivation was
that I get pretty decent speedup using threads if the problem is small enough, for instance a 200 x 600
mesh.  But for larger problems like 1000 x 1000, the speedup decreases dramatically.  I assumed
this was because the smaller problem runs mostly out of cache.  I was hoping a blocking strategy
might help larger problems get better speedup by using cache better.
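
As a rough check on the cache assumption, the sketch below estimates the memory touched per
CG iteration for an m x n mesh.  Everything in it is an assumption rather than a measurement:
a 5-point stencil stored in AIJ (compressed row) format, 8-byte scalars, 4-byte indices, and
a caller-supplied count of solver vectors.  The function name est_working_set_bytes is
hypothetical.

```c
#include <assert.h>

/* Hypothetical estimate (not a PETSc routine) of the bytes touched per CG
   iteration for the 5-point Laplacian on an m x n mesh.
   Assumptions: AIJ storage with one 8-byte value and one 4-byte column
   index per nonzero, one 4-byte row offset per row (plus one), and nvecs
   solver vectors of 8-byte scalars. */
long long est_working_set_bytes(long long m, long long n, int nvecs)
{
  assert(m > 0 && n > 0 && nvecs >= 0);

  long long rows = m * n;
  /* Each row has a diagonal entry plus one entry per mesh neighbor;
     boundary rows have fewer neighbors, giving 5mn - 2m - 2n in total. */
  long long nnz = 5 * rows - 2 * m - 2 * n;

  long long matrix_bytes = nnz * (8 + 4) + (rows + 1) * 4;
  long long vector_bytes = (long long)nvecs * rows * 8;

  return matrix_bytes + vector_bytes;
}
```

With five vectors assumed, this gives roughly 12.5 MB for the 200 x 600 mesh and roughly
104 MB for the 1000 x 1000 mesh.  Comparing those figures with the last-level cache size of
the machine in question gives a first indication of whether the smaller problem could be
running mostly out of cache.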

Does what I am trying to do make sense?  Does my approach seem reasonable?  I'm happy to provide
my hacked version of ex2.c if anyone wants to look at it.

Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545



