[petsc-users] questions about vectorization
Jed Brown
jed at jedbrown.org
Sat Nov 18 17:51:59 CST 2017
Richard Tran Mills <rtmills at anl.gov> writes:
> On Tue, Nov 14, 2017 at 12:13 PM, Zhang, Hong <hongzhang at anl.gov> wrote:
>
>>
>>
>> On Nov 13, 2017, at 10:49 PM, Xiangdong <epscodes at gmail.com> wrote:
>>
>> 1) How about the vectorization of BAIJ format?
>>
>>
>> BAIJ kernels are optimized with manual unrolling, but not with AVX
>> intrinsics, so vectorization relies on the compiler: the kernels may
>> or may not get vectorized, depending on the compiler's optimization
>> decisions. But vectorization is not essential for the performance of
>> most BAIJ kernels.
>>
>
> I know that this has come up in previous discussions, but I'm guessing that
> the manual unrolling actually impedes the ability of many modern compilers
> to optimize the BAIJ calculations. I suppose we ought to have a switch to
> enable or disable the use of the unrolled versions? (And, further down the
> road, some sort of performance model to tell us what the setting for the
> switch should be...)
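For context, the manual unrolling under discussion looks roughly like
this in the BAIJ(4) MatMult kernel. This is a condensed sketch in the
style of MatMult_SeqBAIJ_4, not the exact PETSc source; the function
name and signature are illustrative. ai/aj are the block-CSR row
pointers and block column indices, and each 4x4 block is stored
column-major.

#include <petscsys.h> /* PetscInt, PetscScalar */

static void baij4_mult_unrolled(PetscInt mbs,const PetscInt *ai,
                                const PetscInt *aj,const PetscScalar *aa,
                                const PetscScalar *x,PetscScalar *y)
{
  for (PetscInt i=0; i<mbs; i++) {
    PetscScalar sum1=0,sum2=0,sum3=0,sum4=0;
    for (PetscInt k=ai[i]; k<ai[i+1]; k++) {
      const PetscScalar *xb = x + 4*aj[k]; /* block of x for column aj[k] */
      const PetscScalar *v  = aa + 16*k;   /* 4x4 block, column-major */
      PetscScalar x1=xb[0],x2=xb[1],x3=xb[2],x4=xb[3];
      sum1 += v[0]*x1 + v[4]*x2 + v[8]*x3  + v[12]*x4;
      sum2 += v[1]*x1 + v[5]*x2 + v[9]*x3  + v[13]*x4;
      sum3 += v[2]*x1 + v[6]*x2 + v[10]*x3 + v[14]*x4;
      sum4 += v[3]*x1 + v[7]*x2 + v[11]*x3 + v[15]*x4;
    }
    y[4*i+0]=sum1; y[4*i+1]=sum2; y[4*i+2]=sum3; y[4*i+3]=sum4;
  }
}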
I added a crude test for BAIJ(4); see branch 'jed/matbaij-loop'.
Clang-5.0 is a bit better than gcc-7.2 for this problem: GCC produces
comparable code and performance with both versions, while Clang produces
tighter code (see below) for the current (fully unrolled) version, yet
that code actually executes slower than the loop code. Testing as below
produces a matrix with 284160 nonzeros (a 2.4 MB matrix, which fits in
my L3 cache). I use BCGS instead of GMRES so that the solve can stay
resident in cache.
$ mpich-clang-opt/tests/src/snes/examples/tutorials/ex19 -da_grid_x 60 -da_grid_y 60 -prandtl 1e4 -ksp_type bcgs -dm_mat_type baij -pc_type none -mat_baij_loop 0 -log_view |grep MatMult
MatMult 16269 1.0 1.8919e+00 1.0 9.01e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78 77 0 0 0 78 77 0 0 0 4763
clang MatMult_SeqBAIJ_4 (fully unrolled, -mat_baij_loop 0):
0.73 │2f0: movsxd rdi,DWORD PTR [rbp+0x0]
2.44 │ add rbp,0x4
0.24 │ shl rdi,0x5
0.98 │ vbroad ymm1,QWORD PTR [rax+rdi*1]
0.73 │ vbroad ymm2,QWORD PTR [rax+rdi*1+0x8]
2.93 │ vbroad ymm3,QWORD PTR [rax+rdi*1+0x10]
0.98 │ vbroad ymm4,QWORD PTR [rax+rdi*1+0x18]
2.44 │ vfmadd ymm1,ymm0,YMMWORD PTR [rsi]
23.47 │ vfmadd ymm1,ymm2,YMMWORD PTR [rsi+0x20]
8.31 │ vfmadd ymm1,ymm3,YMMWORD PTR [rsi+0x40]
0.98 │ vmovap ymm0,ymm1
26.89 │ vfmadd ymm0,ymm4,YMMWORD PTR [rsi+0x60]
0.49 │ sub rsi,0xffffffffffffff80
│ add edx,0xffffffff
0.24 │ ↑ jne 2f0
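The loop variant that -mat_baij_loop 1 selects is roughly the following
(again a sketch under the same assumptions, not the exact branch code).
The fixed trip counts of 4 are what let the compiler vectorize across
the block: load a column of v, broadcast xb[c], and fuse the
multiply-add, which is the pattern visible in the annotation below.

#include <petscsys.h> /* PetscInt, PetscScalar */

static void baij4_mult_loop(PetscInt mbs,const PetscInt *ai,
                            const PetscInt *aj,const PetscScalar *aa,
                            const PetscScalar *x,PetscScalar *y)
{
  for (PetscInt i=0; i<mbs; i++) {
    PetscScalar sum[4] = {0,0,0,0};
    for (PetscInt k=ai[i]; k<ai[i+1]; k++) {
      const PetscScalar *xb = x + 4*aj[k];
      const PetscScalar *v  = aa + 16*k;   /* 4x4 block, column-major */
      for (PetscInt c=0; c<4; c++)         /* columns of the block */
        for (PetscInt r=0; r<4; r++)       /* rows: vectorized by clang */
          sum[r] += v[4*c+r]*xb[c];
    }
    for (PetscInt r=0; r<4; r++) y[4*i+r] = sum[r];
  }
}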
$ mpich-clang-opt/tests/src/snes/examples/tutorials/ex19 -da_grid_x 60 -da_grid_y 60 -prandtl 1e4 -ksp_type bcgs -dm_mat_type baij -pc_type none -mat_baij_loop 1 -log_view |grep MatMult
MatMult 16269 1.0 1.6305e+00 1.0 9.01e+09 1.0 0.0e+00 0.0e+00 0.0e+00 73 77 0 0 0 73 77 0 0 0 5527
1.86 │130: cdqe
│ vmovup ymm2,YMMWORD PTR [rbx+rax*8]
14.60 │ vmovup ymm3,YMMWORD PTR [rbx+rax*8+0x20]
1.24 │ vmovup ymm4,YMMWORD PTR [rbx+rax*8+0x40]
16.77 │ vmovup ymm5,YMMWORD PTR [rbx+rax*8+0x60]
2.17 │ vmovap YMMWORD PTR [rsp+0xc0],ymm5
0.93 │ vmovap YMMWORD PTR [rsp+0xa0],ymm4
0.62 │ vmovap YMMWORD PTR [rsp+0x80],ymm3
1.86 │ vmovap YMMWORD PTR [rsp+0x60],ymm2
0.93 │ mov esi,DWORD PTR [r13+rdi*4+0x0]
0.62 │ shl esi,0x2
0.62 │ movsxd rsi,esi
1.55 │ vbroad ymm2,QWORD PTR [rcx+rsi*8]
2.17 │ vfmadd ymm2,ymm1,YMMWORD PTR [rsp+0x60]
1.24 │ vbroad ymm1,QWORD PTR [rcx+rsi*8+0x8]
10.56 │ vfmadd ymm1,ymm2,YMMWORD PTR [rsp+0x80]
0.62 │ vbroad ymm2,QWORD PTR [rcx+rsi*8+0x10]
13.35 │ vfmadd ymm2,ymm1,YMMWORD PTR [rsp+0xa0]
1.86 │ vbroad ymm1,QWORD PTR [rcx+rsi*8+0x18]
15.53 │ vfmadd ymm1,ymm2,YMMWORD PTR [rsp+0xc0]
│ add rdi,0x1
│ add eax,0x10
│ cmp rdi,rdx
│ ↑ jl 130
The code with loops is faster with GCC as well (and with Clang it is
about 16% faster here: 5527 versus 4763 MF/s in the last -log_view
column above), but GCC's assembly is not as clean in either case.
I don't have time to do more comprehensive testing at the moment, but it
would be really useful to test with other block sizes, especially 3
(elasticity) and 5 (compressible flow), and with other compilers
(especially Intel). If the performance advantage of loops holds, we can
eliminate tons of code from PETSc by judicious use of inline functions;
a sketch of what that might look like follows.
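A hypothetical sketch of that consolidation, assuming one kernel
parameterized by block size (all names here are made up): a thin
wrapper passing a compile-time-constant bs lets the compiler specialize
the fixed-trip-count loops for each block size, so the hand-unrolled
per-size kernels could be deleted.

#include <petscsys.h> /* PetscInt, PetscScalar */

static inline void baij_mult_bs(PetscInt bs,PetscInt mbs,const PetscInt *ai,
                                const PetscInt *aj,const PetscScalar *aa,
                                const PetscScalar *x,PetscScalar *y)
{
  for (PetscInt i=0; i<mbs; i++) {
    PetscScalar *yb = y + bs*i;
    for (PetscInt r=0; r<bs; r++) yb[r] = 0;
    for (PetscInt k=ai[i]; k<ai[i+1]; k++) {
      const PetscScalar *xb = x + bs*aj[k];
      const PetscScalar *v  = aa + bs*bs*k; /* bs-by-bs block, column-major */
      for (PetscInt c=0; c<bs; c++)
        for (PetscInt r=0; r<bs; r++)
          yb[r] += v[bs*c+r]*xb[c];
    }
  }
}

/* One-line wrappers would replace the hand-unrolled per-size kernels: */
static void baij3_mult(PetscInt mbs,const PetscInt *ai,const PetscInt *aj,
                       const PetscScalar *aa,const PetscScalar *x,PetscScalar *y)
{ baij_mult_bs(3,mbs,ai,aj,aa,x,y); } /* bs = 3: elasticity */

Whether compilers actually generate code for bs=3 and bs=5 that matches
the unrolled kernels is exactly what the testing above would need to
establish.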