[petsc-dev] (S)BSTRM implementations for block sizes other than 4 and 5?
Dahai Guo
dhguo at ncsa.uiuc.edu
Mon May 9 09:37:55 CDT 2011
Jed:
I just implemented the basic framework of BSTRM and SBSTRM in PETSc. It works fairly well on IBM chips, since the IBM Power chip has a hardware component called a prefetch engine that handles multiple data prefetch streams. The following data show some initial SpMV tests on an IBM Power7 machine with one memory controller. You can get "cfd.2.10" from the PETSc group.
The efficiency of the format depends on sufficient cache size, memory bandwidth, power bus rate, and so on. We have not tested it on many Intel and AMD chips yet, although we would like to if we can find more machines. I will add more functions when I have time. If you like, you can add more functions yourself and improve it.
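For readers unfamiliar with the kernel being timed, here is a minimal sketch of a blocked sparse matrix-vector product over a block-CSR layout, the general idea behind BAIJ. It is not the PETSc kernel and not the BSTRM kernel: the function name `bsr_spmv`, the argument names, and the row-major block layout are all assumptions made for illustration.

```python
def bsr_spmv(bs, row_ptr, col_idx, vals, x):
    """Return y = A @ x for a block-CSR matrix with dense bs-by-bs blocks.

    Hypothetical layout (an assumption, not PETSc's internal storage):
      row_ptr[i]..row_ptr[i+1] index the blocks of block row i,
      col_idx[k] is the block column of block k,
      vals[k] is block k as a row-major flat list of length bs*bs.
    """
    nbrows = len(row_ptr) - 1
    y = [0.0] * (nbrows * bs)
    for i in range(nbrows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Slice of x that this block multiplies.
            xj = x[col_idx[k] * bs:(col_idx[k] + 1) * bs]
            blk = vals[k]
            for r in range(bs):
                acc = 0.0
                for c in range(bs):
                    acc += blk[r * bs + c] * xj[c]
                y[i * bs + r] += acc
    return y


# Small 4x4 example with bs = 2:
# [[1, 2, 0,  0],
#  [3, 4, 0,  0],
#  [5, 6, 7,  8],
#  [9, 10, 11, 12]]
example_y = bsr_spmv(
    2,
    [0, 1, 3],
    [0, 0, 1],
    [[1, 2, 3, 4], [5, 6, 9, 10], [7, 8, 11, 12]],
    [1.0, 1.0, 1.0, 1.0],
)
```

The inner loops read the block values strictly in order, which is the kind of sequential access that a hardware prefetcher can follow.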
Thanks,
Dahai
MATRIX: cfd.2.10 with bs = 5 (10 runs with a warmed-up cache)
MPI = 1
--- dt1_BAIJ, dt2_BSTRM = 48726, 28774, R = 1.69
--- dt1_SBAIJ, dt2_SBSTRM = 48726, 21365, R = 2.28
MPI = 2
--- dt1_BAIJ, dt2_BSTRM = 26877, 16321, R = 1.65
--- dt1_SBAIJ, dt2_SBSTRM = 26877, 15032, R = 1.79
MPI = 4
--- dt1_BAIJ, dt2_BSTRM = 14978, 10631, R = 1.41
--- dt1_SBAIJ, dt2_SBSTRM = 14978, 9109, R = 1.64
MPI = 8
--- dt1_BAIJ, dt2_BSTRM = 9071, 9738, R = 0.93 (not sure why; maybe it is because this P7 chip has only one memory controller)
--- dt1_SBAIJ, dt2_SBSTRM = 9174, 6329, R = 1.45
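The R column above appears to be the ratio dt1/dt2, rounded to two decimals. The short sketch below recomputes it from the timings as posted; the names `timings` and `speedup` are made up for this illustration, and the time unit is whatever the original measurements used (it is not stated).

```python
# Timings copied from the table above: (dt1, dt2) per MPI process count.
timings = {
    1: {"BSTRM": (48726, 28774), "SBSTRM": (48726, 21365)},
    2: {"BSTRM": (26877, 16321), "SBSTRM": (26877, 15032)},
    4: {"BSTRM": (14978, 10631), "SBSTRM": (14978, 9109)},
    8: {"BSTRM": (9071, 9738), "SBSTRM": (9174, 6329)},
}


def speedup(dt1, dt2):
    """Ratio of baseline time to streamed-format time, two decimals."""
    return round(dt1 / dt2, 2)


ratios = {
    nproc: {fmt: speedup(*pair) for fmt, pair in row.items()}
    for nproc, row in timings.items()
}

for nproc, row in ratios.items():
    print(f"MPI = {nproc}: " + ", ".join(f"{fmt} R = {r:.2f}" for fmt, r in row.items()))
```

A value above 1 means the streamed format was faster than the baseline in that run; only the BSTRM case at MPI = 8 falls below 1.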
----- Original Message -----
From: "Jed Brown" <jed at 59A2.org>
To: "For users of the development version of PETSc" <petsc-dev at mcs.anl.gov>
Cc: "Dahai Guo" <dhguo at ncsa.uiuc.edu>
Sent: Monday, May 9, 2011 8:55:47 AM
Subject: (S)BSTRM implementations for block sizes other than 4 and 5?
I was curious to try a benchmark, but don't have a problem with these block sizes handy. Are other block sizes planned? Does someone have benchmarks against the current (S)BAIJ implementations (with software prefetch)? I've seen the HPCA paper from Guo and Gropp, but I think that work was done before BAIJ had software prefetch, and perhaps also with a version of BSTRM that did not software prefetch, so I wonder how they compare now. Also, how is the performance for multiple processes per socket on Intel and AMD?