[petsc-dev] Fwd: Poisson step in GTS

Barry Smith bsmith at mcs.anl.gov
Sat Jun 18 20:33:42 CDT 2011



Begin forwarded message:

> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 18, 2011 4:40:07 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>, John Shalf <jshalf at lbl.gov>
> Subject: RE: Poisson step in GTS
> 
> Barry,
> 
> No, Cray does not provide any threaded BLAS 1.  Generally speaking it is not worth threading a single nested loop unless the trip count is very high, and that does not happen often enough to warrant a special BLAS.  In fact, I am not even sure we apply OpenMP to BLAS 2; I don't think so.
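> 
> To make the trip-count point concrete: a threaded BLAS-1 routine only pays off behind a guard roughly like the one below. This is an illustrative sketch (the routine name and the 10000-element cutoff are made up for the example, not anything shipped in a vendor library):
> 
>   #include <omp.h>
> 
>   /* Sketch: thread a BLAS-1 style daxpy only when the trip count is
>      large enough to amortize the OpenMP fork/join overhead.
>      The 10000-element cutoff is illustrative, not a tuned value. */
>   void daxpy_threaded(int n, double alpha, const double *x, double *y)
>   {
>     if (n < 10000) {                      /* small vectors: stay serial */
>       for (int i = 0; i < n; i++) y[i] += alpha * x[i];
>     } else {
>       #pragma omp parallel for schedule(static)
>       for (int i = 0; i < n; i++) y[i] += alpha * x[i];
>     }
>   }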
> 
> Nathan
> 
> Nathan Wichmann                Cray Inc.
> wichmann at cray.com              380 Jackson St
> Applications Engineer          St Paul, MN 55101
> office:  1-800-284-2729  x605-9079          
>  cell:  651-428-1131
> 
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
> Sent: Saturday, June 18, 2011 12:13 PM
> To: Nathan Wichmann
> Cc: Lois Curfman McInnes; Satish Balay; Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier; John Shalf
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
> 
>> Hi Robert, Barry and all,
>> 
>> Is it our assumption that the Poisson version of GTS will normally be run with 1 MPI rank per die and 6 OpenMP threads (on AMD Magny-Cours)?
> 
>   Our new vector and matrix classes will allow the flexibility of any number of MPI processes and any number of threads within each process. So 1 MPI rank and 6 threads is supportable.
> 
>> In that case there should be sufficient bandwidth for decent scaling; I would say something like Barry's Intel experience.  Barry is certainly correct that as one uses more cores one will become more bandwidth limited.
> 
>   I would be interested in seeing the OpenMP STREAM benchmark numbers for this system.
>> 
>> I also like John's comment: "we have little faith that the compiler will do anything intelligent."  Which compiler are you using?  If you are using CCE then you should get a .lst file to see what it is doing.  Probably the only thing that can and should be done is to unroll the inner loop.
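>> 
>> Concretely, unrolling that inner loop by hand would look roughly like the sketch below. The helper is hypothetical; aa, aj and jrow have the same meaning as in the MatMult_SeqAIJ code quoted further down in this thread, and the CCE .lst file would show whether the compiler already does something like this on its own:
>> 
>>   /* Sketch: dot product of one CSR row against x, inner loop unrolled by 4.
>>      aa = nonzero values, aj = column indices, jrow = row start, n = row length. */
>>   static double csr_row_dot_unroll4(const double *aa, const int *aj,
>>                                     int jrow, int n, const double *x)
>>   {
>>     double sum = 0.0;
>>     int    j   = 0;
>>     for (; j + 3 < n; j += 4) {            /* unrolled by 4 */
>>       sum += aa[jrow+j  ] * x[aj[jrow+j  ]]
>>            + aa[jrow+j+1] * x[aj[jrow+j+1]]
>>            + aa[jrow+j+2] * x[aj[jrow+j+2]]
>>            + aa[jrow+j+3] * x[aj[jrow+j+3]];
>>     }
>>     for (; j < n; j++)                     /* remainder loop */
>>       sum += aa[jrow+j] * x[aj[jrow+j]];
>>     return sum;
>>   }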
> 
>  Do you folks provide thread-based BLAS 1 operations, for example ddot, dscal, and daxpy? If so, we can piggy-back on those to get the best possible performance on the vector operations.
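> 
> (For reference, what we would be piggy-backing on amounts to little more than the following. This is a rough sketch of a threaded ddot, not a claim about how a vendor library actually implements it.)
> 
>   #include <omp.h>
> 
>   /* Rough sketch of a thread-based BLAS-1 ddot: an OpenMP reduction
>      over the elementwise products.  Illustrative only. */
>   double ddot_threaded(int n, const double *x, const double *y)
>   {
>     double sum = 0.0;
>     #pragma omp parallel for reduction(+:sum) schedule(static)
>     for (int i = 0; i < n; i++) sum += x[i] * y[i];
>     return sum;
>   }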
>> 
>> Another consideration is the typical size of "n".  Normally the denser the matrix, the larger n is, no?  But still, it would be interesting to know.
> 
>  In this application the matrix is extremely sparse, likely between 7 and 27 nonzeros per row. Matrices, of course, can get as big as you like.
> 
>   Barry
> 
>> 
>> Ciao!
>> Nathan
>> 
>> 
>> Nathan Wichmann                Cray Inc.
>> wichmann at cray.com              380 Jackson St
>> Applications Engineer          St Paul, MN 55101
>> office:  1-800-284-2729  x605-9079          
>> cell:  651-428-1131
>> 
>> -----Original Message-----
>> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
>> Sent: Friday, June 17, 2011 1:42 PM
>> To: Lois Curfman McInnes; Satish Balay
>> Cc: Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier; John Shalf; Nathan Wichmann
>> Subject: Re: Poisson step in GTS
>> 
>> 
>> With the point Jacobi solver, yes, MatMult_SeqAIJ() will be the most time-consuming code in the simulation.  With hypre's BoomerAMG it will be a far more complicated set of routines, because that is a much more sophisticated solver (for hypre you will need to work directly with Rob Falgout on threaded versions of his software).
>> 
>> Based on the request of the GTS folks (and others) we have begun implementing thread-based vector and matrix classes in PETSc that will allow all three of the routines
>> 
>>>>> 2399.257095   2399.257095   38.384429   MatMult_SeqAIJ (GTS-usertime)
>>>>>  565.571417    565.571417    9.048274   VecMAXPY_Seq (GTS-usertime)
>>>>>  481.171419    481.171419    7.698004   VecMDot_Seq (GTS-usertime)
>> 
>> to run using multiple threads and improve the performance a bit. Give us a few more weeks and this will be ready for basic use in petsc-dev.
>> 
>> ___BUT___ it is important to understand that MatMult and the Vec operations are all MEMORY BANDWIDTH LIMITED operations, and running on even a single core uses much of the physical memory bandwidth (depending on the system and how its memory is organized).  For example, on a traditional desktop you will see almost no improvement in going from one thread to two threads in these operations. On a very good Intel system, whose memory bandwidth (somewhat) scales with the number of cores, you will see some speedup. Based on just a little of our experience, you might see a speedup of 1.7 on 2 cores and 2.5 on 4 cores (this is on a decent server-class system, not a usual desktop). I have no experience on the AMD side so cannot say what will happen there. You can get a good handle on how they will do by running the OpenMP version of the STREAM benchmark on the system: if that scales well then the PETSc threaded routines will scale well; if it scales poorly then sparse linear algebra will scale poorly.
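>> 
>> For what it's worth, here is a minimal triad-style probe one could time at 1, 2, 4 and 6 threads to see how bandwidth scales on a given node. It is only an illustrative stand-in for the real OpenMP STREAM benchmark (the array size and output format are my own choices), not a replacement for it:
>> 
>>   #include <omp.h>
>>   #include <stdio.h>
>>   #include <stdlib.h>
>>
>>   /* STREAM-triad-like probe: a[i] = b[i] + s*c[i], with the thread count
>>      taken from OMP_NUM_THREADS.  Illustrative sketch only. */
>>   int main(void)
>>   {
>>     const int n = 20 * 1000 * 1000;            /* 3 arrays of 160 MB each */
>>     double *a = malloc(n * sizeof(double));
>>     double *b = malloc(n * sizeof(double));
>>     double *c = malloc(n * sizeof(double));
>>     const double s = 3.0;
>>     for (int i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }
>>
>>     double t = omp_get_wtime();
>>     #pragma omp parallel for schedule(static)
>>     for (int i = 0; i < n; i++) a[i] = b[i] + s * c[i];
>>     t = omp_get_wtime() - t;
>>
>>     /* 3 arrays are streamed: read b, read c, write a */
>>     printf("threads=%d  triad ~ %.1f GB/s  (check: a[0]=%g)\n",
>>            omp_get_max_threads(), 3.0 * n * sizeof(double) / t / 1e9, a[0]);
>>     free(a); free(b); free(c);
>>     return 0;
>>   }
>> 
>> If the GB/s number roughly doubles going from 1 to 2 threads and keeps growing toward 6, the threaded MatMult and Vec kernels should follow; if it flattens out early, they will too.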
>> 
>>  Feel free to continue this conversation, it is important,
>> 
>>  Barry
>> 
>> 
>> 
>> 
>> 
>> On Jun 17, 2011, at 6:07 AM, Lois Curfman McInnes wrote:
>> 
>>> 
>>> Thanks for the update.  I'm cc-ing this response to Barry, as he has been thinking about issues with MPI/OpenMP and may have suggestions.
>>> 
>>> My understanding from prior conversations with Stephane et al. is that GTS typically solves this equation using Conjugate Gradient with either a simple Jacobi preconditioner or hypre.  The matrix-vector product is a key kernel in CG (or any Krylov method), so improvements there are indeed important for scalable linear solvers in GTS and in general.  I believe a matrix-vector product is also used for a poloidal smoother in GTS.  MatMult_SeqAIJ() is the local part of the parallel matrix-vector product for the default sparse matrix storage format, AIJ (problems with multiple unknowns per mesh point typically use a block variant).  You can see the parallel layout of a sparse matrix on slide 51 of this tutorial:
>>>         http://www.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/TACCTutorial2009.pdf
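>>> 
>>> For concreteness, the solver configuration described above corresponds roughly to the petsc-3.0-era calls sketched below (equivalently, the runtime options -ksp_type cg -pc_type jacobi, or -pc_type hypre -pc_hypre_type boomeramg for the hypre path). The helper name is hypothetical, and it assumes the matrix A and vectors b, x are already assembled:
>>> 
>>>   #include "petscksp.h"
>>> 
>>>   /* Sketch (petsc-3.0-era API): CG with Jacobi preconditioning.
>>>      Assumes Mat A and Vec b, x are already created and assembled. */
>>>   PetscErrorCode solve_poisson(Mat A, Vec b, Vec x)
>>>   {
>>>     KSP            ksp;
>>>     PC             pc;
>>>     PetscErrorCode ierr;
>>> 
>>>     PetscFunctionBegin;
>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
>>>     ierr = KSPSetOperators(ksp,A,A,SAME_NONZERO_PATTERN);CHKERRQ(ierr);
>>>     ierr = KSPSetType(ksp,KSPCG);CHKERRQ(ierr);
>>>     ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
>>>     ierr = PCSetType(pc,PCJACOBI);CHKERRQ(ierr);   /* or PCHYPRE for BoomerAMG */
>>>     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* let -ksp_type/-pc_type override */
>>>     ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
>>>     ierr = KSPDestroy(ksp);CHKERRQ(ierr);          /* petsc-3.0 form (no &ksp) */
>>>     PetscFunctionReturn(0);
>>>   }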
>>> 
>>> Barry:  Any comments?
>>> 
>>> Lois
>>> 
>>> On Jun 16, 2011, at 5:27 PM, Alice Koniges wrote:
>>> 
>>>> Erich Strohmaier has (or had?) an LDRD project to look at the seven dwarfs (modified to "parallel motifs") so maybe he has more information on other implementations?
>>>> 
>>>> I also cc Lois because she is a good PETSc contact.
>>>> 
>>>> On Jun 16, 2011, at 2:21 PM, Robert Preissl wrote:
>>>> 
>>>>> Hello John & Stephane,
>>>>> 
>>>>> I did a bit more debugging/profiling on the GTS Poisson step (Kamesh still needs time this week on the charger code) to find out:
>>>>> 1. what is the most time consuming kernel in PETSc? (is there even such a kernel??)
>>>>> 2. can we add OpenMP to it?
>>>>> 
>>>>> The most time-consuming part of the Poisson solver is a sparse matrix-vector multiplication (SpMV), which accesses main memory through index lists.
>>>>> (as verified at: http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/rechnersysteme/beschreibung_der_hpc_systeme/ultrasparc_t2/~rdt/ultrasparc_t2_smxv/?lang=de)
>>>>> 
>>>>> It seems that SpMV is one of the so-called "seven dwarfs", so there is a lot of literature about it.  I am starting to read through it to find out whether there are elegant OpenMP versions available; maybe John knows about this?
>>>>> 
>>>>> a bit more details:
>>>>> 
>>>>> The kernel of interest is called "MatMult_SeqAIJ", where most of the solve time is spent; this is also confirmed by performance analysis using the OpenSpeedShop toolkit as verified under:
>>>>> http://www.openspeedshop.org/wp/wp-content/uploads/2011/01/OSS_TARGET_USAGE2.html
>>>>> 
>>>>> Exclusive CPU time (s)   Inclusive CPU time (s)   % of Total Exclusive CPU Time   Function (defining location)
>>>>>            2399.257095              2399.257095                       38.384429   MatMult_SeqAIJ (GTS-usertime)
>>>>>             565.571417               565.571417                        9.048274   VecMAXPY_Seq (GTS-usertime)
>>>>>             481.171419               481.171419                        7.698004   VecMDot_Seq (GTS-usertime)
>>>>>             362.599993               362.599993                        5.801043   sched_yield (GTS-usertime)
>>>>>             191.371425               191.371425                        3.061649   MatSetValues_MPIAIJ (GTS-usertime: mpiaij.c,305)
>>>>>             145.371426               145.371426                        2.325720   pow (GTS-usertime)
>>>>> 
>>>>> 
>>>>> Anyway, the kernel looks like this: (copied from ./petsc300_CRAY-XE6_real64/src/mat/impls/aij/seq/aij.c  --  Line 963)
>>>>> 
>>>>>   for (i=0; i<m; i++) {
>>>>>     jrow = ii[i];
>>>>>     n    = ii[i+1] - jrow;
>>>>>     sum  = 0.0;
>>>>>     nonzerorow += (n>0);
>>>>>     for (j=0; j<n; j++) {
>>>>>       sum += aa[jrow]*x[aj[jrow]]; jrow++;
>>>>>     }
>>>>>     y[i] = sum;
>>>>>   }
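>>>>> 
>>>>> In case it helps the discussion: in this loop ii[] holds the CSR row offsets, aj[] the column indices and aa[] the nonzero values, and the rows are independent, so the outer loop can in principle be split across threads. A minimal sketch of an OpenMP version is below; this is only an illustration of the idea, not PETSc's own threaded code path:
>>>>> 
>>>>>   /* Hypothetical OpenMP variant of the loop above.  The jrow++ form is
>>>>>      replaced with jrow+j indexing so each iteration is self-contained,
>>>>>      and nonzerorow becomes a reduction.  Illustrative sketch only. */
>>>>>   #pragma omp parallel for schedule(static) private(jrow,n,sum,j) reduction(+:nonzerorow)
>>>>>   for (i=0; i<m; i++) {
>>>>>     jrow = ii[i];
>>>>>     n    = ii[i+1] - jrow;
>>>>>     sum  = 0.0;
>>>>>     nonzerorow += (n>0);
>>>>>     for (j=0; j<n; j++) {
>>>>>       sum += aa[jrow+j]*x[aj[jrow+j]];
>>>>>     }
>>>>>     y[i] = sum;
>>>>>   }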
>>>>> 
>>>>> 
>>>>> I am continuing my literature study, but just wanted to give you this update. Maybe we can talk a bit about this tomorrow or next week?
>>>>> 
>>>>> Robert
>>>>> 
>>>> 
>>> 
>> 
> 



