[petsc-users] Scalability of AO ?

Barry Smith bsmith at mcs.anl.gov
Wed Mar 9 09:11:38 CST 2011


  Sebastian,

    AOCreateMapping() is badly named (thanks Matt); it is a "generalization" of AOCreateBasic(), which is (still) badly named, but not as badly named as AOCreateMapping().

     The thing to understand is that the AO operations are handled by function table dispatch (runtime polymorphism, just like the Vec operations, Mat operations, etc.). So, in theory, there can be multiple AO implementations (concrete subclasses). Different implementations can have different performance characteristics; for example, the AOs created by AOCreateBasic() trade memory for speed: by storing the entire mapping on each process they are fast, since AOApplicationToPetsc() and AOPetscToApplication() for this implementation do not require any parallel communication. The name "Basic" is supposed to convey that it is a simple implementation without bells and whistles like scalability to very large problems.
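     (For concreteness, a minimal sketch of the AOCreateBasic() usage pattern being discussed, written against current PETSc calling conventions; the index values are made up and error checking is omitted. The point is that the AOApplicationToPetsc() call is purely local.)

  #include <petscao.h>

  int main(int argc, char **argv)
  {
    AO          ao;
    PetscMPIInt rank;
    PetscInt    app[2], idx[2];

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    /* Each process contributes two (made-up) application indices; with
       AOCreateBasic() the complete application<->PETSc table ends up
       stored on every process, which is where the memory goes. */
    app[0] = 2 * rank;
    app[1] = 2 * rank + 1;
    AOCreateBasic(PETSC_COMM_WORLD, 2, app, NULL /* natural PETSc ordering */, &ao);

    /* Translate two application indices to PETSc indices in place; no
       communication is needed because the whole table is local. */
    idx[0] = 0;
    idx[1] = 2 * rank;
    AOApplicationToPetsc(ao, 2, idx);

    AODestroy(&ao);
    PetscFinalize();
    return 0;
  }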

    To get what you want with memory scalability there needs to be another implementation of AO that does not store all the values on all processes. This is possible to write, and it may require communication during AOApplicationToPetsc() and AOPetscToApplication() to determine values that are not known locally. It is not terribly difficult to write, but it is clearly much more complicated than the operations defined for AOCreateBasic().

    So why doesn't it already exist? Well, most people write their code to avoid the need for the AOApplicationToPetsc() and AOPetscToApplication() functionality, and those who do use AO don't run problems on such a huge system as you do.

    So, what should we do? Well, the best thing to do would be to implement a memory-scalable version of the AO. One way to do this is to divide up the numbers 0 to N-1 across the p processes: process 0 knows the mapping for application ordering numbers 0 to N/p-1, process 1 knows it for application ordering numbers N/p to 2N/p-1, etc. Then, when any process needs to know the PETSc number for application number a, it determines which process knows the PETSc number of a (simply the k for which k*N/p <= a < (k+1)*N/p holds) and sends a message to process k to get the PETSc number. Similarly, to handle the mapping from PETSc to application numbers, one would divide up the knowledge of the application number for each PETSc number in the same way. One could make it even more sophisticated and have each process "cache" the values it has already asked other processes for, so that it need not communicate if the user asks again for the same value. Under normal circumstances I don't think this is needed, because one shouldn't normally be calling AOApplicationToPetsc() or AOPetscToApplication() many times.
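    (Again, only a sketch and not PETSc code: one way to compute the owning process under the block distribution described above, assuming uniform blocks of size N/p with any remainder going to the last rank; the function name and the numbers are made up.)

  #include <mpi.h>
  #include <stdio.h>

  /* Return the rank that knows the PETSc number for application number a,
     assuming the N application numbers 0..N-1 are split into blocks of
     size N/p, rank k owning k*N/p .. (k+1)*N/p - 1, with any remainder
     assigned to the last rank (assumes N >= p). */
  static int OwnerOfAppNumber(long long a, long long N, int p)
  {
    long long block = N / p;
    int       owner = (int)(a / block);
    return (owner > p - 1) ? p - 1 : owner;
  }

  int main(int argc, char **argv)
  {
    int       rank, size;
    long long N = 40000000;   /* total number of AO indices (made up) */
    long long a = 12345;      /* an application number to translate (made up) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* In a memory-scalable AOApplicationToPetsc() one would send a to
       this owner rank and receive the PETSc number back; each rank
       stores only its N/size entries of the table. */
    if (rank == 0)
      printf("application number %lld is known by rank %d\n",
             a, OwnerOfAppNumber(a, N, size));

    MPI_Finalize();
    return 0;
  }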

  So who will write the memory-scalable implementation? Maybe it is time to finally get it done.


    Barry




On Mar 9, 2011, at 8:47 AM, Sebastian Steiger wrote:

> Hello PETSc experts
> 
> I have a parallel application that builds extensively on PETSc
> functionality and also uses the AO commands AOCreateMapping and
> AOApplicationToPetsc. We are currently doing some benchmarks on jaguar,
> the world's second-fastest computer, where we find some interior
> eigenvalues of a really large matrix (in conjunction with SLEPc).
> 
> The application runs fine when using 40'000 cores and a matrix size of
> 400 million. There are 20 million AO-indices. However, when I scale up
> to 80'000 / 800 million / 40 million, I am running out of memory (jaguar
> has 1.3GB/core). I am pretty sure that in our own code all vectors have
> only the size of the local degrees of freedom, which stays constant at
> around 10'000.
> 
> I figured out that I am running out of memory when I call
> AOCreateMapping. When I look inside aomapping.c, I see a comment "get
> all indices on all processors" near line 330 and some MPI_Allgatherv's.
> It seems that the routine AOApplicationToPetsc is not scalable.
> 
> Without having thought about it for too long, it seems to me that the
> task of creating this mapping could be done without needing to
> communicate all indices to all processors (I am using PETSC_NULL for
> the mypetsc argument). Let me know what you think about this.
> 
> Best
> Sebastian


