[petsc-users] Scalability of AO ?

Sebastian Steiger steiger at purdue.edu
Wed Mar 9 10:03:58 CST 2011


Hi Barry

Thanks for this useful insight - after 2 years of employing most of
PETSc's functionality, we have finally found a corner that doesn't work
at process counts that only about 15 machines in the world can provide :-)

I was thinking of something similar to your outline of the
memory-scalable implementation. Since our code calls this AO function
only a few times (1-3x), a little communication overhead is really not
too bad for us.

Best
Sebastian



On 03/09/2011 10:11 AM, Barry Smith wrote:
> 
>   Sebastian,
> 
>     AOCreateMapping() is badly named (thanks, Matt); it is a "generalization" of AOCreateBasic(), which is (still) badly named, but not as badly named as AOCreateMapping().
> 
>      The thing to understand is that the AO operations are handled by function-table dispatch (runtime polymorphism, just like the Vec operations, Mat operations, etc.). So, in theory, there can be multiple AO implementations (concrete subclasses). Different implementations can have different performance characteristics; for example, those defined by AOCreateBasic() trade memory for speed: by storing the entire mapping on each process they are fast, since AOApplicationToPetsc() and AOPetscToApplication() for this implementation don't require any parallel communication. The name "Basic" is supposed to convey that it is a simple implementation without bells and whistles like scalability to very large problems.
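> 
> As a minimal illustration of that trade-off (made-up values, single-process case), the translation after AOCreateBasic() requires no communication at all:
> 
>   #include <petscao.h>
> 
>   int main(int argc, char **argv)
>   {
>     AO       ao;
>     /* single-process example: the application ordering is a permutation of 0..3 */
>     PetscInt myapp[] = {3, 0, 2, 1};
>     PetscInt idx[]   = {3, 1};        /* application indices, translated in place */
> 
>     PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);
>     /* PETSC_NULL for mypetsc: PETSc numbers are taken as 0,1,2,... in the order given */
>     AOCreateBasic(PETSC_COMM_WORLD, 4, myapp, PETSC_NULL, &ao);
>     /* purely local lookup: every process stores the whole mapping */
>     AOApplicationToPetsc(ao, 2, idx); /* idx becomes {0, 3} */
>     PetscFinalize();
>     return 0;
>   }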
> 
>     To get what you want with memory scalability, there needs to be another implementation of AO that does not store all values on all processes. This is possible to write and possibly requires communication during AOApplicationToPetsc() and AOPetscToApplication() to determine values that are not known locally. It is not terribly difficult to write, but it is clearly much more complicated than the operations defined with AOCreateBasic().
> 
>     So why doesn't it already exist? Well, most people write their code to avoid the need for the AOApplicationToPetsc() and AOPetscToApplication() functionality, and those who do use AO don't run problems on such a huge system as you do.
> 
>     So, what should we do? Well, the best thing to do would be to implement a memory-scalable version of the AO. One way to do this is to divide the numbers 0 to N-1 across the processes and have process 0 know the mapping for application ordering numbers 0 to N/p-1, process 1 for application ordering numbers N/p to 2N/p-1, etc. Then, when any process needs to know the PETSc number for application number a, it determines which process knows the PETSc number of a (by simply finding the k for which k*N/p <= a < (k+1)*N/p holds) and sends a message to process k to get the PETSc number. Similarly, to handle the mapping from PETSc to application, one would divide up the knowledge of the application number for each PETSc number in the same way. One could make it even more sophisticated and have each process "cache" the information about each partner it has already checked with, so that it need not communicate if the user asks again for the same value. Under normal circumstances I don't think this is needed, because one shouldn't normally be calling AOApplicationToPetsc() or AOPetscToApplication() many times.
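> 
> A minimal sketch of the ownership computation just described (function and array names are illustrative, and N is assumed divisible by p):
> 
>   /* Which process knows the PETSc number for application number a? */
>   static PetscMPIInt OwnerOfAppNumber(PetscInt a, PetscInt N, PetscMPIInt p)
>   {
>     return (PetscMPIInt)(a / (N / p));   /* the k with k*N/p <= a < (k+1)*N/p */
>   }
> 
>   /* To translate an application number a that is not known locally:
>        owner = OwnerOfAppNumber(a, N, p);
>        send a to owner, which looks up petsc_of_app[a - owner*(N/p)] in its
>        local slice and sends the PETSc number back. */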
> 
>   So who will write the memory scalable implementation? Maybe it is time to finally get it done.
> 
> 
>     Barry
> 
> 
> 
> 
> On Mar 9, 2011, at 8:47 AM, Sebastian Steiger wrote:
> 
>> Hello PETSc experts
>>
>> I have a parallel application that builds extensively on PETSc
>> functionality and also uses the AO commands AOCreateMapping and
>> AOApplicationToPetsc. We are currently doing some benchmarks on jaguar,
>> the world's second-fastest computer, where we find some interior
>> eigenvalues of a really large matrix (in conjunction with SLEPc).
>>
>> The application runs fine when using 40'000 cores and a matrix size of
>> 400 million. There are 20 million AO-indices. However, when I scale up
>> to 80'000 / 800 million / 40 million, I am running out of memory (jaguar
>> has 1.3GB/core). I am pretty sure that in our own code all vectors have
>> only the size of the local degrees of freedom, which stays constant at
>> around 10'000.
>>
>> I figured out that I am running out of memory when I call
>> AOCreateMapping. When I look inside aomapping.c, I see a comment "get
>> all indices on all processors" near line 330 and some MPI_Allgatherv's.
>> That suggests the routine AOApplicationToPetsc is not scalable.
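>>
>> For concreteness, the call in question looks roughly like this (variable names are illustrative):
>>
>>   AO       ao;
>>   PetscInt napp   = n_local_nodes;   /* number of application indices owned locally */
>>   PetscInt *myapp = local_node_ids;  /* our application (mesh) numbering */
>>
>>   /* PETSC_NULL for mypetsc lets PETSc pick the PETSc numbering itself */
>>   AOCreateMapping(PETSC_COMM_WORLD, napp, myapp, PETSC_NULL, &ao);
>>   /* inside aomapping.c, all indices are gathered onto every process (MPI_Allgatherv) */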
>>
>> Without having thought about it for too long, it seems to me that this
>> mapping could be created without needing to communicate all indices to
>> all processors (I am using PETSC_NULL for the mypetsc argument). Let me
>> know what you think about this.
>>
>> Best
>> Sebastian
> 


