2 Questions about DAs

Mon May 12 19:22:04 CDT 2008

   A couple of items.

     Overlapping communication and computation is pretty much a myth. The CPU
is used by MPI to pack the messages and put them on the network so it is not
available for computation during this time. Usually if you try to overlap
communication and computation it will end up being slower and I've never seen
it faster. Vendors will try to trick you into buying a machine by saying it
does it, but it really doesn't. Just forget about trying to do it.

    Creating a DA involves a good amount of setup and some communication; it
is fine to use a few DAs, but setting up hundreds of DAs is not a good idea
UNLESS YOU DO TONS OF WORK for each DA. In your case you are doing just a tiny
amount of communication with each DA, so the DA setup time is dominating.

   If you have hundreds of vectors that you wish to communicate AT THE SAME
TIME (seems strange but I suppose it is possible), then rather than having
hundreds of DAGlobalToLocalBegin/End() in a row you will want to create an
additional "meta" DA that has the same m,n,p as the regular DA but has a dof
equal to the number of vectors you wish to communicate at the same time. Use
VecStrideScatterAll() to get the individual vectors into a meta vector, do the
DAGlobalToLocalBegin/End() on the meta vector to get the ghost values, and then
use VecStrideGatherAll() to get the values into the 322 individual ghosted
vectors. The reason to do it this way is so the values in all the vectors are
sent together in a single MPI message instead of the separate message that
would be needed for each of the small DAGlobalToLocalBegin/End().

    Barry

On May 12, 2008, at 6:21 PM, Milad Fatenejad wrote:

> Hi:
> I created a simple test problem that demonstrates the issue. In the
> test problem, 100 vectors are created using:
> single.cpp: a single distributed array and
> multi.cpp: 100 distributed arrays
>
> Some math is performed on the vectors, then they are scattered to
> local vectors..
>
> The log summary (running 2 processes) shows that multi.cpp uses more
> memory and performs more reductions than single.cpp, which is similar
> to the experience I had with my program...
>
> I hope this helps
> Milad
>
> On Mon, May 12, 2008 at 3:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> On Mon, May 12, 2008 at 3:01 PM, Milad Fatenejad <icksa1 at gmail.com> wrote:
>>> Hello:
>>> I've attached the result of two calculations. The file "log-multi-da"
>>> uses 1 DA for each vector (322 in all) and the file "log-single-da"
>>> uses 1 DA for the entire calculation. When using 322 DAs, about 10x
>>> more time is spent in VecScatterBegin and VecScatterEnd. Both runs
>>> used two processes.
>>>
>>> I should mention that the source code for these two runs was exactly
>>> the same; I didn't reorder the scatters differently. The only
>>> difference was the number of DAs.
>>>
>>> Any suggestions? Do you think this is related to the number of DAs,
>>> or something else?
>>
>> There are vastly different numbers of reductions and much bigger memory usage.
>> Please send the code and I will look at it.
>>
>>  Matt
>>
>>
>>
>>> Thanks for your help
>>> Milad
>>>
>>> On Mon, May 12, 2008 at 1:56 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>
>>>> On Mon, May 12, 2008 at 11:02 AM, Milad Fatenejad <mfatenejad at wisc.edu> wrote:
>>>>> Hello:
>>>>> I have two separate DA questions:
>>>>>
>>>>> 1) I am writing a large finite difference code and would like to be
>>>>> able to represent an array of vectors. I am currently doing this by
>>>>> creating a single DA and calling DACreateGlobalVector several times,
>>>>> but the manual also states that:
>>>>>
>>>>> "PETSc currently provides no container for multiple arrays sharing the
>>>>> same distributed array communication; note, however, that the dof
>>>>> parameter handles many cases of interest."
>>>>>
>>>>> I also found the following mailing list thread which describes how to
>>>>> use the dof parameter to represent several vectors:
>>>>>
>>>>>
>>>>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/petsc-users/2008/02/msg00040.html
>>>>>
>>>>> Where the following solution is proposed:
>>>>> """
>>>>> The easiest thing to do in C is to declare a struct:
>>>>>
>>>>> typedef struct {
>>>>>  PetscScalar v[3];
>>>>>  PetscScalar p;
>>>>> } Space;
>>>>>
>>>>> and then cast pointers
>>>>>
>>>>>  Space ***array;
>>>>>
>>>>>  DAVecGetArray(da, u, (void *) &array);
>>>>>
>>>>>     array[k][j][i].v[0] *= -1.0;
>>>>> """
>>>>>
>>>>> The problem with the proposed solution is that it uses a struct to
>>>>> get the individual values, but what if you don't know the number of
>>>>> degrees of freedom at compile time?
>>>>
>>>> It would be nice to get variable structs in C. However, you can just
>>>> dereference the object directly. For example, for 50 degrees of freedom,
>>>> you can do
>>>>
>>>>   array[k][j][i][47] *= -1.0;
>>>>
>>>>
>>>>> So my question is twofold:
>>>>> a) Is there a problem with just having a single DA and calling
>>>>> DACreateGlobalVector multiple times? Does this affect performance at
>>>>> all (I have many different vectors)?
>>>>
>>>> These are all independent objects. Thus, by itself, creating any number
>>>> of Vecs does nothing to performance (unless you start to run out of
>>>> memory).
>>>>
>>>>
>>>>> b) Is there a way to use the dof parameter when creating a DA when the
>>>>> number of degrees of freedom is not known at compile time?
>>>>> Specifically, I would like to be able to access the individual values
>>>>> of the vector, just like the example shows...
>>>>
>>>>
>>>> see above.
>>>>
>>>>> 2) The code I am writing has a lot of different parts which present a
>>>>> lot of opportunities to overlap communication and computation when
>>>>> scattering vectors to update values in the ghost points. Right now,
>>>>> all of my vectors (there are ~50 of them) share a single DA because
>>>>> they all have the same shape. However, by sharing a single DA, I can
>>>>> only scatter one vector at a time. It would be nice to be able to
>>>>> start scattering each vector right after I'm done computing it, and
>>>>> finish scattering it right before I need it again, but I can't because
>>>>> other vectors might need to be scattered in between. I then rewrote
>>>>> part of my code so that each vector had its own DA object, but this
>>>>> ended up being incredibly slow (I assume this is because I have so
>>>>> many vectors).
>>>>
>>>> The problem here is that buffering will have to be done for each
>>>> outstanding scatter. Thus I see two resolutions:
>>>>
>>>>  1) Duplicate the DA scatter for as many Vecs as you wish to scatter at
>>>>      once. This is essentially what you accomplish with separate DAs.
>>>>
>>>>  2) Use the dof method. However, this scatters ALL the vectors every
>>>>      time.
>>>>
>>>> I do not understand what performance problem you would have with multiple
>>>> DAs. With any performance questions, we suggest sending the output of
>>>> -log_summary so we have data to look at.
>>>>
>>>>  Matt
>>>>
>>>>
>>>>
>>>>> My question is, is there a way to scatter multiple vectors
>>>>> simultaneously without affecting the performance of the code? Does it
>>>>> make sense to do this?
>>>>>
>>>>>
>>>>> I'd really appreciate any help...
>>>>>
>>>>> Thanks
>>>>> Milad Fatenejad
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>>
>>>
>>
>>
>>
>> --
>>
>>
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>>
>>
> <log-multi><log-single><multi.cpp><single.cpp>