<div dir="ltr">Hi Barry,<div><br></div><div>I like your suggestion and I'll give this implementation a try. I've used some experimental tools that interpose on memory allocation calls and then track the accesses to give similar information, but having what you suggest implemented in PETSc would be easier and more useful in a lot of ways.</div><div><br></div><div>What we really need is dynamically updated priorities for which arrays get placed in the high-bandwidth memory. This sort of tracking might enable a reasonable way to estimate these priorities. (This only tells us about PETSc's memory and doesn't solve the "global" problem, but it's a start.)</div><div><br></div><div>I have to think about it a bit more, but I still believe that using something like move_pages(2) will preclude the use of a heap manager for the high-bandwidth memory. Maybe we don't need one. If we do, then, yes, I think we can deal with the inability to move an array between the different types of memory while keeping the same virtual address because we can just switch the ->array pointer.</div><div><br></div><div>I'll plan to implement the very simple (threshold-based) placement approach and the tracking you suggest, and then evaluate whether the simple approach seems adequate or whether it would be worthwhile to support more complex options.</div><div><br></div><div>--Richard<br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 3, 2015 at 7:39 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Richard,<br>
<br>
If the code does not use VecSetValues() then one could measure the "importance" of each vector by counting two numbers: the number of times VecGetArray() is called on the vector and the number of times VecGetArrayRead() is called. We don't currently measure this but you could add cntread and cntwrite fields to _p_Vec and have VecGetArray[Read]() increment them. Then in VecDestroy() just have the vector print its name and the counts. It would be interesting to see how many vectors there are, for an example in src/ts/examples/tutorials (or a subdirectory), and what the distribution of these counts is.<br>
<br>
Barry<br>
<br>
The reason this is unreliable when VecSetValues() is used is that EACH VecSetValues() call invokes VecGetArray(), which results in artificially high write counts even though each call accesses only a tiny part of the vector.<br>
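<br>
A minimal sketch of the counting idea above, assuming a stand-in struct rather than the real _p_Vec. SketchVec and the SketchVec*() functions are hypothetical names used only for illustration; they are not PETSc API.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for _p_Vec; the real PETSc struct has many more fields. */
typedef struct {
  const char *name;     /* label printed in the report */
  double     *array;    /* the vector's values */
  size_t      n;        /* number of entries */
  long        cntread;  /* times read-only access was requested */
  long        cntwrite; /* times read-write access was requested */
} SketchVec;

/* Analogue of VecGetArrayRead(): count the access, then hand out the data. */
static const double *SketchVecGetArrayRead(SketchVec *v)
{
  assert(v != NULL);
  v->cntread++;
  return v->array;
}

/* Analogue of VecGetArray(): a read-write request is counted as a write. */
static double *SketchVecGetArray(SketchVec *v)
{
  assert(v != NULL);
  v->cntwrite++;
  return v->array;
}

/* Analogue of the report suggested for VecDestroy(): format name and counts.
   Returns the value of snprintf(). */
static int SketchVecReport(const SketchVec *v, char *buf, size_t len)
{
  assert(v != NULL && buf != NULL);
  return snprintf(buf, len, "%s: reads=%ld writes=%ld",
                  v->name, v->cntread, v->cntwrite);
}
```

Counting requests rather than bytes touched keeps the overhead to one increment per call, which is also why the counts are misleading when each request touches only a few entries.<br>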
<div class="HOEnZb"><div class="h5"><br>
<br>
> On Jun 3, 2015, at 9:26 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
><br>
><br>
> To follow up on this, going back to my "advise object" to malloc being a living object as opposed to just some flags. In the case where different vectors may have very different "importances" at different times in the runtime of the simulation one could "switch" some vectors from using slow to faster memory when one knows the code is switching to a different phase where the vector "importances" are different.<br>
><br>
> Barry<br>
><br>
> Note that even if Intel cannot provide a way to "switch" a memory address between fast and slow it doesn't really matter from the PETSc point of view, since inside any particular PETSc vector we could switch the ->array pointer to a different memory location (and copy stuff over if needed) when changing a vector from important to unimportant or the opposite. This works because no code outside the vector object knows what the pointer is.<br>
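<br>
A rough sketch of the pointer switch described above, under the assumption that each kind of memory is reached through its own allocate/free pair. SketchVec, SketchAllocFn, SketchFreeFn, and SketchVecMigrate() are hypothetical names, not PETSc API; plain malloc()/free() can stand in for both kinds of memory when trying it out.<br>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator pair, one per kind of memory (fast or slow). */
typedef void *(*SketchAllocFn)(size_t bytes);
typedef void  (*SketchFreeFn)(void *ptr);

/* Hypothetical stand-in for a vector that owns its storage. */
typedef struct {
  double      *array;   /* current storage; callers never cache this pointer */
  size_t       n;       /* number of entries, assumed > 0 */
  SketchFreeFn release; /* frees the current storage */
} SketchVec;

/* Copy the values into storage from another allocator and swap the pointer.
   Returns 0 on success; on allocation failure returns 1 and leaves the
   vector untouched. */
static int SketchVecMigrate(SketchVec *v, SketchAllocFn alloc, SketchFreeFn release)
{
  double *fresh;

  assert(v != NULL && alloc != NULL && release != NULL);
  assert(v->array != NULL && v->n > 0);

  fresh = alloc(v->n * sizeof *fresh);
  if (fresh == NULL) return 1;

  memcpy(fresh, v->array, v->n * sizeof *fresh);
  v->release(v->array);  /* give the old storage back to its own allocator */
  v->array   = fresh;
  v->release = release;
  return 0;
}
```

The virtual address changes, so this only works while no outstanding pointer to the old storage exists, which is the point made above about nothing outside the vector knowing the pointer.<br>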
><br>
><br>
>> On Jun 3, 2015, at 9:18 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
>><br>
>><br>
>>> On Jun 3, 2015, at 8:55 PM, Richard Mills <<a href="mailto:rtm@utk.edu">rtm@utk.edu</a>> wrote:<br>
>>><br>
>>> Ha, yes. I'll try this out, but I do wonder what people's thoughts are on the best way to "tag" an object like a Vec or Mat for some particular treatment of its placement in memory. Does doing this at the level of a Mat or Vec (e.g., VecSetAdvMallocCtx() ) sound appropriate? We could actually make this a part of any PetscObject, but I think that's not necessary.<br>
>><br>
>> No idea.<br>
>><br>
>> Perhaps, and this is just nonsense off the top of my head, you could have some measure of the importance of a vector (or matrix; I would start with vectors for simplicity and since we have more of them) based on how often its values would be "accessed". So a vector that you know is only used "once in a while" gets a lower "importance" than one that gets used "very often". Of course determining these vectors' importances may be difficult. You could do it experimentally: add some code that measures how often each vector gets its values "accessed (whatever that means)/read write" and see if there is some distribution (do this for a nontrivial TS example) where some vectors are accessed often and others rarely. Now place the often "accessed" vectors in faster memory and see how much faster the code is.<br>
>><br>
>> Barry<br>
>><br>
>> A related note is that "we" are not particularly careful about "reusing" work vectors; say a code has ten different work vectors for different phases of the computation; now imagine a careful "global analysis" that determined it could get away with three work vectors (since at most three had relevant values at any one time), now pop those three work vectors into faster memory where the ten previous work vectors could not fit. Obviously I am being extreme here to make a point that careful memory decisions could potentially make a difference in complicated codes (and all we care about are complicated codes).<br>
>><br>
>><br>
>><br>
>><br>
>>><br>
>>> --Richard<br>
>>><br>
>>> On Wed, Jun 3, 2015 at 6:50 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
>>><br>
>>> The beauty of git/bitbucket is one can make branches to try out anything they want even if some cranky old conservative PETSc developer thinks it is worse than consorting with the devil.<br>
>>><br>
>>> As I said before I think that "additional argument" to advised_malloc should be a living object which one can change over time as opposed to just a "flag" type argument that only affects the malloc at malloc time. Of course the "living part" can be implemented later.<br>
>>><br>
>>> Barry<br>
>>><br>
>>> Yes, Jed has already transformed himself into a cranky old conservative PETSc developer<br>
>>><br>
>>><br>
>>>> On Jun 3, 2015, at 7:33 PM, Richard Mills <<a href="mailto:rtm@utk.edu">rtm@utk.edu</a>> wrote:<br>
>>>><br>
>>>> Hi Folks,<br>
>>>><br>
>>>> It's been a while, but I'd like to pick up this discussion of adding a context to memory allocations again.<br>
>>>><br>
>>>> The immediate motivation I have is that I'd like to support use of the memkind library (<a href="https://github.com/memkind/memkind" target="_blank">https://github.com/memkind/memkind</a>), though adding a context to PetscMallocN() (or making some other interface, say PetscAdvMalloc() or whatever) could have much broader utility than simply memkind support (which Jed doesn't like anyway, and I share some of his concerns). For the sake of having a concrete example, I'll discuss memkind here.<br>
>>>><br>
>>>> Memkind's memkind_malloc() works like malloc() but takes a memkind_t argument to specify some desired property of the memory being allocated. For example,<br>
>>>><br>
>>>> hugetlb_str = (char *)memkind_malloc(MEMKIND_HUGETLB, size);<br>
>>>><br>
>>>> returns a pointer to memory allocated using huge pages, and<br>
>>>><br>
>>>> hbw_preferred_str = (char *)memkind_malloc(MEMKIND_HBW_PREFERRED, size);<br>
>>>><br>
>>>> allocates memory from a high-bandwidth region if it's available and elsewhere if not (specifying MEMKIND_HBW will insist on the allocation coming from high-bandwidth memory, failing if it's not available).<br>
>>>><br>
>>>> It should be straightforward to add a variant of PetscMalloc() that accepts a context: I'll call this PetscAdvMalloc(), for now, though we can come up with a better name later. This will allow passing on the memkind_t via this context to the underlying memkind allocator, and we can have some mechanism to set a default context (in the case of Memkind, this is likely MEMKIND_DEFAULT) that gets used when plain PetscMalloc() gets called.<br>
>>>><br>
>>>> Of course, we'll need some way to ensure that the "advanced malloc" gets used to allocate the critical data structures. As a low-level way to start, it may make sense to simply add a way to stash a context in Vec and Mat objects. Maybe have VecSetAdvMallocCtx(), and if that context gets set, then PetscAdvMalloc() is used for the allocations associated with the contents of that object. It would probably be better to eventually have a higher-level way to do this, e.g., support standard settings in the options database that PETSc uses to construct the appropriate arguments to underlying allocators that are supported, but I think just adding a way to set this context directly is an appropriate first step.<br>
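<br>
One possible shape for a context-carrying allocator, sketched under the assumption that the context is a small struct of function pointers plus an opaque hint. All names here (SketchMallocCtx, SketchAdvMalloc(), SketchAdvFree(), SketchMalloc()) are hypothetical and are not existing PETSc or memkind API; standard malloc()/free() stand in for the underlying allocator, and the hint field marks where something like a memkind_t could be carried.<br>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocation context: which allocator to use, plus an opaque
   hint that the allocator interprets (e.g. a kind of memory). */
typedef struct {
  void *(*alloc)(size_t bytes, void *hint);
  void  (*release)(void *ptr, void *hint);
  void   *hint;
} SketchMallocCtx;

/* Default backend: ignore the hint and use the C library allocator. */
static void *sketch_default_alloc(size_t bytes, void *hint)
{
  (void)hint;
  return malloc(bytes);
}

static void sketch_default_release(void *ptr, void *hint)
{
  (void)hint;
  free(ptr);
}

/* Default context, used whenever the caller passes no context. */
static const SketchMallocCtx sketch_default_ctx = {
  sketch_default_alloc, sketch_default_release, NULL
};

/* Analogue of the proposed PetscAdvMalloc(): allocate through a context. */
static void *SketchAdvMalloc(const SketchMallocCtx *ctx, size_t bytes)
{
  if (ctx == NULL) ctx = &sketch_default_ctx;
  assert(ctx->alloc != NULL);
  return ctx->alloc(bytes, ctx->hint);
}

/* Memory must be released through the same context that allocated it. */
static void SketchAdvFree(const SketchMallocCtx *ctx, void *ptr)
{
  if (ctx == NULL) ctx = &sketch_default_ctx;
  assert(ctx->release != NULL);
  ctx->release(ptr, ctx->hint);
}

/* Analogue of plain PetscMalloc(): falls through to the default context. */
static void *SketchMalloc(size_t bytes)
{
  return SketchAdvMalloc(NULL, bytes);
}
```

An object such as a Vec could store a pointer to one of these contexts and pass it on every allocation of its contents, which is roughly what a VecSetAdvMallocCtx() would arrange.<br>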
>>>><br>
>>>> Does this sound like a reasonable thing for me to prototype, or are others thinking something very different? Please let me know. I'm getting more access to early systems I can experiment on, and I'd really like to move forward on trying things with high bandwidth memory (imperfect as our APIs for using it are).<br>
>>>><br>
>>>> Best regards,<br>
>>>> Richard<br>
>>>><br>
>>>><br>
>>>> On Wed, Apr 29, 2015 at 11:10 PM, Richard Mills <<a href="mailto:rtm@utk.edu">rtm@utk.edu</a>> wrote:<br>
>>>> On Wed, Apr 29, 2015 at 1:28 PM, Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>
>>>><br>
>>>> Forget about the issue of "changing" PetscMallocN() or adding a new interface instead, that is a minor syntax and annoyance issue:<br>
>>>><br>
>>>> The question is "is it worth exploring adding a context for certain memory allocations that would allow us to "do" various things to the memory and "indicate" properties of the memory"? I think, though I agree with Jed that it could be fraught with difficulties, that it is worthwhile playing around with this.<br>
>>>><br>
>>>> Barry<br>
>>>><br>
>>>><br>
>>>> I vote "yes". One might want to, say<br>
>>>><br>
>>>> * Give hints via something like madvise() on how/when the memory might be accessed.<br>
>>>> * Specify a preferred "kind" of memory (and behavior if the preferred kind is not available, or perhaps even specify a priority on how hard to try to get the preferred memory kind)<br>
>>>> * Specify something like a preference to interleave allocation blocks between different kinds of memory<br>
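<br>
As a small illustration of the first item, here is a sketch that maps an anonymous region and passes an access hint to the kernel with madvise(2). It assumes a Linux-like system; the helper name sketch_alloc_with_advice() is hypothetical, and only the hint is shown, not memory-kind selection or interleaving.<br>

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise() under strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of anonymous memory and tell the kernel how it is expected to
   be accessed (e.g. MADV_SEQUENTIAL or MADV_RANDOM). Returns NULL on failure.
   The caller releases the region with munmap(ptr, bytes). */
static void *sketch_alloc_with_advice(size_t bytes, int advice)
{
  void *p;

  assert(bytes > 0);
  p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return NULL;

  /* The advice is a hint only; treat a rejected hint as a failed request
     so the caller can fall back to an ordinary allocation. */
  if (madvise(p, bytes, advice) != 0) {
    munmap(p, bytes);
    return NULL;
  }
  return p;
}
```

mmap() is used here because madvise() expects a page-aligned starting address, which an arbitrary malloc() result does not guarantee.<br>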
>>>><br>
>>>> I'm sure we can come up with plenty of other possibilities, some of which might actually be useful, many of which will be useful only for very contrived cases, and some that are not useful today but may become useful as memory systems evolve.<br>
>>>><br>
>>>> --Richard<br>
>>>><br>
>>><br>
>>><br>
>><br>
><br>
<br>
</div></div></blockquote></div><br></div></div></div>