[petsc-dev] cuda failures of tests in master

Dominic Meiser dmeiser at txcorp.com
Sat Aug 8 17:50:35 CDT 2015



On 08/08/2015 12:13 PM, Barry Smith wrote:
>
>     Understood. I found this in the manual page for BJACOBI :-)
>
>     Developer Notes: This preconditioner does not currently work with CUDA/CUSP for a couple of reasons.
>         (1) It creates seq vectors as work vectors that should be cusp
>         (2) The use of VecPlaceArray() is not handled properly by CUSP (that is it will not know where
>             the ownership of the vector is so may use wrong values) even if it did know the ownership
>             it may induce extra copy ups and downs. Satish suggests a VecTransplantArray() to handle two
>             vectors sharing the same pointer and handling the CUSP side as well instead of VecGetArray()/VecPlaceArray().

Note that this comment is obsolete for the single-block case. In that 
case we now use VecGetLocalVector(), etc., which have reasonable ownership 
semantics and handle cache invalidation correctly. I left the 
comment in because I haven't had a chance to address the multi-block 
case yet. I think that with the changes proposed in my email the multi-block 
case should work correctly, albeit not with optimal performance 
due to spurious cache invalidations.
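
For reference, the single-block usage is roughly the sketch below (the names 
ApplyOnLocalPart/xglobal/xlocal are illustrative only, not the actual bjacobi code):

#include <petscvec.h>

/* Expose the local part of a parallel vector as a sequential vector without
   VecPlaceArray(): VecGetLocalVector() keeps the CPU/GPU cache flags of both
   vectors consistent. 'xlocal' is a work Vec whose type and local size match
   'xglobal'. */
PetscErrorCode ApplyOnLocalPart(Vec xglobal,Vec xlocal)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecGetLocalVector(xglobal,xlocal);CHKERRQ(ierr);
  /* ... hand xlocal to the sequential sub-KSP/PC here ... */
  ierr = VecRestoreLocalVector(xglobal,xlocal);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}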

>
>
>> On Aug 8, 2015, at 10:23 AM, Dominic Meiser <dmeiser at txcorp.com> wrote:
>>
>> With the current implementation the following can happen (v is of type VECCUSP):
>> - Originally data on GPU, v.valid_GPU_array == PETSC_CUSP_GPU
>> - a call to VecPlaceArray(v, arr) stashes the host-side array and sets v.valid_GPU_array = PETSC_CUSP_CPU. Note that the GPU data does not get stashed.
>> - subsequent accesses of the GPU data will clobber the data that was there before VecPlaceArray.
>>
>> I think there are two possible solutions:
>> - In VecPlaceArray_SeqCUSP we allocate a new array on the GPU and stash the current values.
>> - We do a GPU->CPU synchronization in VecPlaceArray_SeqCUSP to make sure that the data on the CPU is up to date.
>
>     Please make this change.  I agree that any other change introduces more complexity and likely failure. Once you have made this change you can
> also remove the comments from the BJACOBI manual page.

Sounds good. I'll add it to the fix-ex2_bjacobi pull request.
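
Roughly, the change would look like the sketch below (an illustration of option 
(2) above, not the actual patch; the helper names VecCUSPCopyFromGPU()/
VecPlaceArray_Seq() and the header path are assumptions):

#include <petsc/private/vecimpl.h>  /* assumed location of the Vec internals */

PETSC_EXTERN PetscErrorCode VecCUSPCopyFromGPU(Vec);
PETSC_EXTERN PetscErrorCode VecPlaceArray_Seq(Vec,const PetscScalar*);

PetscErrorCode VecPlaceArray_SeqCUSP(Vec vin,const PetscScalar *a)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecCUSPCopyFromGPU(vin);CHKERRQ(ierr);   /* PCIe transfer only if the CPU copy is stale */
  ierr = VecPlaceArray_Seq(vin,a);CHKERRQ(ierr);  /* stash the host array and point at 'a' */
  vin->valid_GPU_array = PETSC_CUSP_CPU;          /* placed data is only valid on the host */
  PetscFunctionReturn(0);
}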

Cheers,
Dominic

>
>     Thanks for figuring this out,
>
>     Barry
>
>
>>
>> It's a space/time tradeoff. Also, the first option further complicates the caching mechanism. I think the caching mechanism is already too complicated (nearly every bug I encounter with the CUDA stuff is related to caching). The second option allows us to more easily reuse VecPlaceArray_Seq. And we don't have to juggle a GPU unplaced array in addition to the host side unplaced array. I'd therefore propose that we take the hit of a GPU->CPU data synchronization. Note that this synchronization only incurs a PCIe data transfer if the data on the CPU is stale.
>>
>> But all of this is only needed if the semantics of VecPlaceArray/VecResetArray are to preserve the contents of the unplaced array.
>>
>> Cheers,
>> Dominic
>>
>> On 08/07/2015 09:23 PM, Barry Smith wrote:
>>>
>>>> On Aug 7, 2015, at 7:50 PM, Dominic Meiser <dmeiser at txcorp.com> wrote:
>>>>
>>>> FYI I've opened a pull request that addresses this issue.
>>>>
>>>> While going through the code I ran into a question regarding the semantics of VecPlaceArray and VecResetArray: Are the contents of the "unplaced" array supposed to be preserved so that the vector is completely restored upon calling VecResetArray? With the current implementation of VecPlaceArray_SeqCUSP and VecResetArray_SeqCUSP a situation can occur where the contents of the unplaced array get clobbered. Does this need to be fixed?
>>>
>>>     Hmm,  I think I always assumed the "unplaced" array was inaccessible during the time it is unplaced (since there is no public pointer to the array), so the values there shouldn't change. What are the exact details of how they can get changed?
>>>
>>>     Thanks
>>>
>>>     Barry
>>>
>>>>
>>>> Cheers,
>>>> Dominic
>>>>
>>>>
>>>> On 08/07/2015 02:42 PM, Barry Smith wrote:
>>>>>
>>>>>    Guardians of CUDA/GPUs
>>>>>
>>>>> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2015/08/06/examples_master_arch-cuda-double_bb-proxy.log
>>>>> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2015/08/06/examples_master_arch-cuda_bb-proxy.log
>>>>>
>>>>> search for ex2_bjacobi
>>>>>
>>>>> Note that this example does not fail in non-CUDA builds.
>>>>>
>>>>> For some reason the iterative solver thinks it converges in 0 iterations but the answer is completely wrong.
>>>>>
>>>>>    Barry
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>

-- 
Dominic Meiser
Tech-X Corporation
5621 Arapahoe Avenue
Boulder, CO 80303
USA
Telephone: 303-996-2036
Fax: 303-448-7756
www.txcorp.com


