[petsc-dev] [petsc-users] CUDA MatSetValues test
Stefano Zampini
stefano.zampini at gmail.com
Wed Jun 2 02:06:12 CDT 2021
Barry
If I remember correctly, the ctable business was a limitation of Mark's
initial implementation, which I removed here
https://gitlab.com/petsc/petsc/-/merge_requests/3411/diffs?commit_id=36173707431f5583889720868beae9046e6e1fd2.
Here colmap was "int" already. We made a mistake letting this branch fall
so far behind main. I will add my observations in the MR.
On Wed, Jun 2, 2021 at 09:20 Barry Smith <bsmith at petsc.dev>
wrote:
>
> Switching the thread to petsc-dev since users don't need this.
>
> Had a nasty failure on one of the CI jobs: it turned out the GPU colmap
> was declared as PetscInt* but was copied down from the CPU into GPU
> memory as int. The power of the void* interface. Thus ex5cu.cu, which was
> never CI tested, was crashing very strangely. By declaring it properly as
> int* I got the example to run with 64-bit indices. Submitted a new pipeline
> to see what else pops up.
>
> I am a bit worried about the csr2csc data structures and code: the line
> declaring csr2csc_i inside the struct disappeared in the rebase on main; I
> stuck it back in but don't understand that code at all. I'll report further
> when the CI is done or crashes. I will also need to make sure it is
> compiled both pre and post PETSC_PKG_CUDA_VERSION_GE(11,0,0).
>
> I will also investigate this CTABLE business. The GPU doesn't need to
> know or care whether the CPU is using a CTABLE; it just needs the colmap
> array format, which it gets from spreading out the garray anyway before
> copying to the GPU. So PETSc can likely be built with CTABLES as normal,
> which will improve the testing a lot.
>
>
>
> Barry
>
>
> On May 30, 2021, at 11:54 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
> I believe I have finally successfully rebased the branch barry/2020-11-11/cleanup-matsetvaluesdevice
> against main and cleaned up all the issues. Please read the commit message.
>
> I have submitted a CI pipeline with ctables turned off temporarily for
> testing of MatSetValuesDevice(). If it works, hopefully Mark can run a few
> additional tests of his Landau code that are not in the usual testing to
> verify it, and we can finally get the branch into main.
>
> Mark,
>
> Since this change is involved, your Landau mass matrix branch will
> likely not rebase cleanly. Let me know if you would like me to do the
> rebase and testing of your Landau mass matrix branch. I can get it ready to
> work with the results of barry/2020-11-11/cleanup-matsetvaluesdevice and
> then hand it back to you for further development.
>
> Barry
>
>
> On May 29, 2021, at 11:32 AM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
> I am working away on this branch, making some progress and also cleaning
> things up with some small simplifications. I hope I can succeed: a bunch of
> stuff got moved around and some structs changed, and the merge could not
> handle some of this, so I have to do a good amount of code wrangling to fix
> it.
>
> I'll let you know as I progress.
>
> Barry
>
>
> On May 28, 2021, at 10:53 PM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
> I have rebased and tried to fix everything. I am now fixing the issues
> with --download-openmpi and cuda; once that is done I will test, rebase
> with main again if needed, restart the MR, and get it into main.
>
> Barry
>
> I was stupid to let the MR lie fallow; I should have figured out a
> solution to the openmpi and cuda issue instead of punting and waiting for a
> dream fix.
>
>
>
> On May 28, 2021, at 2:39 PM, Mark Adams <mfadams at lbl.gov> wrote:
>
> Thanks,
>
> I did not intend to make any (real) changes.
> The only conflicting piece I did not intend to take from Barry's branch
> was the help and comment block at the top of ex5cu.cu.
>
> * I ended up with two declarations of PetscSplitCSRDataStructure
> * I added some includes to fix errors like this:
> /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263):
> error: incomplete type is not allowed
> * I ended up not having csr2csc_i in Mat_SeqAIJCUSPARSE, so I get:
> /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member
> "csr2csc_i"
>
>
>
>
> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <stefano.zampini at gmail.com>
> wrote:
>
>> I can take a quick look at it tomorrow. What are the main changes you
>> have made since then?
>>
>> On May 28, 2021, at 9:51 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> I am getting messed up trying to resolve conflicts while rebasing over
>> main.
>> Is there a better way of doing this?
>> Can I just tell git to use Barry's version and then test it?
>> Or should I just try it again?
>>
>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> I am rebasing over main and it's a bit of a mess. I must have missed
>>> something. I get this; I think the _n_SplitCSRMat must be wrong.
>>>
>>>
>>> In file included from
>>> /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting
>>> types for 'PetscSplitCSRDataStructure'
>>> typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>> ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous
>>> declaration of 'PetscSplitCSRDataStructure' was here
>>> typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>> ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>> CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
>>>
>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini <
>>> stefano.zampini at gmail.com> wrote:
>>>
>>>> OpenMPI.py depends on cuda.py in that, if cuda is present, it
>>>> configures using cuda. MPI.py and MPICH.py do not depend on cuda.py
>>>> (MPICH only weakly: it adds a print if cuda is present).
>>>> Since eventually the MPI distro will only need a hint to be configured
>>>> with CUDA, why not remove the dependency altogether and add only a flag
>>>> --download-openmpi-use-cuda?
>>>>
>>>> On May 28, 2021, at 8:44 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>
>>>> Stefano, who has a far better memory than me, wrote
>>>>
>>>> > Or probably remove --download-openmpi? Or, just for the moment, why
>>>> can’t we just tell configure that mpi is a weak dependence of cuda.py, so
>>>> that it will be forced to be configured later?
>>>>
>>>> MPI.py depends on cuda.py, so we cannot also have cuda.py depend on
>>>> MPI.py using the generic dependencies of configure/packages.
>>>>
>>>> but perhaps we can just hardwire the rerunning of cuda.py when the
>>>> MPI compilers are reset. I will try that now, and if I can get it to
>>>> work we should be able to move those old fix branches along as MRs.
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On May 28, 2021, at 12:41 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>> OK, I will try to rebase and test Barry's branch.
>>>>
>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini <
>>>> stefano.zampini at gmail.com> wrote:
>>>>
>>>>> Yes, it is the branch I was using before force pushing to
>>>>> Barry’s barry/2020-11-11/cleanup-matsetvaluesdevice
>>>>> You can use both I guess
>>>>>
>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>> Is this the correct branch? It conflicted with ex5cu so I assume it is.
>>>>>
>>>>>
>>>>> stefanozampini/simplify-setvalues-device
>>>>> <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>
>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>
>>>>>> I am fixing up the rebase of this branch over main.
>>>>>>
>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini <
>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>
>>>>>>> Or probably remove --download-openmpi? Or, just for the moment, why
>>>>>>> can’t we just tell configure that mpi is a weak dependence of cuda.py, so
>>>>>>> that it will be forced to be configured later?
>>>>>>>
>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini <
>>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>>
>>>>>>> That branch provides a fix for MatSetValuesDevice, but it never got
>>>>>>> merged because of the CI issues with --download-openmpi. We can
>>>>>>> probably try to skip the test in that specific configuration?
>>>>>>>
>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>>
>>>>>>>
>>>>>>> ~/petsc/src/mat/tutorials*
>>>>>>> (barry/2021-05-28/robustify-cuda-gencodearch-check=)*
>>>>>>> arch-robustify-cuda-gencodearch-check
>>>>>>> $ ./ex5cu
>>>>>>> terminate called after throwing an instance of
>>>>>>> 'thrust::system::system_error'
>>>>>>> what(): fill_n: failed to synchronize: cudaErrorIllegalAddress:
>>>>>>> an illegal memory access was encountered
>>>>>>> Aborted (core dumped)
>>>>>>>
>>>>>>> requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>
>>>>>>> CI does not test the combination of CUDA with ctables turned off.
>>>>>>> The code is still broken, as it was six months ago in the discussion
>>>>>>> Stefano pointed to. It is clear why; it is just that no one has had
>>>>>>> the time to clean things up.
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>
>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <
>>>>>>> stefano.zampini at gmail.com> wrote:
>>>>>>>
>>>>>>>> If you are referring to your device set values, I guess it is not
>>>>>>>> currently tested
>>>>>>>>
>>>>>>>
>>>>>>> No. There is a test for that (ex5cu).
>>>>>>> I have a user who is getting a segv in MatSetValues with
>>>>>>> aijcusparse. I suspect there is memory corruption but I'm trying to cover
>>>>>>> all the bases.
>>>>>>> I have added a cuda test to ksp/ex56 that works. I can do an MR for
>>>>>>> it if such a test does not exist.
>>>>>>>
>>>>>>>
>>>>>>>> See the discussions here
>>>>>>>> https://gitlab.com/petsc/petsc/-/merge_requests/3411
>>>>>>>> I started cleaning up the code to prepare for testing but we never
>>>>>>>> finished it
>>>>>>>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>
>>>>>>>>
>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>>
>>>>>>>> Is there a test with MatSetValues and CUDA?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>
>
>
>
>
--
Stefano