[petsc-users] hypre / hip usage
Mark Adams
mfadams at lbl.gov
Mon Jan 24 08:24:07 CST 2022
What is the fastest way to rebuild hypre? Reconfiguring did not work and is
slow.
I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT (no
debuggers other than valgrind on Crusher??!?!) and I get to a hypre call:
PetscStackCallStandard(HYPRE_IJMatrixAddToValues,(hA->ij,1,&hnc,(HYPRE_BigInt*)(rows+i),(HYPRE_BigInt*)cscr[0],sscr));
This is from DMPlexComputeJacobian_Internal and MatSetClosure.
HYPRE_IJMatrixAddToValues is successfully called in earlier parts of the
run.
The args look OK, so I am going into HYPRE_IJMatrixAddToValues.
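For reference, a minimal sketch of the kind of printf check meant by "the args look OK". Everything in it is an assumption for illustration: check_row_args and BigIntSketch are hypothetical names (BigIntSketch stands in for HYPRE_BigInt so the sketch builds without hypre headers), and it assumes the index and value arrays are readable from the host. If they are device pointers, the loop itself would fault.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for HYPRE_BigInt; not the real hypre typedef. */
typedef long long BigIntSketch;

/* Hypothetical helper: print the arguments for one row and count indices
   outside [0, nglobal). Assumes cols and vals are host-readable. */
static int check_row_args(BigIntSketch row, int ncols, const BigIntSketch *cols,
                          const double *vals, BigIntSketch nglobal)
{
  int bad = 0;

  assert(cols != NULL && vals != NULL);
  fprintf(stderr, "row %lld ncols %d cols %p vals %p\n", row, ncols,
          (void *)cols, (void *)vals);
  if (row < 0 || row >= nglobal) bad++;
  for (int j = 0; j < ncols; j++) {
    if (cols[j] < 0 || cols[j] >= nglobal) {
      fprintf(stderr, "  col[%d] = %lld out of range\n", j, cols[j]);
      bad++;
    }
  }
  /* Flush so the output survives an abort in the next call. */
  fflush(stderr);
  return bad;
}
```

Calling something like this just before the failing call would show whether the indices are in range, which narrows the fault down to the pointers themselves or to something inside hypre.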
Thanks,
Mark
On Sun, Jan 23, 2022 at 9:55 AM Mark Adams <mfadams at lbl.gov> wrote:
> Stefano and Matt, This segv looks like a Plexism.
>
> + srun -n1 -N1 --ntasks-per-gpu=1 --gpu-bind=closest ../ex13
> -dm_plex_box_faces 2,2,2 -petscpartitioner_simple_process_grid
> 1,1,1 -dm_plex_box_upper 1,1,1 -petscpartitioner_simple_node_grid 1,1,1
> -dm_refine 2 -dm_view -malloc_debug -log_trace -pc_type hypre -dm_vec_type
> hip -dm_mat_type hypre
> + tee out_001_kokkos_Crusher_2_8_hypre.txt
> [0] 1.293e-06 Event begin: DMPlexSymmetrize
> [0] 8.9463e-05 Event end: DMPlexSymmetrize
> .....
> [0] 0.554529 Event end: VecHIPCopyFrom
> [0] 0.559891 Event begin: DMCreateInterp
> [0] 0.560603 Event begin: DMPlexInterpFE
> [0] 0.566707 Event begin: MatAssemblyBegin
> [0] 0.566962 Event begin: BuildTwoSidedF
> [0] 0.567068 Event begin: BuildTwoSided
> [0] 0.567119 Event end: BuildTwoSided
> [0] 0.567154 Event end: BuildTwoSidedF
> [0] 0.567162 Event end: MatAssemblyBegin
> [0] 0.567164 Event begin: MatAssemblyEnd
> [0] 0.567356 Event end: MatAssemblyEnd
> [0] 0.572884 Event begin: MatAssemblyBegin
> [0] 0.57289 Event end: MatAssemblyBegin
> [0] 0.572892 Event begin: MatAssemblyEnd
> [0] 0.573978 Event end: MatAssemblyEnd
> [0] 0.574428 Event begin: MatZeroEntries
> [0] 0.579998 Event end: MatZeroEntries
> :0:rocdevice.cpp :2589: 257935591316 us: Device::callbackQueue
> aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to
> access an inaccessible address. code: 0x2b
> srun: error: crusher001: task 0: Aborted
> srun: launch/slurm: _step_signal: Terminating StepId=65929.4
> + date
> Sun 23 Jan 2022 09:46:55 AM EST
>
> On Sun, Jan 23, 2022 at 8:15 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Thanks,
>> '-mat_type hypre' was failing for me. I could not find a test that used
>> it and I was not sure it was considered functional.
>> I will look at it again and collect a bug report if needed.
>>
>> On Sat, Jan 22, 2022 at 11:31 AM Stefano Zampini <
>> stefano.zampini at gmail.com> wrote:
>>
>>> Mark
>>>
>>> The two options are only there to test the code in CI and are not
>>> needed in general.
>>>
>>> '--download-hypre-configure-arguments=--enable-unified-memory',
>>> This is only here to test the unified memory code path
>>>
>>> '--with-hypre-gpuarch=gfx90a',
>>> This is not needed if rocminfo is in PATH
>>>
>>> Our interface code with HYPRE GPU works fine for HIP; it is tested in CI.
>>> Assembling with -mat_type hypre does not work for ex19 because ex19 uses
>>> FDColoring. Just assemble in mpiaij format (look at runex19_hypre_hip in
>>> src/snes/tutorials/makefile); the interface code will copy the matrix to
>>> the GPU.
>>>
>>> On Fri, Jan 21, 2022 at 7:24 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Jan 21, 2022 at 11:14 AM Jed Brown <jed at jedbrown.org> wrote:
>>>>
>>>>> "Paul T. Bauman" <ptbauman at gmail.com> writes:
>>>>>
>>>>> > On Fri, Jan 21, 2022 at 8:52 AM Paul T. Bauman <ptbauman at gmail.com> wrote:
>>>>> >> Yes. The way HYPRE's memory model is set up is that ALL GPU allocations are
>>>>> >> "native" (i.e. [cuda,hip]Malloc) or, if unified memory is enabled, then ALL
>>>>> >> GPU allocations are unified memory (i.e. [cuda,hip]MallocManaged).
>>>>> >> Regarding HIP, there is an HMM implementation of hipMallocManaged planned,
>>>>> >> but it is not yet delivered AFAIK (and it will *not* support gfx906, e.g.
>>>>> >> RVII, FYI), so, today, under the covers, hipMallocManaged is calling
>>>>> >> hipHostMalloc. So, today, all your unified memory allocations in HYPRE on
>>>>> >> HIP are doing CPU-pinned memory accesses. And performance is just truly
>>>>> >> terrible (as you might expect).
>>>>>
>>>>> Thanks for this important bit of information.
>>>>>
>>>>> And it sounds like when we add support to hand off Kokkos matrices and
>>>>> vectors (our current support for matrices on ROCm devices uses Kokkos) or
>>>>> add direct support for hipSparse, we'll avoid touching host memory in
>>>>> assembly-to-solve with hypre.
>>>>>
>>>>
>>>> It does not look like anyone has made Hypre work with HIP. Stefano
>>>> added a runex19_hypre_hip target 4 months ago and hypre.py has some HIP
>>>> things.
>>>>
>>>> I have a user who would like to try this. No hurry, but can I get an
>>>> idea of the plan for this?
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>
>>>
>>>
>>> --
>>> Stefano
>>>
>>