<div dir="ltr"><div><div>The cuSPARSE matrix triple product takes a lot of memory. We usually use Kokkos, configured with TPL turned off.</div><div></div></div><div><br></div>If you have a complex problem, different parts of the domain can coarsen at different rates.<div>Jacobi instead of asm will save a fair amount of memory. </div><div>If you run with -ksp_view you will see the operator/matrix complexity from GAMG. These should be &lt; 1.5.</div><div><br></div><div>Mark</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 18, 2023 at 3:42 PM Mark Lohry &lt;<a href="mailto:mlohry@gmail.com">mlohry@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>With asm I see a range of 8GB-13GB, a slightly smaller ratio, but that probably explains it. (Does this still seem like a lot of memory to you for the problem size?)<br></div><div><br></div><div>In general I don't have the same number of blocks per row, so I suppose it makes sense that there's some memory imbalance.
<br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 18, 2023 at 3:35 PM Mark Adams &lt;<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Can your problem have load imbalance?<div><br></div><div>You might try '-pc_type asm' (and/or jacobi) to see your baseline load imbalance.</div><div>GAMG can add some load imbalance, but start by getting a baseline.</div><div><br></div><div>Mark</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 18, 2023 at 2:54 PM Mark Lohry &lt;<a href="mailto:mlohry@gmail.com" target="_blank">mlohry@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Q0) Does -memory_view trace GPU memory as well, or is there another method to query the peak device memory allocation?</div><div><br></div><div>Q1) I'm loading an aijcusparse matrix with MatLoad and running with -ksp_type fgmres -pc_type gamg -mg_levels_pc_type asm. The mat info shows 27,142,948 rows and cols, bs=4, total nonzeros 759,709,392. I'm using 8 ranks on 8x80GB GPUs, and during the setup phase, before crashing with CUSPARSE_STATUS_INSUFFICIENT_RESOURCES, nvidia-smi shows the content pasted below.</div><div><br></div><div>GPU memory usage spans 36GB-50GB, but with one rank at 77GB. Is this expected? Do I need to manually repartition this somehow?</div><div><br></div><div>Thanks,</div><div>Mark<br></div><div><br></div><div><br></div><div><p class="MsoNormal">+-----------------------------------------------------------------------------+</p>
<p class="MsoNormal">| Processes: |</p>
<p class="MsoNormal">| GPU GI CI PID Type Process name GPU Memory |</p>
<p class="MsoNormal">| ID ID Usage |</p>
<p class="MsoNormal">|=============================================================================|</p>
<p class="MsoNormal">| 0 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696543 C ./petsc_solver_test 38407MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696544 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696545 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696546 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696548 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696550 C ./petsc_solver_test 471MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696551 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 0 N/A N/A 1696552 C ./petsc_solver_test 467MiB |</p>
<p class="MsoNormal">| 1 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 1 N/A N/A 1696544 C ./petsc_solver_test 35849MiB |</p>
<p class="MsoNormal">| 2 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 2 N/A N/A 1696545 C ./petsc_solver_test 36719MiB |</p>
<p class="MsoNormal">| 3 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 3 N/A N/A 1696546 C ./petsc_solver_test 37343MiB |</p>
<p class="MsoNormal">| 4 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 4 N/A N/A 1696548 C ./petsc_solver_test 36935MiB |</p>
<p class="MsoNormal">| 5 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 5 N/A N/A 1696550 C ./petsc_solver_test 49953MiB |</p>
<p class="MsoNormal">| 6 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 6 N/A N/A 1696551 C ./petsc_solver_test 47693MiB |</p>
<p class="MsoNormal">| 7 N/A N/A 1630309 C nvidia-cuda-mps-server 27MiB |</p>
<p class="MsoNormal">| 7 N/A N/A 1696552 C ./petsc_solver_test 77331MiB |</p>
<p class="MsoNormal">+-----------------------------------------------------------------------------+</p></div></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>
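<div>To put a number on the imbalance discussed in this thread, here is a minimal sketch that computes the max/mean ratio of per-rank device memory from the nvidia-smi listing in the original message. It assumes the largest ./petsc_solver_test entry on each GPU is that rank's main allocation and ignores the small 467-471MiB entries on GPU 0 and the MPS server. The names usage_mib and imbalance_ratio are illustrative only and are not part of PETSc.</div>
<pre>
```python
from statistics import mean

# Per-rank device memory in MiB, copied from the nvidia-smi listing in the
# original message (largest ./petsc_solver_test entry on GPUs 0 through 7).
# Assumption: the small extra entries on GPU 0 are ignored.
usage_mib = [38407, 35849, 36719, 37343, 36935, 49953, 47693, 77331]


def imbalance_ratio(values):
    """Return max/mean of the values; 1.0 means perfectly balanced."""
    return max(values) / mean(values)


ratio = imbalance_ratio(usage_mib)
print(f"mean = {mean(usage_mib):.0f} MiB, "
      f"max = {max(usage_mib)} MiB, max/mean = {ratio:.2f}")
```
</pre>
<div>For these numbers the ratio comes out at roughly 1.7, driven almost entirely by the rank on GPU 7. Running the same calculation on the -pc_type asm or jacobi baseline would show how much of the imbalance comes from the partition itself rather than from GAMG.</div>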