<div dir="ltr">Hi, Macros,<br>  I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.  We recently refactored the COO code and got rid of that function.  So could you try petsc/main?<br>  We map MPI processes to GPUs in a round-robin fashion. We query the number of visible CUDA devices (g), and assign the device (rank%g) to the MPI process (rank).   In that sense, the work distribution is totally determined by your MPI work partition (i.e, yourself). <br>  On clusters, this MPI process to GPU binding is usually done by the job scheduler like slurm.  You need to check your cluster's users' guide to see how to bind MPI processes to GPUs. If the job scheduler has done that, the number of visible CUDA devices to a process might just appear to be 1, making petsc's own mapping void.<br><br>   Thanks.<br>--Junchao Zhang<br><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <<a href="mailto:marcos.vanella@nist.gov">marcos.vanella@nist.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg1722876366411198553">




<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is what I get for the case:</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
terminate called after throwing an instance of 'thrust::system::system_error'
<div>  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</div>
<div><br>
</div>
<div>Program received signal SIGABRT: Process abort signal.</div>
<div><br>
</div>
<div>Backtrace for this error:</div>
<div>#0  0x15264731ead0 in ???</div>
<div>#1  0x15264731dc35 in ???</div>
<div>#2  0x15264711551f in ???</div>
<div>#3  0x152647169a7c in ???</div>
<div>#4  0x152647115475 in ???</div>
<div>#5  0x1526470fb7f2 in ???</div>
<div>#6  0x152647678bbd in ???</div>
<div>#7  0x15264768424b in ???</div>
<div>#8  0x1526476842b6 in ???</div>
<div>#9  0x152647684517 in ???</div>
<div>#10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc</div>
<div>      at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224</div>
<div>#11  0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_</div>
<div>      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316</div>
<div>#12  0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_</div>
<div>      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544</div>
<div>#13  0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_</div>
<div>      at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669</div>
<div>#14  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_</div>
<div>      at /usr/local/cuda/include/thrust/detail/sort.inl:115</div>
<div>#15  0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_</div>
<div>      at /usr/local/cuda/include/thrust/detail/sort.inl:305</div>
<div>#16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic</div>
<div>      at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452</div>
<div>#17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic</div>
<div>      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173</div>
<div>#18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE</div>
<div>      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222</div>
<div>#19  0x55bb468e01cf in MatSetPreallocationCOO</div>
<div>      at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606</div>
<div>#20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND</div>
<div>      at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547</div>
<div>#21  0x55bb469015e5 in MatProductSymbolic</div>
<div>      at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803</div>
<div>#22  0x55bb4694ade2 in MatPtAP</div>
<div>      at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897</div>
<div>#23  0x55bb4696d3ec in MatCoarsenApply_MISK_private</div>
<div>      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283</div>
<div>#24  0x55bb4696eb67 in MatCoarsenApply_MISK</div>
<div>      at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368</div>
<div>#25  0x55bb4695bd91 in MatCoarsenApply</div>
<div>      at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97</div>
<div>#26  0x55bb478294d8 in PCGAMGCoarsen_AGG</div>
<div>      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524</div>
<div>#27  0x55bb471d1cb4 in PCSetUp_GAMG</div>
<div>      at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631</div>
<div>#28  0x55bb464022cf in PCSetUp</div>
<div>      at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994</div>
<div>#29  0x55bb4718b8a7 in KSPSetUp</div>
<div>      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406</div>
<div>#30  0x55bb4718f22e in KSPSolve_Private</div>
<div>      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824</div>
<div>#31  0x55bb47192c0c in KSPSolve</div>
<div>      at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070</div>
<div>#32  0x55bb463efd35 in kspsolve_</div>
<div>      at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320</div>
<div>#33  0x55bb45e94b32 in ???</div>
<div>#34  0x55bb46048044 in ???</div>
<div>#35  0x55bb46052ea1 in ???</div>
<div>#36  0x55bb45ac5f8e in ???</div>
<div>#37  0x1526470fcd8f in ???</div>
<div>#38  0x1526470fce3f in ???</div>
<div>#39  0x55bb45aef55d in ???</div>
<div>#40  0xffffffffffffffff in ???</div>
<div>--------------------------------------------------------------------------</div>
<div>Primary job  terminated normally, but 1 process returned</div>
<div>a non-zero exit code. Per user-direction, the job has been aborted.</div>
<div>--------------------------------------------------------------------------</div>
<div>--------------------------------------------------------------------------</div>
<div>mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).</div>
<div>--------------------------------------------------------------------------</div>
<div><br>
</div>
<div>BTW, I'm curious: if I set up n MPI processes, each building a part of the linear system, and g GPUs, how does PETSc distribute those n pieces of the system matrix and RHS among the g GPUs? Does it use some load-balancing algorithm? Where can I read about this?</div>
<div>Thank you; I can also point you to my code repo on GitHub if you want to take a closer look.</div>
<div><br>
</div>
<div>Best Regards,</div>
<div>Marcos<br>
</div>
<br>
</div>
<div id="m_1722876366411198553appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_1722876366411198553divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>><br>
<b>Sent:</b> Friday, August 11, 2023 10:52 AM<br>
<b>To:</b> Vanella, Marcos (Fed) <<a href="mailto:marcos.vanella@nist.gov" target="_blank">marcos.vanella@nist.gov</a>><br>
<b>Cc:</b> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a> <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>><br>
<b>Subject:</b> Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Hi, Marcos,</div>
<div>  Could you build petsc in debug mode and then copy and paste the whole error stack message?</div>
<div><br>
</div>
   Thanks<br clear="all">
<div>
<div dir="ltr">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div>
<div dir="ltr">On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I get the following error:</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace">terminate called after throwing an instance of 'thrust::system::system_error'</span>
<div><span style="font-family:"Courier New",monospace">  <b>what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</b></span></div>
<div><br>
</div>
<div><span style="font-family:"Courier New",monospace">Program received signal SIGABRT: Process abort signal.</span></div>
<div><br>
</div>
<div><span style="font-family:"Courier New",monospace">Backtrace for this error:</span></div>
<div><span style="font-family:"Courier New",monospace">terminate called after throwing an instance of 'thrust::system::system_error'</span></div>
<div><span style="font-family:"Courier New",monospace">  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</span></div>
<div><br>
</div>
<span style="font-family:"Courier New",monospace">Program received signal SIGABRT: Process abort signal.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">I'm new to submitting jobs in slurm that also use GPU resources, so I might be doing something wrong in my submission script. This is it:</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">#!/bin/bash
<div>#SBATCH -J test</div>
<div>#SBATCH -e /home/Issues/PETSc/test.err</div>
<div>#SBATCH -o /home/Issues/PETSc/test.log</div>
<div>#SBATCH --partition=batch</div>
<div>#SBATCH --ntasks=2</div>
<div>#SBATCH --nodes=1</div>
<div>#SBATCH --cpus-per-task=1</div>
<div>#SBATCH --ntasks-per-node=2</div>
<div>#SBATCH --time=01:00:00</div>
<div>#SBATCH --gres=gpu:1</div>
<div><br>
</div>
<div>export OMP_NUM_THREADS=1</div>
<div>module load cuda/11.5</div>
<div>module load openmpi/4.1.1</div>
<div><br>
</div>
<div>cd /home/Issues/PETSc</div>
<div><b>mpirun -n 2 </b>/home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds
<b>-vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg</b></div>
<br>
</span></div>
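As a sanity check on the GPU binding Junchao mentions in his reply, Slurm additions along these lines can make the rank-to-GPU mapping explicit. Option names and availability vary by Slurm version and site policy, so treat them as assumptions to verify against the cluster's user guide:

```shell
#SBATCH --gres=gpu:1
# Site-dependent alternatives for per-task binding (check `man srun`):
#   srun --gpus-per-task=1 --gpu-bind=closest ./my_app
# Inside the job, confirm what each rank actually sees:
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```

Note that with --ntasks=2 and --gres=gpu:1 as in the script above, one GPU is allocated for the whole node, so both ranks end up sharing that single device.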
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">If anyone has any suggestions on how o troubleshoot this please let me know.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Thanks!</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">Marcos<br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"><br>
</span></div>
</div>
</div>
</blockquote>
</div>
</div>
</div>

</div></blockquote></div>