<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Junchao, thank you for replying. I compiled PETSc in debug mode, and this is the error stack I get for this case:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
terminate called after throwing an instance of 'thrust::system::system_error'
<div class="ContentPasted0"> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</div>
<div><br class="ContentPasted0">
</div>
<div class="ContentPasted0">Program received signal SIGABRT: Process abort signal.</div>
<div><br class="ContentPasted0">
</div>
<div class="ContentPasted0">Backtrace for this error:</div>
<div class="ContentPasted0">#0 0x15264731ead0 in ???</div>
<div class="ContentPasted0">#1 0x15264731dc35 in ???</div>
<div class="ContentPasted0">#2 0x15264711551f in ???</div>
<div class="ContentPasted0">#3 0x152647169a7c in ???</div>
<div class="ContentPasted0">#4 0x152647115475 in ???</div>
<div class="ContentPasted0">#5 0x1526470fb7f2 in ???</div>
<div class="ContentPasted0">#6 0x152647678bbd in ???</div>
<div class="ContentPasted0">#7 0x15264768424b in ???</div>
<div class="ContentPasted0">#8 0x1526476842b6 in ???</div>
<div class="ContentPasted0">#9 0x152647684517 in ???</div>
<div class="ContentPasted0">#10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224</div>
<div class="ContentPasted0">#11 0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316</div>
<div class="ContentPasted0">#12 0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544</div>
<div class="ContentPasted0">#13 0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669</div>
<div class="ContentPasted0">#14 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/detail/sort.inl:115</div>
<div class="ContentPasted0">#15 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_</div>
<div class="ContentPasted0"> at /usr/local/cuda/include/thrust/detail/sort.inl:305</div>
<div class="ContentPasted0">#16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452</div>
<div class="ContentPasted0">#17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173</div>
<div class="ContentPasted0">#18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222</div>
<div class="ContentPasted0">#19 0x55bb468e01cf in MatSetPreallocationCOO</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606</div>
<div class="ContentPasted0">#20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547</div>
<div class="ContentPasted0">#21 0x55bb469015e5 in MatProductSymbolic</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803</div>
<div class="ContentPasted0">#22 0x55bb4694ade2 in MatPtAP</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897</div>
<div class="ContentPasted0">#23 0x55bb4696d3ec in MatCoarsenApply_MISK_private</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283</div>
<div class="ContentPasted0">#24 0x55bb4696eb67 in MatCoarsenApply_MISK</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368</div>
<div class="ContentPasted0">#25 0x55bb4695bd91 in MatCoarsenApply</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97</div>
<div class="ContentPasted0">#26 0x55bb478294d8 in PCGAMGCoarsen_AGG</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524</div>
<div class="ContentPasted0">#27 0x55bb471d1cb4 in PCSetUp_GAMG</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631</div>
<div class="ContentPasted0">#28 0x55bb464022cf in PCSetUp</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994</div>
<div class="ContentPasted0">#29 0x55bb4718b8a7 in KSPSetUp</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406</div>
<div class="ContentPasted0">#30 0x55bb4718f22e in KSPSolve_Private</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824</div>
<div class="ContentPasted0">#31 0x55bb47192c0c in KSPSolve</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070</div>
<div class="ContentPasted0">#32 0x55bb463efd35 in kspsolve_</div>
<div class="ContentPasted0"> at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320</div>
<div class="ContentPasted0">#33 0x55bb45e94b32 in ???</div>
<div class="ContentPasted0">#34 0x55bb46048044 in ???</div>
<div class="ContentPasted0">#35 0x55bb46052ea1 in ???</div>
<div class="ContentPasted0">#36 0x55bb45ac5f8e in ???</div>
<div class="ContentPasted0">#37 0x1526470fcd8f in ???</div>
<div class="ContentPasted0">#38 0x1526470fce3f in ???</div>
<div class="ContentPasted0">#39 0x55bb45aef55d in ???</div>
<div class="ContentPasted0">#40 0xffffffffffffffff in ???</div>
<div class="ContentPasted0">--------------------------------------------------------------------------</div>
<div class="ContentPasted0">Primary job terminated normally, but 1 process returned</div>
<div class="ContentPasted0">a non-zero exit code. Per user-direction, the job has been aborted.</div>
<div class="ContentPasted0">--------------------------------------------------------------------------</div>
<div class="ContentPasted0">--------------------------------------------------------------------------</div>
<div class="ContentPasted0">mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).</div>
<div class="ContentPasted0">--------------------------------------------------------------------------</div>
<div class="ContentPasted0"><br>
</div>
<div class="ContentPasted0">BTW, I'm curious: if I run n MPI processes, each building a part of the linear system, and have g GPUs available, how does PETSc distribute those n pieces of the system matrix and RHS across the g GPUs? Does it apply some load-balancing algorithm?
Where can I read about this?</div>
<div class="ContentPasted0">Thank you. I can also point you to my code repo on GitHub if you want to take a closer look.</div>
<div class="ContentPasted0"><br>
</div>
<div class="ContentPasted0">Best Regards,</div>
<div class="ContentPasted0">Marcos<br>
</div>
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Junchao Zhang &lt;junchao.zhang@gmail.com&gt;<br>
<b>Sent:</b> Friday, August 11, 2023 10:52 AM<br>
<b>To:</b> Vanella, Marcos (Fed) &lt;marcos.vanella@nist.gov&gt;<br>
<b>Cc:</b> petsc-users@mcs.anl.gov &lt;petsc-users@mcs.anl.gov&gt;<br>
<b>Subject:</b> Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Hi, Marcos,</div>
<div> Could you build PETSc in debug mode and then copy and paste the whole error stack message?</div>
<div><br>
</div>
Thanks<br clear="all">
<div>
<div dir="ltr" class="x_gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
</div>
<br>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users &lt;<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>&gt; wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px 0.8ex; border-left:1px solid rgb(204,204,204); padding-left:1ex">
<div class="x_msg-8989966265154195036">
<div dir="ltr">
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Hi, I'm trying to run a parallel matrix and vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled Open MPI, and GCC 9.3. When I run
the job with the GPU enabled, I get the following error:</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:'Courier New',monospace">terminate called after throwing an instance of 'thrust::system::system_error'</span>
<div><span style="font-family:'Courier New',monospace"> <b>what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</b></span></div>
<div><br>
</div>
<div><span style="font-family:'Courier New',monospace">Program received signal SIGABRT: Process abort signal.</span></div>
<div><br>
</div>
<div><span style="font-family:'Courier New',monospace">Backtrace for this error:</span></div>
<div><span style="font-family:'Courier New',monospace">terminate called after throwing an instance of 'thrust::system::system_error'</span></div>
<div><span style="font-family:'Courier New',monospace"> what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered</span></div>
<div><br>
</div>
<span style="font-family:'Courier New',monospace">Program received signal SIGABRT: Process abort signal.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:'Courier New',monospace"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">I'm new to submitting Slurm jobs that also use GPU resources, so I might be doing something wrong in my submission script. This is it:</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"><br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">#!/bin/bash
<div>#SBATCH -J test</div>
<div>#SBATCH -e /home/Issues/PETSc/test.err</div>
<div>#SBATCH -o /home/Issues/PETSc/test.log</div>
<div>#SBATCH --partition=batch</div>
<div>#SBATCH --ntasks=2</div>
<div>#SBATCH --nodes=1</div>
<div>#SBATCH --cpus-per-task=1</div>
<div>#SBATCH --ntasks-per-node=2</div>
<div>#SBATCH --time=01:00:00</div>
<div>#SBATCH --gres=gpu:1</div>
<div><br>
</div>
<div>export OMP_NUM_THREADS=1</div>
<div>module load cuda/11.5</div>
<div>module load openmpi/4.1.1</div>
<div><br>
</div>
<div>cd /home/Issues/PETSc</div>
<div><b>mpirun -n 2 </b>/home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds
<b>-vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg</b></div>
<br>
</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">If anyone has any suggestions on how to troubleshoot this, please let me know.</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">Thanks!</span></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">Marcos<br>
</span></div>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>