[petsc-dev] [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Barry Smith bsmith at petsc.dev
Fri Aug 11 10:47:51 CDT 2023


  Should a default build of PETSc configure both with and without debugging and compile both sets of libraries? It would increase the initial build time for people but simplify life.
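
  As a rough sketch of what this looks like today when done by hand (the arch names here are arbitrary, and --with-cuda=1 is shown only because this thread concerns a CUDA build), one runs configure and make twice with different PETSC_ARCH values:

    # optimized (no-debug) build
    ./configure PETSC_ARCH=arch-opt --with-debugging=0 --with-cuda=1
    make PETSC_DIR=$PWD PETSC_ARCH=arch-opt all
    # debugging build, kept alongside under a second PETSC_ARCH
    ./configure PETSC_ARCH=arch-debug --with-debugging=1 --with-cuda=1
    make PETSC_DIR=$PWD PETSC_ARCH=arch-debug all

  The two library sets then coexist, and an application selects one at build time by pointing PETSC_ARCH at arch-opt or arch-debug.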





> On Aug 11, 2023, at 10:52 AM, Junchao Zhang <junchao.zhang at gmail.com> wrote:
> 
> Hi, Marcos,
>   Could you build petsc in debug mode and then copy and paste the whole error stack message?
> 
>    Thanks
> --Junchao Zhang
> 
> 
> On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users at mcs.anl.gov> wrote:
>> Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I have verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and GCC 9.3. When I run the job with the GPU enabled I get the following error:
>> 
>> terminate called after throwing an instance of 'thrust::system::system_error'
>>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>> 
>> Program received signal SIGABRT: Process abort signal.
>> 
>> Backtrace for this error:
>> terminate called after throwing an instance of 'thrust::system::system_error'
>>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>> 
>> Program received signal SIGABRT: Process abort signal.
>> 
>> I'm new to submitting jobs in Slurm that also use GPU resources, so I might be doing something wrong in my submission script. Here it is:
>> 
>> #!/bin/bash
>> #SBATCH -J test
>> #SBATCH -e /home/Issues/PETSc/test.err
>> #SBATCH -o /home/Issues/PETSc/test.log
>> #SBATCH --partition=batch
>> #SBATCH --ntasks=2
>> #SBATCH --nodes=1
>> #SBATCH --cpus-per-task=1
>> #SBATCH --ntasks-per-node=2
>> #SBATCH --time=01:00:00
>> #SBATCH --gres=gpu:1
>> 
>> export OMP_NUM_THREADS=1
>> module load cuda/11.5
>> module load openmpi/4.1.1
>> 
>> cd /home/Issues/PETSc
>> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
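>> 
>> As a sanity check (assuming the goal is CUDA-aware Open MPI driving the single allocated GPU), one could also confirm from within the same job environment that the MPI build reports CUDA support and that the GPU is visible, and optionally rerun the failing case under NVIDIA's compute-sanitizer to localize the illegal memory access, e.g.:
>> 
>> # was Open MPI built with CUDA support?
>> ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
>> # is the allocated GPU visible inside the job?
>> nvidia-smi
>> # optionally, rerun under the CUDA memory checker to pinpoint the illegal access
>> mpirun -n 2 compute-sanitizer /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg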
>> 
>> If anyone has any suggestions on how to troubleshoot this, please let me know.
>> Thanks!
>> Marcos
>> 
>> 
>> 
