[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Junchao Zhang junchao.zhang at gmail.com
Fri Aug 11 09:52:09 CDT 2023


Hi, Marcos,
  Could you build PETSc in debug mode, rerun the job, and then copy and
paste the whole error stack message?
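
For reference, a minimal sketch of what a debug build could look like. It
is a dry run that only prints the commands. The PETSC_DIR, PETSC_ARCH and
MPI_DIR values are placeholders (assumptions, not taken from this thread);
--with-debugging, --with-cuda and --with-mpi-dir are standard PETSc
configure options, and any other options from the existing build would
need to be carried over.

```shell
#!/bin/bash
# Dry run: prints the configure/build commands instead of executing them.
# All three values below are placeholders (assumptions); adjust to the system.
PETSC_DIR="${PETSC_DIR:-$HOME/petsc}"
PETSC_ARCH="arch-linux-cuda-debug"
MPI_DIR="${MPI_DIR:-/path/to/openmpi}"

# --with-debugging=1 turns on PETSc's error checking and full error stack traces.
CONFIGURE_CMD="./configure PETSC_ARCH=${PETSC_ARCH} --with-debugging=1 --with-cuda=1 --with-mpi-dir=${MPI_DIR}"
BUILD_CMD="make PETSC_DIR=${PETSC_DIR} PETSC_ARCH=${PETSC_ARCH} all"

echo "cd ${PETSC_DIR}"
echo "${CONFIGURE_CMD}"
echo "${BUILD_CMD}"
```

After running the printed commands, the application has to be relinked
against the new PETSC_ARCH before rerunning the job.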

   Thanks
--Junchao Zhang


On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Hi, I'm trying to run a parallel matrix-vector build and linear solve
> with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix
> build and solve are successful on CPUs only. I'm using CUDA 11.5, a
> CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled
> I get the following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress:
> an illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting Slurm jobs that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds \
>   -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this, please let me
> know.
> Thanks!
> Marcos

