[petsc-users] Frontier

Mark Adams mfadams at lbl.gov
Mon Feb 27 01:46:24 CST 2023


Treb, putting this on the list.

Treb has ECP early access to Frontier and has some problems:

** First, he gets an error from hypre:

[0]PETSC ERROR: #1 VecGetArrayForHYPRE() at
/gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/vec/vec/impls/hypre/vhyp.c:95

We had another stack trace, which I cannot find now, that came from a Vec
routine (a copy to the device, as I recall).

** The hypre folks could not do much with that, so I suggested using
aijhipsparse, and he got the error message below.

Looks like just a segv in MatAssemblyEnd_SeqAIJ.
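
For reference, the aijhipsparse runtime options in question (written in the
same style as the .petscrc that appears at the end of this thread) would be
along these lines:

-mat_type aijhipsparse
-proj_mac_mat_type aijhipsparse
-options_left

The first form sets the matrix type globally, the second only for the
prefixed proj_mac solve; -options_left reports which of the two was actually
picked up.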

Treb,
1) This error might be reproducible on one process. Could you try to scale
the problem down?
2) I assume this was built with --with-debugging=1 (a sketch of a debug
reconfigure follows this list).
3) If you can get it to fail on one process, you might be able to get a good
stack trace with line numbers from a debugger. GDB is available (on
Crusher), but you need to do a few things to use it.
4) You might see if you can get some help from AMD.
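
As a rough sketch, a debug build is just the existing configure script with
debugging switched on, i.e. something like the following (modeled on the
Crusher configure scripts further down this thread; the prefix, the LIBS/gtl
line, and the module versions are site specific and will need adjusting):

#!/usr/bin/python3
if __name__ == '__main__':
  import sys
  import os
  sys.path.insert(0, os.path.abspath('config'))
  import configure
  configure_options = [
    '--download-hypre',
    '--download-hypre-configure-arguments=--enable-bigint=no --enable-mixedint=yes',
    # debug build so the traceback carries usable line numbers
    '--with-debugging=1',
    '--with-64-bit-indices=1',
    '--with-cc=cc',
    '--with-cxx=CC',
    '--with-fc=ftn',
    '--with-hip',
    '--with-hipc=hipcc',
    '--with-mpiexec=srun',
    # plus the site-specific LIBS / mpi_gtl_hsa line from the existing script
  ]
  configure.petsc_configure(configure_options)

With that, a one-process run with -start_in_debugger (or
-on_error_attach_debugger) should point at the faulting line in
MatAssemblyEnd_SeqAIJ.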

Thanks,
Mark





On Mon, Feb 27, 2023 at 4:46 AM David Trebotich <dptrebotich at lbl.gov> wrote:

> Hey Mark-
> This is a new issue that doesn't seem to be hypre. It's not using the
> -mat_type aijhipsparse in this run. Can you interpret these petsc errors?
> Seems like it's just crashing. It wasn't doing this last night, though I was
> using fewer nodes.
> [39872]PETSC ERROR:
> ------------------------------------------------------------------------
> [39872]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
> [38267] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1
>   -is_view ascii[:[filename][:[format][:append]]]: Prints object to stdout
> or ASCII file (PetscOptionsGetViewer)
> [42800]PETSC ERROR:
> ------------------------------------------------------------------------
> [38266] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [17088]PETSC ERROR:
> ------------------------------------------------------------------------
> [17088]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
> ----------------------------------------
> Viewer (-is_view) options:
> [41728]PETSC ERROR:
> ------------------------------------------------------------------------
> [38267] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [10256]PETSC ERROR:
> ------------------------------------------------------------------------
> [10256]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
>   -is_view draw[:[drawtype][:filename|format]] Draws object
> (PetscOptionsGetViewer)
> [56]PETSC ERROR:
> ------------------------------------------------------------------------
> [38270] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [41128]PETSC ERROR:
> ------------------------------------------------------------------------
>   -is_view binary[:[filename][:[format][:append]]]: Saves object to a
> binary file (PetscOptionsGetViewer)
> [42496]PETSC ERROR:
> ------------------------------------------------------------------------
> [42496]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
> [38265] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 32768; storage
> space: 0 unneeded,378944 used
> [10256]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
>   -is_view ascii[:[filename][:[format][:append]]]: Prints object to stdout
> or ASCII file (PetscOptionsGetViewer)
> [24]PETSC ERROR:
> ------------------------------------------------------------------------
> [24]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [38268] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 32768; storage
> space: 0 unneeded,378944 used
> [4128]PETSC ERROR:
> ------------------------------------------------------------------------
>   -is_view socket[:port]: Pushes object to a Unix socket
> (PetscOptionsGetViewer)
> [60]PETSC ERROR:
> ------------------------------------------------------------------------
> [60]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [38269] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 32768; storage
> space: 0 unneeded,378944 used
> [10260]PETSC ERROR:
> ------------------------------------------------------------------------
>   -is_view draw[:[drawtype][:filename|format]] Draws object
> (PetscOptionsGetViewer)
> [28]PETSC ERROR:
> ------------------------------------------------------------------------
> [28]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [38265] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [4132]PETSC ERROR:
> ------------------------------------------------------------------------
>   -is_view binary[:[filename][:[format][:append]]]: Saves object to a
> binary file (PetscOptionsGetViewer)
> [56]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
> [38268] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> MPICH ERROR [Rank 10260] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier01491] - Abort(59) (rank 10260 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process\
>  10260
>
> aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 10260
>   -is_view draw[:[drawtype][:filename|format]] Draws object
> (PetscOptionsGetViewer)
> [28]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [38268] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 196
> [4132]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
>   -is_view saws[:communicatorname]: Publishes object to SAWs
> (PetscOptionsGetViewer)
> [60]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [38269] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 196
> [24]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
>   -is_view socket[:port]: Pushes object to a Unix socket
> (PetscOptionsGetViewer)
> [4128]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> [38271] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> [56]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
>   -is_view saws[:communicatorname]: Publishes object to SAWs
> (PetscOptionsGetViewer)
> [13736]PETSC ERROR:
> ------------------------------------------------------------------------
> [13736]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
> [38265] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [28]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [6801] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> [4132]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> [38268] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [60]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [6800] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> [13740]PETSC ERROR:
> ------------------------------------------------------------------------
> [13740]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> the batch system) has told this process to end
> [38264] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 32768; storage
> space: 0 unneeded,378944 used
> [24]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [6804] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> [4128]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [38266] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> [56]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [6805] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> [13736]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> [38269] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [28]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [6802] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 0; storage
> space: 0 unneeded,0 used
> [4132]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [38264] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [60]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [6806] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 0; storage
> space: 0 unneeded,0 used
> [13740]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> [38267] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> [24]PETSC ERROR: to get more information on the crash.
> [6802] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [4128]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [38270] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> [56]PETSC ERROR: to get more information on the crash.
> [6803] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 0; storage
> space: 0 unneeded,0 used
> [13736]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [13736]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [38264] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 196
> [28]PETSC ERROR: to get more information on the crash.
> [6806] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [4128]PETSC ERROR: to get more information on the crash.
> [38264] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 0)/(num_localrows 32768) < 0.6. Do not use CompressedRow routines.
> [60]PETSC ERROR: to get more information on the crash.
> [6802] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> [13740]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and
> https://petsc.org/release/faq/
> [38265] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> MPICH ERROR [Rank 24] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier00021] - Abort(59) (rank 24 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 24
>
> [6807] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 32768 X 0; storage
> space: 0 unneeded,0 used
> [4132]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [38268] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> MPICH ERROR [Rank 56] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier00025] - Abort(59) (rank 56 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 56
>
> [6803] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [13736]PETSC ERROR: to get more information on the crash.
> [38269] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> MPICH ERROR [Rank 28] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier00021] - Abort(59) (rank 28 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 28
>
> [6806] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> [4132]PETSC ERROR: to get more information on the crash.
> [38264] <mat> MatSeqAIJCheckInode(): Found 32768 nodes out of 32768 rows.
> Not using Inode routines
> MPICH ERROR [Rank 60] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier00025] - Abort(59) (rank 60 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 60
>
> aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 60
> [6807] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during
> MatSetValues() is 0
> [13740]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> [38266] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374782
> aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 24
> [6802] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 32768)/(num_localrows 32768) > 0.6. Use CompressedRow routines.
> MPICH ERROR [Rank 4128] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier00609] - Abort(59) (rank 4128 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process 4\
> 128
> [38271] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 56
> [6803] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> MPICH ERROR [Rank 13736] [job id 1277040.1] [Sun Feb 26 22:32:21 2023]
> [frontier01996] - Abort(59) (rank 13736 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 59) - process\
>  13736
>
> [38267] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374782
> aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 28
> [6806] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows
> 32768)/(num_localrows 32768) > 0.6. Use CompressedRow routines.
>
> [38270] <sys> PetscCommDuplicate(): Using internal PETSc communicator
> 1140850689 -2080374781
> [13740]PETSC ERROR: to get more information on the crash.
> [6807] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> [4160]PETSC ERROR:
> ------------------------------------------------------------------------
> [4160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end
>
> On Sat, Feb 25, 2023 at 11:01 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> There is something here. It looks like an error from hypre, but you do
>> have some sort of stack trace.
>> PETSc is catching an error here:
>>
>> [0]PETSC ERROR: #1 VecGetArrayForHYPRE() at
>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/vec/vec/impls/hypre/vhyp.c:95
>>
>> You might send this whole output to PETSc and see if someone can help.
>>
>> Mark
>>
>>
>> On Sat, Feb 25, 2023 at 12:37 PM David Trebotich <dptrebotich at lbl.gov>
>> wrote:
>>
>>> from the 8192 node run:
>>> [0]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> [32016] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [30655] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Invalid argument
>>> [32017] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [30653] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: HYPRE_MEMORY_DEVICE expects a device vector. You need to
>>> enable PETSc device support, for example, in some cases, -vec_type cuda
>>> [32018] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [30649] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [0]PETSC ERROR: WARNING! There are option(s) set that were not used!
>>> Could be the program crashed before they were used or a spelling mistake,
>>> etc!
>>> [32021] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [30654] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Option left: name:-diff_ksp_converged_reason (no value)
>>> [32019] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [30652] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Option left: name:-diff_ksp_max_it value: 50
>>> [32023] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [30650] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Option left: name:-diff_ksp_norm_type value:
>>> unpreconditioned
>>> [32020] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [30648] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [0]PETSC ERROR: Option left: name:-diff_ksp_rtol value: 1.e-6
>>> [32022] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [32021] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Option left: name:-diff_ksp_type value: gmres
>>> [609] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [32017] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [0]PETSC ERROR: Option left: name:-diff_pc_type value: jacobi
>>> [611] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [32018] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: Option left: name:-options_left (no value)
>>> [615] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [32016] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [0]PETSC ERROR: Option left: name:-proj-mac_mat_type value: aijhipsparse
>>> [608] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [610] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> then further down
>>> [10191] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [1978] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #1 VecGetArrayForHYPRE() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/vec/vec/impls/hypre/vhyp.c:95
>>> [10184] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [1977] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [0]PETSC ERROR: #2 VecHYPRE_IJVectorPushVecRead() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/vec/vec/impls/hypre/vhyp.c:138
>>> [10185] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [10186] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #3 PCApply_HYPRE() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/pc/impls/hypre/hypre.c:433
>>> [6081] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [10188] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #4 PCApply() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/pc/interface/precon.c:441
>>> [6083] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10189] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #5 PCApplyBAorAB() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/pc/interface/precon.c:711
>>> [6085] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10190] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #6 KSP_PCApplyBAorAB() at
>>> /gpfs/alpine/world-shared/geo127/petsc_treb/petsc/include/petsc/private/kspimpl.h:416
>>> [6086] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10191] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #7 KSPGMRESCycle() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/ksp/impls/gmres/gmres.c:147
>>> [6087] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10187] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [0]PETSC ERROR: #8 KSPSolve_GMRES() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/ksp/impls/gmres/gmres.c:228
>>> [6080] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [10184] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374780
>>> [0]PETSC ERROR: #9 KSPSolve_Private() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/ksp/interface/itfunc.c:899
>>> [6082] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10185] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374778
>>> [0]PETSC ERROR: #10 KSPSolve() at
>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/ksp/ksp/interface/itfunc.c:1071
>>> [6084] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>> [10189] <sys> PetscCommDuplicate(): Using internal PETSc communicator
>>> 1140850689 -2080374783
>>>
>>>
>>> On Fri, Feb 24, 2023 at 3:57 PM David Trebotich <dptrebotich at lbl.gov>
>>> wrote:
>>>
>>>> Good idea. The global one is unused for small problems. Waiting for the
>>>> large job to run to see if this fixes that problem.
>>>>
>>>> On Fri, Feb 24, 2023 at 11:04 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> I think you added the prefixes like a year ago, so the prefixes should
>>>>> work.
>>>>> Try both and see which one is used with -options_left
>>>>>
>>>>> On Fri, Feb 24, 2023 at 1:14 PM David Trebotich <dptrebotich at lbl.gov>
>>>>> wrote:
>>>>>
>>>>>> I am using 3.18.4.
>>>>>>
>>>>>> Is aijhipsparse the global -mat_type or should this be the prefixed
>>>>>> one for the solve where I was getting the problem, i.e., -proj_mac_mat_type
>>>>>> aijhipsparse
>>>>>>
>>>>>> On Fri, Feb 24, 2023 at 9:28 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>
>>>>>>> Oh, it's 'aijhipsparse'.
>>>>>>> And you definitely want v3.18.4
>>>>>>>
>>>>>>> On Fri, Feb 24, 2023 at 11:29 AM David Trebotich <
>>>>>>> dptrebotich at lbl.gov> wrote:
>>>>>>>
>>>>>>>> I ran a small problem with
>>>>>>>> -proj_mac_mat_type hipsparse
>>>>>>>> and get
>>>>>>>> [10]PETSC ERROR: --------------------- Error Message
>>>>>>>> --------------------------------------------------------------
>>>>>>>> [10]PETSC ERROR: Unknown type. Check for miss-spelling or missing
>>>>>>>> package:
>>>>>>>> https://petsc.org/release/install/install/#external-packages
>>>>>>>> [10]PETSC ERROR: Unknown Mat type given: hipsparse
>>>>>>>>
>>>>>>>> On Fri, Feb 24, 2023 at 4:09 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 24, 2023 at 3:35 AM David Trebotich <
>>>>>>>>> dptrebotich at lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>>> More info from the stack. This is a full machine run on Frontier
>>>>>>>>>> and I get this before I get into the first solve. It may or may not be the
>>>>>>>>>> same error as before, but hopefully there's more here for you to debug.
>>>>>>>>>> [1540]PETSC ERROR:
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> [1540]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>>>>>> Violation, probably memory access out of range
>>>>>>>>>> [1540]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>> [1540]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
>>>>>>>>>> and https://petsc.org/release/faq/
>>>>>>>>>> [1540]PETSC ERROR: ---------------------  Stack Frames
>>>>>>>>>> ------------------------------------
>>>>>>>>>> [1540]PETSC ERROR: The line numbers in the error traceback are
>>>>>>>>>> not always exact.
>>>>>>>>>> [1540]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
>>>>>>>>>> [1540]PETSC ERROR: #2 MatBindToCPU_HYPRE() at
>>>>>>>>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/mat/impls/hypre/mhypre.c:1255
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This looks like the copy to device call:
>>>>>>>>>
>>>>>>>>> src/mat/impls/hypre/mhypre.c:1260:
>>>>>>>>>  PetscCallExternal(hypre_ParCSRMatrixMigrate, parcsr, hmem);
>>>>>>>>>
>>>>>>>>> This makes sense. You assemble it on the host and it gets sent to
>>>>>>>>> the device.
>>>>>>>>>
>>>>>>>>> I assume you are using -mat_type hypre.
>>>>>>>>> To get moving you could try -mat_type hipsparse
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1540]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at
>>>>>>>>>> /gpfs/alpine/geo127/world-shared/petsc_treb/petsc/src/mat/impls/hypre/mhypre.c:1332
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 22, 2023 at 9:26 AM Li, Rui Peng <li50 at llnl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>> I am not sure how much information I get here for this segfault.
>>>>>>>>>>> All I can see is you wanted to migrate (copy) a matrix (on device?) to
>>>>>>>>>>> host, and it failed somewhere in the function. The function itself looks
>>>>>>>>>>> simple and fine to me. We may need to check if everything is sane prior to
>>>>>>>>>>> the point. I am happy to help further.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> -Rui Peng
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov>
>>>>>>>>>>> Sent: Wednesday, February 22, 2023 9:17 AM
>>>>>>>>>>> To: Yang, Ulrike Meier
>>>>>>>>>>> Cc: Li, Rui Peng; MFAdams at LBL.GOV
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>> Hi Ulrike, Rui Peng-
>>>>>>>>>>>
>>>>>>>>>>> I am running into a hypre problem on Frontier. I already passed
>>>>>>>>>>> it by Mark and here is what we get out of the stack:
>>>>>>>>>>> [1704]PETSC ERROR:
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [1704]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>>>>>>> Violation, probably memory access out of range
>>>>>>>>>>> [1704]PETSC ERROR: Try option -start_in_debugger or
>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>> [1704]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
>>>>>>>>>>> and https://petsc.org/release/faq/
>>>>>>>>>>> [1704]PETSC ERROR: ---------------------  Stack Frames
>>>>>>>>>>> ------------------------------------
>>>>>>>>>>> [1704]PETSC ERROR: The line numbers in the error traceback are
>>>>>>>>>>> not always exact.
>>>>>>>>>>> [1704]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
>>>>>>>>>>>
>>>>>>>>>>> and then Mark got this:
>>>>>>>>>>> (new_py-env) 07:24 1 adams/landau-ex1-fix= ~/Codes/petsc2$ git
>>>>>>>>>>> grep hypre_ParCSRMatrixMigrate
>>>>>>>>>>> src/mat/impls/hypre/mhypre.c:
>>>>>>>>>>> PetscCallExternal(hypre_ParCSRMatrixMigrate, parcsr, hmem);
>>>>>>>>>>> src/mat/impls/hypre/mhypre.c:
>>>>>>>>>>> PetscCallExternal(hypre_ParCSRMatrixMigrate,parcsr, HYPRE_MEMORY_HOST);
>>>>>>>>>>>
>>>>>>>>>>> Any help debugging this would be appreciated. Thanks. ANd let me
>>>>>>>>>>> know if you need to be added to my Frontier project for access. I am on
>>>>>>>>>>> through this Friday.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 11, 2023 at 10:35 AM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>> I got on Frontier yesterday. Here's how it went. Had to use
>>>>>>>>>>> PrgEnv-cray to build petsc-hypre. PrgEnv-amd was having some problems. Also
>>>>>>>>>>> their default rocm/5.3.0 was problematic so backed off to rocm/5.2.0.
>>>>>>>>>>> They did make 5.4.0 available yesterday but I stuck with 5.2.0. I got
>>>>>>>>>>> everything built and working. Scaling is excellent thus far. Performance is
>>>>>>>>>>> a little bit better than Crusher. And I am taking the scaling test up to
>>>>>>>>>>> higher concurrencies. Here's the comparison to Crusher. Same scaling test
>>>>>>>>>>> that we have been previously discussing.
>>>>>>>>>>> [image.png]
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 10, 2023 at 9:17 AM Yang, Ulrike Meier <
>>>>>>>>>>> yang11 at llnl.gov<mailto:yang11 at llnl.gov>> wrote:
>>>>>>>>>>> I haven’t seen this before. Is this from PETSc?
>>>>>>>>>>>
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Sent: Friday, February 10, 2023 09:14 AM
>>>>>>>>>>> To: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>;
>>>>>>>>>>> MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>> I am on Frontier today for 10 days. Building petsc-hypre. I do
>>>>>>>>>>> get this warning. ANything I should worry about?
>>>>>>>>>>>
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>                                      ***** WARNING *****
>>>>>>>>>>>   Branch "master" is specified, however remote branch
>>>>>>>>>>> "origin/master" also exists!
>>>>>>>>>>>   Proceeding with using the remote branch. To use the local
>>>>>>>>>>> branch (manually checkout local
>>>>>>>>>>>   branch and) - rerun configure with option
>>>>>>>>>>> --download-hypre-commit=HEAD)
>>>>>>>>>>>
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2023 at 7:09 PM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>> I should also say that the timestep includes other solves as
>>>>>>>>>>> well, like advection and Helmholtz, but the latter is not hypre, rather petsc
>>>>>>>>>>> Jacobi.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2023, 5:44 PM Yang, Ulrike Meier <yang11 at llnl.gov
>>>>>>>>>>> <mailto:yang11 at llnl.gov>> wrote:
>>>>>>>>>>> Great. Thanks for the new figure and explanation
>>>>>>>>>>> Ulrike
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ________________________________
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Sent: Tuesday, February 7, 2023 4:39:01 PM
>>>>>>>>>>> To: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>;
>>>>>>>>>>> MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV> <MFAdams at LBL.GOV<mailto:
>>>>>>>>>>> MFAdams at LBL.GOV>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>> Hi Ulrike-
>>>>>>>>>>> In this scaling problem I use hypre to solve the
>>>>>>>>>>> pressure-Poisson problem in my projection method for incompressible
>>>>>>>>>>> Navier-Stokes. The preconditioner is set up once and re-used. This
>>>>>>>>>>> particular scaling problem is not time-dependent, that is, the grid is not
>>>>>>>>>>> moving so I don't have to redefine solver stencils, etc. I run 10 timesteps
>>>>>>>>>>> of this and average the time.
>>>>>>>>>>>
>>>>>>>>>>> When I did the July runs I thought it was anomalous data because
>>>>>>>>>>> it was slower. But I have seen this before where something may have been
>>>>>>>>>>> updated the previous 6 months and caused an uptick in performance. This
>>>>>>>>>>> anomaly was one of the reasons why I ran this recent test again besides
>>>>>>>>>>> making sure the new hypre release is performing the same. So, let's just
>>>>>>>>>>> forget the July data. Here is Feb 2022 vs. Jan 2023, with either boxes or
>>>>>>>>>>> nodes on x axis:
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 7, 2023 at 3:18 PM Yang, Ulrike Meier <
>>>>>>>>>>> yang11 at llnl.gov<mailto:yang11 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>> I am still trying to understand the figures and your use of
>>>>>>>>>>> hypre:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When you are using hypre, do you just solve one system? Or is
>>>>>>>>>>> this a time dependent problem where you need to solve systems many times?
>>>>>>>>>>>
>>>>>>>>>>> If the latter do you set up the preconditioner once and reuse
>>>>>>>>>>> it, or do you set up every time?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Now I have some questions about the figures:
>>>>>>>>>>>
>>>>>>>>>>> It seems in this plot you are using 256 boxes per node and get
>>>>>>>>>>> better performance with hypre in July 2022 than in February 2022. Is this
>>>>>>>>>>> correct?
>>>>>>>>>>>
>>>>>>>>>>> Here performance in July 2022 is worse than in February 2022
>>>>>>>>>>> using 512 boxes per node:
>>>>>>>>>>>
>>>>>>>>>>> Performance is now back to previous better performance. I really
>>>>>>>>>>> wonder what happened in July. Do you have any idea? But the numbers of
>>>>>>>>>>> February 2022 are similar to what you have in the plot you sent below.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Sent: Wednesday, February 1, 2023 06:00 PM
>>>>>>>>>>> To: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>;
>>>>>>>>>>> MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'd be glad to show you the data in case you're interested.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 1, 2023 at 5:55 PM Yang, Ulrike Meier <
>>>>>>>>>>> yang11 at llnl.gov<mailto:yang11 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Never mind. I read your new message before the one you sent
>>>>>>>>>>> before. So, the figures are correct then
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ________________________________
>>>>>>>>>>>
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Sent: Wednesday, February 1, 2023 5:08:03 PM
>>>>>>>>>>> To: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>;
>>>>>>>>>>> MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV> <MFAdams at LBL.GOV<mailto:
>>>>>>>>>>> MFAdams at LBL.GOV>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Just checked. That was a different scaling plot where the weak
>>>>>>>>>>> scaling started with N=2 nodes for the 512 box problem (not N=1). So, I can
>>>>>>>>>>> do the same for the new executable and see what we get. Should have
>>>>>>>>>>> labelled the previous figure with more detail because with log scale it is
>>>>>>>>>>> difficult to see the abscissa of the first data point.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In previous weak scaling I have put several on one plot and
>>>>>>>>>>> annotate with the starting node count for each curve:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 1, 2023 at 4:56 PM Yang, Ulrike Meier <
>>>>>>>>>>> yang11 at llnl.gov<mailto:yang11 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>> I was referring to the figure below in my previous email.
>>>>>>>>>>>
>>>>>>>>>>> The timings are different, so you were probably running
>>>>>>>>>>> something a bit different, but it shows some nice improvement.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Ulrike
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>
>>>>>>>>>>> Sent: Tuesday, July 26, 2022 4:35 PM
>>>>>>>>>>> To: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>; MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV>
>>>>>>>>>>> Cc: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thank you for the scaling result which looks nice. The slight
>>>>>>>>>>> performance improvement was probably from recent code optimizations.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> About 64 integers, I assumed you were talking about hypre’s
>>>>>>>>>>> bigInt option on GPUs (correct me if I'm wrong). I don’t see why you have to
>>>>>>>>>>> use it instead of mixedInt. I believe mixedInt can handle as big problems
>>>>>>>>>>> as bigInt can do (@Ulrike is it correct?). Having a 60B or 300B global size
>>>>>>>>>>> doesn’t seem to be an obstacle to me for mixedInt.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hope this makes sense.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> . -- .- .. .-.. / ..-. .-. --- -- / .-. ..- .. .--. . -. --. /
>>>>>>>>>>> .-.. ..
>>>>>>>>>>>
>>>>>>>>>>> Rui Peng Li
>>>>>>>>>>>
>>>>>>>>>>> Center for Applied Scientific Computing
>>>>>>>>>>>
>>>>>>>>>>> Lawrence Livermore National Laboratory
>>>>>>>>>>>
>>>>>>>>>>> P.O. Box 808, L-561 Livermore, CA 94551
>>>>>>>>>>>
>>>>>>>>>>> phone - (925) 422-6037,  email - li50 at llnl.gov<mailto:
>>>>>>>>>>> li50 at llnl.gov>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Date: Tuesday, July 26, 2022 at 3:40 PM
>>>>>>>>>>> To: MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV> <MFAdams at LBL.GOV
>>>>>>>>>>> <mailto:MFAdams at LBL.GOV>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>, Yang,
>>>>>>>>>>> Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>> Ok, looks like my build has worked. It reproduces the weak
>>>>>>>>>>> scaling numbers that I had in Feb and in May and in fact the times are
>>>>>>>>>>> slightly better.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ruipeng/Ulrike: 64 integers seems to be the sticking point for
>>>>>>>>>>> me since my runs are high d.o.f. and they're only going to get bigger so
>>>>>>>>>>> having hypre run with 64 int on GPU is probably needed. The largest problem
>>>>>>>>>>> that I run for my scaling test on Crusher is about 6B dof on 128 nodes. On
>>>>>>>>>>> Frontier we will certainly be 10x that problem size and probably 50x.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mark: I still would like to get an official build from you when
>>>>>>>>>>> you get back from vacation just to have that in a safe place and to make
>>>>>>>>>>> sure we are on the same page.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here's the configure file I used:
>>>>>>>>>>>
>>>>>>>>>>> #!/usr/bin/python3
>>>>>>>>>>> if __name__ == '__main__':
>>>>>>>>>>>   import sys
>>>>>>>>>>>   import os
>>>>>>>>>>>   sys.path.insert(0, os.path.abspath('config'))
>>>>>>>>>>>   import configure
>>>>>>>>>>>   configure_options = [
>>>>>>>>>>>     '--download-hypre',
>>>>>>>>>>>     '--download-hypre-commit=master',
>>>>>>>>>>>     '--download-hypre-configure-arguments=--enable-bigint=no
>>>>>>>>>>> --enable-mixedint=yes',
>>>>>>>>>>>
>>>>>>>>>>> '--prefix=/gpfs/alpine/world-shared/geo127/petsc_treb/arch-crusher-amd-opt-int64-master',
>>>>>>>>>>>     '--with-64-bit-indices=1',
>>>>>>>>>>>     '--with-cc=cc',
>>>>>>>>>>>     '--with-cxx=CC',
>>>>>>>>>>>     '--with-debugging=0',
>>>>>>>>>>>     '--with-fc=ftn',
>>>>>>>>>>>     '--with-hip',
>>>>>>>>>>>     '--with-hipc=hipcc',
>>>>>>>>>>>     '--with-mpiexec=srun',
>>>>>>>>>>>     'LIBS=-L/opt/cray/pe/mpich/8.1.16/gtl/lib -lmpi_gtl_hsa',
>>>>>>>>>>>     'PETSC_ARCH=arch-olcf-crusher-amd-opt-int64-master',
>>>>>>>>>>>
>>>>>>>>>>> 'PETSC_DIR=/gpfs/alpine/world-shared/geo127/petsc_treb/petsc',
>>>>>>>>>>>   ]
>>>>>>>>>>>   configure.petsc_configure(configure_options)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And here's the module list:
>>>>>>>>>>>
>>>>>>>>>>> Currently Loaded Modules:
>>>>>>>>>>>   1) craype-x86-trento
>>>>>>>>>>>   2) libfabric/1.15.0.0
>>>>>>>>>>>   3) craype-network-ofi
>>>>>>>>>>>   4) perftools-base/22.05.0
>>>>>>>>>>>   5) xpmem/2.4.4-2.3_2.12__gff0e1d9.shasta
>>>>>>>>>>>   6) cray-pmi/6.1.2
>>>>>>>>>>>   7) rocm/5.1.0
>>>>>>>>>>>   8) subversion/1.14.1
>>>>>>>>>>>   9) emacs/28.1
>>>>>>>>>>>  10) amd/5.1.0
>>>>>>>>>>>  11) craype/2.7.15
>>>>>>>>>>>  12) cray-dsmml/0.2.2
>>>>>>>>>>>  13) cray-mpich/8.1.16
>>>>>>>>>>>  14) cray-libsci/21.08.1.2
>>>>>>>>>>>  15) PrgEnv-amd/8.3.3
>>>>>>>>>>>  16) xalt/1.3.0
>>>>>>>>>>>  17) DefApps/default
>>>>>>>>>>>  18) cray-hdf5-parallel/1.12.1.1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 8:42 PM Mark Adams <mfadams at lbl.gov
>>>>>>>>>>> <mailto:mfadams at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> '--download-hypre-commit=master',
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  You might want:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> '--download-hypre-commit=origin/master',
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But, You should ask questions on the mailing list petsc-maint <
>>>>>>>>>>> petsc-maint at mcs.anl.gov<mailto:petsc-maint at mcs.anl.gov>> (not
>>>>>>>>>>> archived).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> ps, I am on vacation and will be back on the 1st
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 9:00 PM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am not getting anywhere with this. I'll have to wait for Mark
>>>>>>>>>>> to do the petsc build with hypre.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I tried the following to get the hypre master branch but I am
>>>>>>>>>>> not sure if this is the right incantation:
>>>>>>>>>>>
>>>>>>>>>>> '--download-hypre',
>>>>>>>>>>>
>>>>>>>>>>> '--download-hypre-commit=master',
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I did get a build with that but still get same problem with
>>>>>>>>>>> scaling.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here's my configure script:
>>>>>>>>>>>
>>>>>>>>>>> #!/usr/bin/python3
>>>>>>>>>>> if __name__ == '__main__':
>>>>>>>>>>>   import sys
>>>>>>>>>>>   import os
>>>>>>>>>>>   sys.path.insert(0, os.path.abspath('config'))
>>>>>>>>>>>   import configure
>>>>>>>>>>>   configure_options = [
>>>>>>>>>>>     '--download-hypre',
>>>>>>>>>>>     '--download-hypre-commit=master',
>>>>>>>>>>>     '--download-hypre-configure-arguments=--enable-bigint=no
>>>>>>>>>>> --enable-mixedint=yes',
>>>>>>>>>>>
>>>>>>>>>>> '--prefix=/gpfs/alpine/world-shared/geo127/petsc_treb/arch-crusher-amd-opt-int64-master',
>>>>>>>>>>>     '--with-64-bit-indices=1',
>>>>>>>>>>>     '--with-cc=cc',
>>>>>>>>>>>     '--with-cxx=CC',
>>>>>>>>>>>     '--with-debugging=0',
>>>>>>>>>>>     '--with-fc=ftn',
>>>>>>>>>>>     '--with-hip',
>>>>>>>>>>>     '--with-hipc=hipcc',
>>>>>>>>>>>     '--with-mpiexec=srun',
>>>>>>>>>>>     'LIBS=-L/opt/cray/pe/mpich/8.1.16/gtl/lib -lmpi_gtl_hsa',
>>>>>>>>>>>     'PETSC_ARCH=arch-olcf-crusher-amd-opt-int64-master',
>>>>>>>>>>>   ]
>>>>>>>>>>>   configure.petsc_configure(configure_options)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Currently Loaded Modules:
>>>>>>>>>>>   1) craype-x86-trento
>>>>>>>>>>>   2) libfabric/1.15.0.0
>>>>>>>>>>>   3) craype-network-ofi
>>>>>>>>>>>   4) perftools-base/22.05.0
>>>>>>>>>>>   5) xpmem/2.4.4-2.3_2.12__gff0e1d9.shasta
>>>>>>>>>>>   6) cray-pmi/6.1.2
>>>>>>>>>>>   7) emacs/27.2
>>>>>>>>>>>   8) rocm/5.1.0
>>>>>>>>>>>   9) subversion/1.14.1
>>>>>>>>>>>  10) amd/5.1.0
>>>>>>>>>>>  11) craype/2.7.15
>>>>>>>>>>>  12) cray-dsmml/0.2.2
>>>>>>>>>>>  13) cray-mpich/8.1.16
>>>>>>>>>>>  14) cray-libsci/21.08.1.2
>>>>>>>>>>>  15) PrgEnv-amd/8.3.3
>>>>>>>>>>>  16) xalt/1.3.0
>>>>>>>>>>>  17) DefApps/default
>>>>>>>>>>>  18) cray-hdf5-parallel/1.12.1.1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 11:20 AM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> That was the wrong mpich. I got much further in the configure. How
>>>>>>>>>>> do I know if I got the master branch of hypre?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 10:46 AM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I use the following configure:
>>>>>>>>>>>
>>>>>>>>>>> #!/usr/bin/python3
>>>>>>>>>>> if __name__ == '__main__':
>>>>>>>>>>>   import sys
>>>>>>>>>>>   import os
>>>>>>>>>>>   sys.path.insert(0, os.path.abspath('config'))
>>>>>>>>>>>   import configure
>>>>>>>>>>>   configure_options = [
>>>>>>>>>>>     '--download-hypre',
>>>>>>>>>>>     '--download-hypre-commit=master',
>>>>>>>>>>>     '--download-hypre-configure-arguments=--enable-bigint=no
>>>>>>>>>>> --enable-mixedint=yes',
>>>>>>>>>>>
>>>>>>>>>>> '--prefix=/gpfs/alpine/world-shared/geo127/petsc_treb/arch-crusher-cray-opt-int64-master',
>>>>>>>>>>>     '--with-64-bit-indices=1',
>>>>>>>>>>>     '--with-cc=cc',
>>>>>>>>>>>     '--with-cxx=CC',
>>>>>>>>>>>     '--with-debugging=0',
>>>>>>>>>>>     '--with-fc=ftn',
>>>>>>>>>>>     '--with-hip',
>>>>>>>>>>>     '--with-hipc=hipcc',
>>>>>>>>>>>     '--with-mpiexec=srun',
>>>>>>>>>>>     'LIBS=-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa',
>>>>>>>>>>>     'PETSC_ARCH=arch-olcf-crusher-cray-opt-int64-master',
>>>>>>>>>>>   ]
>>>>>>>>>>>   configure.petsc_configure(configure_options)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and get:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>                          Configuring PETSc to compile on your
>>>>>>>>>>> system
>>>>>>>>>>>
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>   ***** WARNING: Using default optimization C flags -O
>>>>>>>>>>>   You might consider manually setting optimal optimization flags for your system
>>>>>>>>>>>   with COPTFLAGS="optimization flags"; see config/examples/arch-*-opt.py for examples
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>   ***** WARNING: Using default Cxx optimization flags -O
>>>>>>>>>>>   You might consider manually setting optimal optimization flags for your system
>>>>>>>>>>>   with CXXOPTFLAGS="optimization flags"; see config/examples/arch-*-opt.py for examples
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>   ***** WARNING: Using default FORTRAN optimization flags -O
>>>>>>>>>>>   You might consider manually setting optimal optimization flags for your system
>>>>>>>>>>>   with FOPTFLAGS="optimization flags"; see config/examples/arch-*-opt.py for examples
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>>   ***** WARNING: Using default HIP optimization flags -g -O3
>>>>>>>>>>>   You might consider manually setting optimal optimization flags for your system
>>>>>>>>>>>   with HIPOPTFLAGS="optimization flags"; see config/examples/arch-*-opt.py for examples
>>>>>>>>>>> =============================================================================================
>>>>>>>>>>> TESTING: checkFortranLibraries from
>>>>>>>>>>> config.compilers(config/BuildSystem/config/compilers.py:835)
>>>>>>>>>>>
>>>>>>>>>>>  *******************************************************************************
>>>>>>>>>>>                     OSError while running ./configure
>>>>>>>>>>>
>>>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>>> Cannot run executables created with FC. If this machine uses a
>>>>>>>>>>> batch system
>>>>>>>>>>> to submit jobs you will need to configure using ./configure with
>>>>>>>>>>> the additional option  --with-batch.
>>>>>>>>>>> Otherwise there is problem with the compilers. Can you compile
>>>>>>>>>>> and run code with your compiler 'ftn'?
>>>>>>>>>>>
>>>>>>>>>>> *******************************************************************************
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 9:50 AM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think recent builds have been hypre v2.25
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 9:49 AM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> so instead of just
>>>>>>>>>>>
>>>>>>>>>>>     '--download-hypre',
>>>>>>>>>>>
>>>>>>>>>>> add
>>>>>>>>>>>
>>>>>>>>>>>     '--download-hypre',
>>>>>>>>>>>     '--download-hypre-commit=master',
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ???
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2022 at 9:47 AM Li, Rui Peng <li50 at llnl.gov
>>>>>>>>>>> <mailto:li50 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> As Ulrike said, AMD recently found bugs regarding the bigInt
>>>>>>>>>>> issue, which have been fixed in the current master. I suggest using the
>>>>>>>>>>> master branch of hypre if possible.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> . -- .- .. .-.. / ..-. .-. --- -- / .-. ..- .. .--. . -. --. /
>>>>>>>>>>> .-.. ..
>>>>>>>>>>>
>>>>>>>>>>> Rui Peng Li
>>>>>>>>>>>
>>>>>>>>>>> Center for Applied Scientific Computing
>>>>>>>>>>>
>>>>>>>>>>> Lawrence Livermore National Laboratory
>>>>>>>>>>>
>>>>>>>>>>> P.O. Box 808, L-561 Livermore, CA 94551
>>>>>>>>>>>
>>>>>>>>>>> phone - (925) 422-6037,  email - li50 at llnl.gov<mailto:
>>>>>>>>>>> li50 at llnl.gov>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Yang, Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov
>>>>>>>>>>> >>
>>>>>>>>>>> Date: Thursday, July 21, 2022 at 9:41 AM
>>>>>>>>>>> To: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>, Li, Rui Peng <li50 at llnl.gov<mailto:
>>>>>>>>>>> li50 at llnl.gov>>
>>>>>>>>>>> Cc: MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV> <MFAdams at LBL.GOV
>>>>>>>>>>> <mailto:MFAdams at LBL.GOV>>
>>>>>>>>>>> Subject: RE: Frontier
>>>>>>>>>>>
>>>>>>>>>>> Actually, I think it was 2000 nodes!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Yang, Ulrike Meier
>>>>>>>>>>> Sent: Thursday, July 21, 2022 9:40 AM
>>>>>>>>>>> To: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>; Li, Rui Peng <li50 at llnl.gov<mailto:
>>>>>>>>>>> li50 at llnl.gov>>
>>>>>>>>>>> Cc: MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV>
>>>>>>>>>>> Subject: RE: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Which version of hypre are you using for this?
>>>>>>>>>>>
>>>>>>>>>>> We recently found one bug in the mixed-int version; however, that
>>>>>>>>>>> should also have been an issue in your previous runs, which apparently were
>>>>>>>>>>> working.
>>>>>>>>>>>
>>>>>>>>>>> Note that recent runs by AMD on Frontier with hypre were
>>>>>>>>>>> successful on more than 200 nodes using mixed-int, so we should be able to
>>>>>>>>>>> get this to work somehow for you guys. They also found the bug in mixed-int.
>>>>>>>>>>>
>>>>>>>>>>> Ulrike
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Sent: Thursday, July 21, 2022 9:30 AM
>>>>>>>>>>> To: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>
>>>>>>>>>>> Cc: MFAdams at LBL.GOV<mailto:MFAdams at LBL.GOV>; Yang, Ulrike Meier
>>>>>>>>>>> <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Ruipeng and Ulrike
>>>>>>>>>>>
>>>>>>>>>>> You asked if we need 64 int for gpus and I think we definitely
>>>>>>>>>>> do need it. Currently I cannot scale past that 2B degree of freedom mark
>>>>>>>>>>> that you mentioned. I am not sure what happened between Mark's Cray build
>>>>>>>>>>> in February and his amd build in May but currently I cannot scale past 32
>>>>>>>>>>> nodes on Crusher. This is unfortunate because given the success over the
>>>>>>>>>>> past 6 months I have told ECP that we are fully ready for Frontier. Now, we
>>>>>>>>>>> are not. Hopefully we can figure this out pretty soon and be ready to take
>>>>>>>>>>> a shot on Frontier when they let us on.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 18, 2022 at 5:03 PM Li, Rui Peng <li50 at llnl.gov
>>>>>>>>>>> <mailto:li50 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Building with unified memory will *not* change the default
>>>>>>>>>>> parameters of AMG. Are you using the master branch of hypre or some release
>>>>>>>>>>> version? I think our previous fix should be included in the latest release.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if I can further help
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> . -- .- .. .-.. / ..-. .-. --- -- / .-. ..- .. .--. . -. --. /
>>>>>>>>>>> .-.. ..
>>>>>>>>>>>
>>>>>>>>>>> Rui Peng Li
>>>>>>>>>>>
>>>>>>>>>>> Center for Applied Scientific Computing
>>>>>>>>>>>
>>>>>>>>>>> Lawrence Livermore National Laboratory
>>>>>>>>>>>
>>>>>>>>>>> P.O. Box 808, L-561 Livermore, CA 94551
>>>>>>>>>>>
>>>>>>>>>>> phone - (925) 422-6037,  email - li50 at llnl.gov<mailto:
>>>>>>>>>>> li50 at llnl.gov>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Mark Adams <mfadams at lbl.gov<mailto:mfadams at lbl.gov>>
>>>>>>>>>>> Date: Monday, July 18, 2022 at 1:55 PM
>>>>>>>>>>> To: David Trebotich <dptrebotich at lbl.gov<mailto:
>>>>>>>>>>> dptrebotich at lbl.gov>>
>>>>>>>>>>> Cc: Li, Rui Peng <li50 at llnl.gov<mailto:li50 at llnl.gov>>, Yang,
>>>>>>>>>>> Ulrike Meier <yang11 at llnl.gov<mailto:yang11 at llnl.gov>>
>>>>>>>>>>> Subject: Re: Frontier
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 18, 2022 at 4:35 PM David Trebotich <
>>>>>>>>>>> dptrebotich at lbl.gov<mailto:dptrebotich at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> When I run with Mark's newest build, I get stuck in the nnz
>>>>>>>>>>> bin counts for the first solve (proj_mac). Here's the stack:
>>>>>>>>>>>
>>>>>>>>>>> [0]PETSC ERROR: #1 jac->setup() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/hypre/hypre.c:420
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is the same place where we got this hypre error "(12)"
>>>>>>>>>>> before.
>>>>>>>>>>>
>>>>>>>>>>> Recall this error message means that there is a zero row in the
>>>>>>>>>>> matrix.
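A minimal diagnostic sketch for that situation, assuming the matrix is already
assembled before it is handed to hypre; MatFindZeroRows() is the PETSc call,
and the helper name below is illustrative only, not from the application code:

    #include <petscmat.h>

    /* Sketch: report how many locally empty rows an assembled matrix has,
       so a zero row can be caught before the hypre setup is reached. */
    static PetscErrorCode ReportZeroRows(Mat A)
    {
      IS       zerorows;
      PetscInt nzero;

      PetscFunctionBeginUser;
      PetscCall(MatFindZeroRows(A, &zerorows)); /* rows with no nonzero entries */
      PetscCall(ISGetLocalSize(zerorows, &nzero));
      if (nzero > 0) PetscCall(PetscPrintf(PETSC_COMM_SELF, "%d zero rows on this rank\n", (int)nzero));
      PetscCall(ISDestroy(&zerorows));
      PetscFunctionReturn(0);
    }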
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I may have been using the master branch of hypre when I built
>>>>>>>>>>> that working version.
>>>>>>>>>>>
>>>>>>>>>>> Maybe this branch was fixed to accept zero rows?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Rui Peng: I am building with UVM now. Does that change the
>>>>>>>>>>> defaults in hypre?
>>>>>>>>>>>
>>>>>>>>>>> For instance, does hypre use Falgout coarsening if UVM is
>>>>>>>>>>> available?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [0]PETSC ERROR: #2 PCSetUp_HYPRE() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/hypre/hypre.c:237
>>>>>>>>>>> [0]PETSC ERROR: #3 PCSetUp() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:949
>>>>>>>>>>> [0]PETSC ERROR: #4 KSPSetUp() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:314
>>>>>>>>>>> [0]PETSC ERROR: #5 KSPSolve_Private() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:792
>>>>>>>>>>> [0]PETSC ERROR: #6 KSPSolve() at
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1061
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And here's my .petscrc:
>>>>>>>>>>>
>>>>>>>>>>> -help
>>>>>>>>>>>
>>>>>>>>>>> -proj_mac_mat_type hypre
>>>>>>>>>>> -proj_mac_pc_type hypre
>>>>>>>>>>> -proj_mac_pc_hypre_type boomeramg
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_no_CF
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_agg_nl 0
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_coarsen_type PMIS
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_interp_type ext+i
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_print_statistics
>>>>>>>>>>> -proj_mac_pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi
>>>>>>>>>>> -proj_mac_pc_hypre_SetSpGemmUseCusparse 0
>>>>>>>>>>>
>>>>>>>>>>> -proj_mac_ksp_type gmres
>>>>>>>>>>> -proj_mac_ksp_max_it 50
>>>>>>>>>>> -proj_mac_ksp_rtol 1.e-12
>>>>>>>>>>> -proj_mac_ksp_atol 1.e-30
>>>>>>>>>>>
>>>>>>>>>>> -use_gpu_aware_mpi 0
>>>>>>>>>>>
>>>>>>>>>>> -info
>>>>>>>>>>> -log_view
>>>>>>>>>>> -history PETSc.history
>>>>>>>>>>> -options_left
>>>>>>>>>>>
>>>>>>>>>>> -visc_pc_type jacobi
>>>>>>>>>>>
>>>>>>>>>>> -visc_pc_hypre_type boomeramg
>>>>>>>>>>> -visc_ksp_type gmres
>>>>>>>>>>> -visc_ksp_max_it 50
>>>>>>>>>>> -visc_ksp_rtol 1.e-12
>>>>>>>>>>>
>>>>>>>>>>> -diff_pc_type jacobi
>>>>>>>>>>> -diff_pc_hypre_type boomeramg
>>>>>>>>>>> -diff_ksp_type gmres
>>>>>>>>>>> -diff_ksp_max_it 50
>>>>>>>>>>> -diff_ksp_rtol 1.e-6
>>>>>>>>>>>
>>>>>>>>>>> -proj_mac_ksp_converged_reason
>>>>>>>>>>> -visc_ksp_converged_reason
>>>>>>>>>>> -diff_ksp_converged_reason
>>>>>>>>>>> -proj_mac_ksp_norm_type unpreconditioned
>>>>>>>>>>> -diff_ksp_norm_type unpreconditioned
>>>>>>>>>>> -visc_ksp_norm_type unpreconditioned
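For reference, a hedged sketch of how prefixed options such as the proj_mac_
entries above are typically picked up on the application side; the actual
application code is not shown in this thread, so the snippet below is
illustrative only:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      KSP proj_mac;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      /* Give this solver its own prefix so only -proj_mac_* options apply,
         e.g. -proj_mac_ksp_type gmres and -proj_mac_pc_type hypre. */
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &proj_mac));
      PetscCall(KSPSetOptionsPrefix(proj_mac, "proj_mac_"));
      PetscCall(KSPSetFromOptions(proj_mac));
      /* ... set operators and call KSPSolve() in the real application ... */
      PetscCall(KSPDestroy(&proj_mac));
      PetscCall(PetscFinalize());
      return 0;
    }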
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 18, 2022 at 1:30 PM Mark Adams <mfadams at lbl.gov
>>>>>>>>>>> <mailto:mfadams at lbl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 18, 2022 at 4:18 PM Li, Rui Peng <li50 at llnl.gov
>>>>>>>>>>> <mailto:li50 at llnl.gov>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, there is no need for enable-unified-memory unless you want
>>>>>>>>>>> to use a non-GPU-supported AMG parameter (such as Falgout coarsening),
>>>>>>>>>>> which needs unified memory since it will run on the CPU.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Got it. Will not use UVM.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> UVM is unified memory. Our expert from AMD told us to not use
>>>>>>>>>>> unified memory yet. Maybe it's working now, but I haven't tried.
>>>>>>>>>>>
>>>>>>>>>>> 64-bit integers: Sorry, I did not make that clear. "mixed-int" is a
>>>>>>>>>>> more efficient approach for problems with > 2B dofs, where the local
>>>>>>>>>>> integer type is kept at 32 bits while the global one is 64 bits. It is
>>>>>>>>>>> the only approach we currently support on GPUs. hypre also has
>>>>>>>>>>> "--enable-bigint", which makes all the integers (local and global)
>>>>>>>>>>> 64-bit, but we do not support that on GPUs. For some users it is
>>>>>>>>>>> difficult to handle two integer types in their code (as mixed-int
>>>>>>>>>>> requires), so they prefer the old big-int approach. That's why I was
>>>>>>>>>>> asking. If mixed-int works for you, that's ideal. No need to bother.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I see. I only care about the interface, so the current parameters
>>>>>>>>>>> are fine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --enable-bigint=no --enable-mixedint=yes
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think PETSc should always use this when built with 64-bit ints, because
>>>>>>>>>>> we only care about the interface and I trust the local problem will be < 2B.
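A hedged sketch of the corresponding build flags: the hypre flags are the ones
quoted above, --with-64-bit-indices and --download-hypre are standard PETSc
configure options, and any exact combination for a given machine is an
assumption here rather than something stated in the thread:

    # hypre "mixed-int": 64-bit global indices, 32-bit local indices
    ./configure --enable-mixedint --enable-bigint=no

    # PETSc with 64-bit PetscInt on the interface side, building hypre itself
    ./configure --with-64-bit-indices=1 --download-hypre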
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>
> --
> ----------------------
> David Trebotich
> Lawrence Berkeley National Laboratory
> Computational Research Division
> Applied Numerical Algorithms Group
> treb at lbl.gov
> (510) 486-5984 office
> (510) 384-6868 mobile
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20230227/8a1c6ae2/attachment-0001.html>


More information about the petsc-users mailing list