[petsc-dev] spock

Satish Balay balay at mcs.anl.gov
Fri Dec 10 10:46:23 CST 2021


On Fri, 10 Dec 2021, Mark Adams wrote:

> I was able to run a parallel test manually.
> 
> Do you have any thoughts on Kokkos?
> '--with-kokkos-hip-arch=VEGA908',

Configure sets that automatically from --with-hip-arch, which is itself auto-detected from 'rocminfo' [this appears to work on spock].

Perhaps it should also set --with-magma-gputarget the same way.
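
A rough way to see what the auto-detection has to work with, as a sketch only: it assumes rocminfo prints agent names of the form gfxNNN (presumably gfx908 here, given the VEGA908 setting above), and the helper name detect_gfx_arch is made up for illustration, not something configure provides.

```shell
# Hypothetical helper: pull the unique gfx targets out of rocminfo-style
# text on stdin (assumes agent names look like "gfx908").
detect_gfx_arch() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only query the hardware when rocminfo is actually on PATH.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | detect_gfx_arch
fi
```

If this prints nothing on a compute node, the auto-detection is unlikely to find anything either, and passing --with-hip-arch by hand would be the fallback.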

> On Fri, Dec 10, 2021 at 11:08 AM Mark Adams <mfadams at lbl.gov> wrote:
> 
> > It seems to be hanging on the 2 processor test.
> > I'll try running jobs manually.


Hm - perhaps the srun command you need is different?

'--with-mpiexec=srun -p ecp -N 1 -A csc314 -t 00:10:00'
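
For trying the 2 process case by hand, a minimal sketch along these lines may help. It assumes the same partition, account and time limit as the option above; launch_cmd is a hypothetical helper that only prints the launch line so it can be inspected before running, and -n is srun's task-count flag.

```shell
# Assumption: same srun settings as the --with-mpiexec option above.
MPIEXEC="srun -p ecp -N 1 -A csc314 -t 00:10:00"

# Hypothetical helper: print the launch line for N ranks instead of running it.
launch_cmd() {
  nranks=$1; shift
  echo "${MPIEXEC} -n ${nranks} $*"
}

# Show what a 2-rank run of ex19 would look like.
launch_cmd 2 ./ex19
```

If the printed command hangs when run directly, the srun flags probably need adjusting for your allocation rather than anything in PETSc.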

Satish

> >
> > On Fri, Dec 10, 2021 at 9:34 AM Satish Balay <balay at mcs.anl.gov> wrote:
> >
> >> Merged now. And the following now works [for me].
> >>
> >>  git fetch -p
> >>  git checkout origin/main
> >>  ./config/examples/arch-olcf-spock.py && make
> >>  MPIR_CVAR_GPU_EAGER_DEVICE_MEM=0 MPICH_GPU_SUPPORT_ENABLED=1 MPICH_SMP_SINGLE_COPY_MODE=CMA make check
> >>
> >> Satish
> >>
> >> On Fri, 10 Dec 2021, Satish Balay via petsc-dev wrote:
> >>
> >> > Works for me [per instructions in balay/update-spock,
> >> config/examples/arch-olcf-spock.py] with main - without these additional
> >> options
> >> >
> >> > I'll go ahead and merge in balay/update-spock
> >> >
> >> > Satish
> >> >
> >> > -----
> >> >
> >> >  git fetch -p
> >> >  module load emacs
> >> >  module load rocm/4.3.0
> >> >  git reset --hard
> >> >  git checkout origin/main
> >> >  git merge origin/balay/update-spock
> >> >  ./config/examples/arch-olcf-spock.py && make
> >> >
> >> >
> >> >
> >> > [balay at login2.spock petsc]$ MPIR_CVAR_GPU_EAGER_DEVICE_MEM=0
> >> MPICH_GPU_SUPPORT_ENABLED=1 MPICH_SMP_SINGLE_COPY_MODE=CMA make check
> >> > Running check examples to verify correct installation
> >> > Using PETSC_DIR=/autofs/nccs-svm1_home1/balay/petsc and
> >> PETSC_ARCH=arch-olcf-spock
> >> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI
> >> process
> >> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
> >> processes
> >> > C/C++ example src/snes/tutorials/ex3k run successfully with
> >> kokkos-kernels
> >> > *******************Error detected during compile or
> >> link!*******************
> >> > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >> > /ccs/home/balay/petsc/src/snes/tutorials ex5f
> >> > *********************************************************
> >> > ftn -fPIC   -fPIC    -I/autofs/nccs-svm1_home1/balay/petsc/include
> >> -I/autofs/nccs-svm1_home1/balay/petsc/arch-olcf-spock/include
> >> -I/opt/rocm-4.3.0/include     ex5f.F90
> >> -Wl,-rpath,/autofs/nccs-svm1_home1/balay/petsc/arch-olcf-spock/lib
> >> -L/autofs/nccs-svm1_home1/balay/petsc/arch-olcf-spock/lib
> >> -Wl,-rpath,/autofs/nccs-svm1_home1/balay/petsc/arch-olcf-spock/lib
> >> -L/autofs/nccs-svm1_home1/balay/petsc/arch-olcf-spock/lib
> >> -Wl,-rpath,/opt/rocm-4.3.0/lib -L/opt/rocm-4.3.0/lib
> >> -Wl,-rpath,/opt/cray/pe/mpich/8.1.10/gtl/lib
> >> -L/opt/cray/pe/mpich/8.1.10/gtl/lib
> >> -Wl,-rpath,/opt/cray/pe/gcc/8.1.0/snos/lib64
> >> -L/opt/cray/pe/gcc/8.1.0/snos/lib64 -Wl,-rpath,/opt/cray/pe/libsci/
> >> 21.08.1.2/CRAY/9.0/x86_64/lib -L/opt/cray/pe/libsci/
> >> 21.08.1.2/CRAY/9.0/x86_64/lib
> >> -Wl,-rpath,/opt/cray/pe/mpich/8.1.10/ofi/cray/10.0/lib
> >> -L/opt/cray/pe/mpich/8.1.10/ofi/cray/10.0/lib
> >> -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> >> -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib -Wl,-rpath,/opt/cray/pe/pmi/6.0.14/lib
> >> -L/opt/cray/pe/pmi/6.0.14/lib
> >> -Wl,-rpath,/opt/cray/pe/cce/12.0.3/cce/x86_64/lib
> >> -L/opt/cray/pe/cce/12.0.3/cce/x86_64/lib
> >> -Wl,-rpath,/opt/cray/xpmem/2.2.40-2.1_2.44__g3cf3325.shasta/lib64
> >> -L/opt/cray/xpmem/2.2.40-2.1_2.44__g3cf3325.shasta/lib64
> >> -Wl,-rpath,/opt/cray/pe/cce/12.0.3/cce-clang/x86_64/lib/clang/12.0.0/lib/linux
> >> -L/opt/cray/pe/cce/12.0.3/cce-clang/x86_64/lib/clang/12.0.0/lib/linux
> >> -Wl,-rpath,/opt/cray/pe/gcc/8.1.0/snos/lib/gcc/x86_64-suse-linux/8.1.0
> >> -L/opt/cray/pe/gcc/8.1.0/snos/lib/gcc/x86_64-suse-linux/8.1.0
> >> -Wl,-rpath,/opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-unknown-linux-gnu/lib
> >> -L/opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-unknown-linux-gnu/lib
> >> -lpetsc -lmagma -lkokkoskernels -lkokkoscontainers -lkokkoscore -lhipsparse
> >> -lhipblas -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64 -lstdc++
> >> -ldl -lmpi_gtl_hsa -lmpifort_cray -lmpi_cray -ldsmml -lpmi -lpmi2 -lxpmem
> >> -lpgas-shmem -lquadmath -lmodules -lfi -lcraymath -lf -lu -lcsup -lgfortran
> >> -lpthread -lgcc_eh -lm -lclang_rt.craypgo-x86_64
> >> -lclang_rt.builtins-x86_64 -lquadmath -lstdc++ -ldl -lmpi_gtl_hsa -o ex5f
> >> > /opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-pc-linux-gnu/bin/ld:
> >> warning: alignment 128 of symbol
> >> `$host_init$$runtime_init_for_iso_c_binding$iso_c_binding_' in
> >> /opt/cray/pe/cce/12.0.3/cce/x86_64/lib/libmodules.so is smaller than 256 in
> >> /tmp/pe_202599/ex5f_1.o
> >> > /opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-pc-linux-gnu/bin/ld:
> >> warning: alignment 64 of symbol `$data_init$iso_c_binding_' in
> >> /opt/cray/pe/cce/12.0.3/cce/x86_64/lib/libmodules.so is smaller than 256 in
> >> /tmp/pe_202599/ex5f_1.o
> >> > Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI
> >> process
> >> > Completed test examples
> >> > [balay at login2.spock petsc]$
> >> >
> >> >
> >> > On Fri, 10 Dec 2021, Mark Adams wrote:
> >> >
> >> > > FWIW,  here is my current status.
> >> > >
> >> > > 08:08 main= spock:/gpfs/alpine/csc314/scratch/adams/petsc$ make
> >> > > PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc
> >> > > PETSC_ARCH=arch-olcf-spock check
> >> > > Running check examples to verify correct installation
> >> > > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and
> >> > > PETSC_ARCH=arch-olcf-spock
> >> > > Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI
> >> process
> >> > > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >> > > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
> >> > >     0 KSP Residual norm 0.0406612
> >> > >     1 KSP Residual norm 0.036923
> >> > >     2 KSP Residual norm 0.0191849
> >> > >     3 KSP Residual norm 0.00201589
> >> > >     4 KSP Residual norm 0.000376045
> >> > >     5 KSP Residual norm 4.2974e-05
> >> > >     6 KSP Residual norm 5.96585e-06
> >> > >     7 KSP Residual norm 4.5398e-07
> >> > >     8 KSP Residual norm 6.30474e-08
> >> > >     9 KSP Residual norm 5.55518e-09
> >> > >    10 KSP Residual norm 6.180e-10
> >> > >    11 KSP Residual norm 6.211e-11
> >> > >   Linear solve converged due to CONVERGED_RTOL iterations 11
> >> > >     0 KSP Residual norm 3.32845e-06
> >> > >     1 KSP Residual norm 9.0003e-07
> >> > >     2 KSP Residual norm 1.32594e-07
> >> > >     3 KSP Residual norm 1.49857e-08
> >> > >     4 KSP Residual norm 1.31887e-09
> >> > >     5 KSP Residual norm 2.105e-10
> >> > >     6 KSP Residual norm 2.827e-11
> >> > >     7 KSP Residual norm < 1.e-11
> >> > >     8 KSP Residual norm < 1.e-11
> >> > >     9 KSP Residual norm < 1.e-11
> >> > >    10 KSP Residual norm < 1.e-11
> >> > >   Linear solve converged due to CONVERGED_RTOL iterations 10
> >> > > Number of SNES iterations = 2
> >> > > Possible error running C/C++ src/snes/tutorials/ex19 with 2 MPI
> >> processes
> >> > > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >> > > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
> >> > >     0 KSP Residual norm 0.0406612
> >> > >     1 KSP Residual norm 0.0281101
> >> > >     2 KSP Residual norm 0.00773873
> >> > >     3 KSP Residual norm 0.00165731
> >> > >     4 KSP Residual norm 0.000395614
> >> > >     5 KSP Residual norm 8.67655e-05
> >> > >     6 KSP Residual norm 1.69495e-05
> >> > >     7 KSP Residual norm 3.70051e-06
> >> > >     8 KSP Residual norm 5.97067e-07
> >> > >     9 KSP Residual norm 1.02242e-07
> >> > >    10 KSP Residual norm 1.75727e-08
> >> > >    11 KSP Residual norm 3.84826e-09
> >> > >    12 KSP Residual norm 6.414e-10
> >> > >    13 KSP Residual norm 1.380e-10
> >> > >   Linear solve converged due to CONVERGED_RTOL iterations 13
> >> > >     0 KSP Residual norm 3.32846e-06
> >> > >     1 KSP Residual norm 8.99139e-07
> >> > >     2 KSP Residual norm 1.72893e-07
> >> > >     3 KSP Residual norm 3.733e-08
> >> > >     4 KSP Residual norm 6.67427e-09
> >> > >     5 KSP Residual norm 1.22785e-09
> >> > >     6 KSP Residual norm 2.551e-10
> >> > >     7 KSP Residual norm 5.458e-11
> >> > >     8 KSP Residual norm 1.050e-11
> >> > >     9 KSP Residual norm < 1.e-11
> >> > >    10 KSP Residual norm < 1.e-11
> >> > >    11 KSP Residual norm < 1.e-11
> >> > >    12 KSP Residual norm < 1.e-11
> >> > >   Linear solve converged due to CONVERGED_RTOL iterations 12
> >> > > Number of SNES iterations = 2
> >> > > 3,5c3,14
> >> > > <   1 SNES Function norm 4.12227e-06
> >> > > <   2 SNES Function norm 6.098e-11
> >> > > < Number of SNES iterations = 2
> >> > > ---
> >> > > >     0 KSP Residual norm 0.0406612
> >> > > >     1 KSP Residual norm 0.21263
> >> > > >     2 KSP Residual norm 1.09192
> >> > > >     3 KSP Residual norm 6.9087
> >> > > >     4 KSP Residual norm 23.4292
> >> > > >     5 KSP Residual norm 57.7558
> >> > > >     6 KSP Residual norm 118.076
> >> > > >     7 KSP Residual norm 213.527
> >> > > >     8 KSP Residual norm 354.101
> >> > > >     9 KSP Residual norm 550.58
> >> > > >   Linear solve did not converge due to DIVERGED_DTOL iterations 9
> >> > > > Number of SNES iterations = 0
> >> > > /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tutorials
> >> > > Possible problem with ex19 running with hypre, diffs above
> >> > > =========================================
> >> > > gmake[3]: [makefile:115: runex3k_kokkos] Error 134 (ignored)
> >> > > 21,25c21,26
> >> > > <   1 SNES Function norm 2.952582418265e-01
> >> > > <   2 SNES Function norm 4.502293658739e-04
> >> > > <   3 SNES Function norm 1.389665806646e-09
> >> > > < Number of SNES iterations = 3
> >> > > < Norm of error 1.49752e-10 Iterations 3
> >> > > ---
> >> > > > Memory access fault by GPU node-4 (Agent handle: 0xb08c90) on
> >> address
> >> > > 0xe17000. Reason: Page not present or supervisor privilege.
> >> > > > Memory access fault by GPU node-5 (Agent handle: 0xb0d3c0) on
> >> address
> >> > > 0xe11000. Reason: Page not present or supervisor privilege.
> >> > > > srun: error: spock25: task 0: Aborted
> >> > > > srun: launch/slurm: _step_signal: Terminating StepId=304034.3
> >> > > > slurmstepd: error: *** STEP 304034.3 ON spock25 CANCELLED AT
> >> > > 2021-12-10T08:08:40 ***
> >> > > > srun: error: spock25: task 1: Aborted (core dumped)
> >> > > /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tutorials
> >> > > Possible problem with ex3k running with kokkos-kernels, diffs above
> >> > > =========================================
> >> > > *******************Error detected during compile or
> >> link!*******************
> >> > > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >> > > /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tutorials ex5f
> >> > > *********************************************************
> >> > > ftn -fPIC   -fPIC    -I/gpfs/alpine/csc314/scratch/adams/petsc/include
> >> > > -I/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-spock/include
> >> > >
> >> -I/gpfs/alpine/geo127/proj-shared/spock/petsc/current/arch-opt-cray/include
> >> > > -I/opt/rocm-4.3.0/include     ex5f.F90
> >> > >
> >> -Wl,-rpath,/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-spock/lib
> >> > > -L/gpfs/alpine/csc314/scratch/adams/petsc/arch-olcf-spock/lib
> >> > >
> >> -Wl,-rpath,/gpfs/alpine/geo127/proj-shared/spock/petsc/current/arch-opt-cray/lib
> >> > >
> >> -L/gpfs/alpine/geo127/proj-shared/spock/petsc/current/arch-opt-cray/lib
> >> > > -Wl,-rpath,/opt/rocm-4.3.0/lib -L/opt/rocm-4.3.0/lib
> >> > > -Wl,-rpath,/opt/cray/pe/mpich/8.1.10/gtl/lib
> >> > > -L/opt/cray/pe/mpich/8.1.10/gtl/lib
> >> > > -Wl,-rpath,/opt/cray/pe/gcc/8.1.0/snos/lib64
> >> > > -L/opt/cray/pe/gcc/8.1.0/snos/lib64 -Wl,-rpath,/opt/cray/pe/libsci/
> >> > > 21.08.1.2/CRAY/9.0/x86_64/lib -L/opt/cray/pe/libsci/
> >> > > 21.08.1.2/CRAY/9.0/x86_64/lib
> >> > > -Wl,-rpath,/opt/cray/pe/mpich/8.1.10/ofi/cray/10.0/lib
> >> > > -L/opt/cray/pe/mpich/8.1.10/ofi/cray/10.0/lib
> >> > > -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> >> > > -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> >> -Wl,-rpath,/opt/cray/pe/pmi/6.0.14/lib
> >> > > -L/opt/cray/pe/pmi/6.0.14/lib
> >> > > -Wl,-rpath,/opt/cray/pe/cce/12.0.3/cce/x86_64/lib
> >> > > -L/opt/cray/pe/cce/12.0.3/cce/x86_64/lib
> >> > > -Wl,-rpath,/opt/cray/xpmem/2.2.40-2.1_2.44__g3cf3325.shasta/lib64
> >> > > -L/opt/cray/xpmem/2.2.40-2.1_2.44__g3cf3325.shasta/lib64
> >> > >
> >> -Wl,-rpath,/opt/cray/pe/cce/12.0.3/cce-clang/x86_64/lib/clang/12.0.0/lib/linux
> >> > > -L/opt/cray/pe/cce/12.0.3/cce-clang/x86_64/lib/clang/12.0.0/lib/linux
> >> > > -Wl,-rpath,/opt/cray/pe/gcc/8.1.0/snos/lib/gcc/x86_64-suse-linux/8.1.0
> >> > > -L/opt/cray/pe/gcc/8.1.0/snos/lib/gcc/x86_64-suse-linux/8.1.0
> >> > >
> >> -Wl,-rpath,/opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-unknown-linux-gnu/lib
> >> > > -L/opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-unknown-linux-gnu/lib
> >> > > -lpetsc -lHYPRE -lkokkoskernels -lkokkoscontainers -lkokkoscore
> >> -lhipsparse
> >> > > -lhipblas -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64
> >> -lstdc++
> >> > > -ldl -lmpi_gtl_hsa -lmpifort_cray -lmpi_cray -ldsmml -lpmi -lpmi2
> >> -lxpmem
> >> > > -lpgas-shmem -lquadmath -lmodules -lfi -lcraymath -lf -lu -lcsup
> >> -lgfortran
> >> > > -lpthread -lgcc_eh -lm -lclang_rt.craypgo-x86_64
> >> -lclang_rt.builtins-x86_64
> >> > > -lquadmath -lstdc++ -ldl -lmpi_gtl_hsa -o ex5f
> >> > > /opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-pc-linux-gnu/bin/ld:
> >> > > warning: alignment 128 of symbol
> >> > > `$host_init$$runtime_init_for_iso_c_binding$iso_c_binding_' in
> >> > > /opt/cray/pe/cce/12.0.3/cce/x86_64/lib/libmodules.so is smaller than
> >> 256 in
> >> > > /tmp/pe_46424/ex5f_1.o
> >> > > /opt/cray/pe/cce/12.0.3/binutils/x86_64/x86_64-pc-linux-gnu/bin/ld:
> >> > > warning: alignment 64 of symbol `$data_init$iso_c_binding_' in
> >> > > /opt/cray/pe/cce/12.0.3/cce/x86_64/lib/libmodules.so is smaller than
> >> 256 in
> >> > > /tmp/pe_46424/ex5f_1.o
> >> > > Possible error running Fortran example src/snes/tutorials/ex5f with 1
> >> MPI
> >> > > process
> >> > > See http://www.mcs.anl.gov/petsc/documentation/faq.html
> >> > >     0 KSP Residual norm < 1.e-11
> >> > >   Linear solve converged due to CONVERGED_ATOL iterations 0
> >> > >
> >> > > On Fri, Dec 10, 2021 at 8:07 AM Mark Adams <mfadams at lbl.gov> wrote:
> >> > >
> >> > > > I am trying to get Spock working (again) and am having problems.
> >> > > >
> >> > > > * make check seems to fail but it is hard to see what is going on.
> >> Maybe
> >> > > > we should start here, but let me continue.
> >> > > >
> >> > > > * GAMG seems to work on the CPU
> >> > > >
> >> > > > * I have this for configuring with Kokkos. I am guessing these
> >> versions
> >> > > > are out of date. What is current practice:
> >> > > >     '--with-kokkos-hip-arch=VEGA908',
> >> > > >     '--download-kokkos-commit=3.4.01',
> >> > > >     '--download-kokkos-kernels-commit=3.4.01',
> >> > > >
> >> > > > * Should I hold off (and tell my eager user to do same)?
> >> > > >
> >> > > > Thanks,
> >> > > > Mark
> >> > > >
> >> > >
> >> >
> >>
> >>
> 


