<div dir="ltr"><div>Aha - they already know. A commit from April in their main repo fixes exactly this problem. The fix is present in the 4.0.a.2 alpha, but</div><div>it looks like we will have to wait for a new release version that includes it.</div><div><br></div><div><a href="https://github.com/pmodels/yaksa/commit/eed193d9775dd0f33cbd8caa0dd946647b751b18#diff-f5310b2c9b83ad225b424c6ab70b970c2c57a4db39daf7b4f8c017df92646c84">https://github.com/pmodels/yaksa/commit/eed193d9775dd0f33cbd8caa0dd946647b751b18#diff-f5310b2c9b83ad225b424c6ab70b970c2c57a4db39daf7b4f8c017df92646c84</a></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 8, 2021 at 7:36 PM Satish Balay &lt;<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">This looks like a bug report for MPICH.<br>
<br>
You should be able to reproduce this without PETSc - by directly building MPICH.<br>
<br>
And then report to MPICH developers.<br>
<br>
Wrt petsc - you could try --download-openmpi [instead of --download-mpich] and see if that works better.<br>
<br>
Yeah, CUDA obtained from NVIDIA and CUDA repackaged by Ubuntu have subtle differences that can cause CUDA failures.<br>
<br>
As you say - mpich might have a configure option to disable this. If you are able to find this option - you can use it via petsc configure with:<br>
<br>
--download-mpich-configure-arguments=string<br>
<br>
Satish<br>
<br>
On Mon, 8 Nov 2021, Daniel Stone wrote:<br>
<br>
> Hello all,<br>
> <br>
> I've been having some configure failures trying to configure petsc, on<br>
> Ubuntu 20, when<br>
> downloading mpich.<br>
> <br>
> <br>
> This seems to be related to the use of<br>
> "#!/bin/sh"<br>
> found in the script<br>
> mpich-3.4.2/modules/yaksa/src/backend/cuda/cudalt.sh<br>
> <br>
> /bin/sh in Ubuntu20 is dash, not bash, and line 35 of the script is:<br>
> CMD="${@:2} -Xcompiler -fPIC -o $PIC_FILEPATH"<br>
> which is apparently not valid dash syntax. I see "bad substitution" errors<br>
> when<br>
> trying to run this script in isolation, which can be fixed by replacing the<br>
> top line with<br>
> <br>
> "#!/bin/bash"<br>
> <br>
> The petsc config log points to this line in this script:<br>
> <br>
> ------------------------------------------------------------------------------------<br>
> <br>
> make[2]: Entering directory<br>
> '/home/david/petsc/petsc_opt/externalpackages/mpich-3.4.2/modules/yaksa'<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector_hvector__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector_blkhindx__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector_hindexed__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector_contig__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx_hvector__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_hvector_resized__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx_blkhindx__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx_hindexed__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx_contig__Bool.lo<br>
> NVCC src/backend/cuda/pup/yaksuri_cudai_pup_blkhindx_resized__Bool.lo<br>
> make[2]: Leaving directory<br>
> '/home/david/petsc/petsc_opt/externalpackages/mpich-3.4.2/modules/yaksa'<br>
> make[1]: Leaving directory<br>
> '/home/david/petsc/petsc_opt/externalpackages/mpich-3.4.2'<br>
> /usr/bin/ar: `u' modifier ignored since `D' is the default (see `U')<br>
> /usr/bin/ar: `u' modifier ignored since `D' is the default (see `U')<br>
> /usr/bin/ar: `u' modifier ignored since `D' is the default (see `U')<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> ./src/backend/cuda/cudalt.sh: 35: Bad substitution<br>
> make[2]: *** [Makefile:8697:<br>
> src/backend/cuda/pup/yaksuri_cudai_pup_hvector__Bool.lo] Error 2<br>
> make[2]: *** Waiting for unfinished jobs....<br>
> make[2]: *** [Makefile:8697:<br>
> src/backend/cuda/pup/yaksuri_cudai_pup_hvector_hvector__Bool.lo] Error 2<br>
> make[2]: *** [Makefile:8697:<br>
> src/backend/cuda/pup/yaksuri_cudai_pup_hvector_blkhindx__Bool.lo] Error 2<br>
> <br>
> ----------------------------------------------------------------------------<br>
> <br>
> What is interesting is the choice made by the config script here to make<br>
> yaksa "cuda-aware", which I do not<br>
> understand how to control. By this I mean - the use of NVCC, the use of<br>
> files with "cudai" in the name,<br>
> and the running of the cudalt.sh script.<br>
> <br>
> This is especially odd given that on another machine, also with Ubuntu20,<br>
> none of this occurs, despite<br>
> using the exact same configure instructions:<br>
> <br>
> ---------------------------------------------------<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_blkhindx_resized__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed_hvector__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed_blkhindx__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed_hindexed__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed_contig__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_hindexed_resized__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig_hvector__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig_blkhindx__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig_hindexed__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig_contig__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_contig_resized__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized_hvector__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized_blkhindx__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized_hindexed__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized_contig__Bool.lo<br>
> CC src/backend/seq/pup/yaksuri_seqi_pup_resized_resized__Bool.lo<br>
> <br>
> <br>
> -------------------------------------------------------<br>
> <br>
> On both machines petsc locates a working nvcc without trouble. One<br>
> difference is<br>
> that on the good machine, cuda 11 is installed, built from source, while on<br>
> the bad machine<br>
> cuda 10 is installed, installed via apt-get. On the good machine, running<br>
> the cudalt.sh script in<br>
> isolation results in the same bad substitution error.<br>
> <br>
> Can someone help me understand where the difference in behaviour might come<br>
> from, w.r.t.<br>
> cuda? Adding the --with-cuda=0 flag on the bad machine made no difference.<br>
> Is there any way<br>
> of telling mpich/yaksa that I don't want the features that<br>
> involve running the cudalt.sh<br>
> script?<br>
> <br>
> The configure command I use is:<br>
> <br>
> ./configure --download-mpich=yes --download-hdf5=yes<br>
> --download-fblaslapack=yes --download-metis=yes --download-cmake=yes<br>
> --download-ptscotch=yes --download-hypre=yes --with-debugging=0<br>
> COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3<br>
> -download-hdf5-fortran-bindings=yes --download-sowing<br>
> <br>
> <br>
> <br>
> Thanks,<br>
> <br>
> Daniel<br>
> <br>
<br>
</blockquote></div>
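<div>For reference, the "Bad substitution" error in the quoted log comes from the "${@:2}" slice, which is a bash extension that dash does not implement. Below is a minimal sketch of a POSIX-only way to get "every argument after the first". The function name rest_args is hypothetical and is not taken from yaksa or from the upstream fix; it only illustrates the portability issue.</div>

```shell
#!/bin/sh
# Hypothetical helper (not part of yaksa): print every argument after the
# first one, using only POSIX features so dash and bash behave the same.
rest_args() {
    # Guard: calling shift with no positional parameters is an error in dash.
    if [ "$#" -gt 0 ]; then
        shift                  # drop $1; "$@" now holds what bash calls "${@:2}"
    fi
    printf '%s\n' "$*"         # join the remaining arguments with spaces
}

# Example call: prints "-c foo.cu"
rest_args nvcc -c foo.cu
```

<div>Changing the first line of the script to "#!/bin/bash", as described in the quoted message, avoids the problem as well; the sketch above is only useful if the script has to keep running under /bin/sh.</div>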