<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div><a href="https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html" class="">https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/asynchronous-progress-control.html</a><div class=""><br class=""></div><div class="">It states "<span style="caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px; background-color: rgb(226, 231, 235);" class="">and a partial support for non-blocking collectives (</span><span data-outputclass="codeph" class="ph systemoutput" style="box-sizing: border-box; caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px;">MPI_Ibcast</span><span style="caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px; background-color: rgb(226, 231, 235);" class="">, </span><span data-outputclass="codeph" class="ph systemoutput" style="box-sizing: border-box; caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px;">MPI_Ireduce</span><span style="caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px; background-color: rgb(226, 231, 235);" class="">, and </span><span data-outputclass="codeph" class="ph systemoutput" style="box-sizing: border-box; caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: 
intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px;">MPI_Iallreduce</span><span style="caret-color: rgb(85, 85, 85); color: rgb(85, 85, 85); font-family: intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif; font-size: 16px; background-color: rgb(226, 231, 235);" class="">)."  I do not know what "partial support" means, but you can try setting the environment variables described on that page and see if that helps.</span></div><div class=""><font color="#555555" face="intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif" size="3" class=""><span style="caret-color: rgb(85, 85, 85); background-color: rgb(226, 231, 235);" class=""><br class=""></span></font></div><div class=""><font color="#555555" face="intel-clear, tahoma, Helvetica, helvetica, Arial, sans-serif" size="3" class=""><span style="caret-color: rgb(85, 85, 85); background-color: rgb(226, 231, 235);" class=""><br class=""></span></font><div><br class=""><blockquote type="cite" class=""><div class="">On Jan 22, 2021, at 11:20 AM, Viet H.Q.H. <<a href="mailto:hqhviet@tohoku.ac.jp" class="">hqhviet@tohoku.ac.jp</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div dir="ltr" class=""><br class=""><div class="">Dear Victor and Barry,<br class=""><br class="">Thank you so much for your answers.<br class=""><br class="">I fixed the bug in the PetscCommSplitReductionBegin call, as Barry pointed out.<br class=""><br class="">   <font color="#0000ff" class="">  ierr = PetscCommSplitReductionBegin (PetscObjectComm ((PetscObject) u));</font><br class=""><br class="">It was also a mistake to set the vector size too small.<br class="">I have now set the vector size to 100000000 and run the code on 4 nodes with 2 processors per node. 
The results are as follows:<br class=""><br class="">The time used for the asynchronous calculation: <font color="#ff0000" class="">0.022043</font><br class="">+ | u | = 10000.<br class="">The time used for the synchronous calculation: <font color="#ff0000" class="">0.016188</font><br class="">+ | b | = 10000.<br class=""></div><div class=""><br class=""></div><div class="">The asynchronous computation still takes longer than the synchronous one.</div><div class=""><br class=""></div><div class="">I also confirmed that PETSC_HAVE_MPI_IALLREDUCE is defined in the file $PETSC_DIR/include/petscconf.h<br class=""><br class="">I built PETSc using the following script:</div><div class=""><br class=""></div><div class=""><font color="#0000ff" class="">#!/usr/bin/bash<br class="">set -e<br class="">DATE="21.01.18"<br class="">MPIIT_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mpi/intel64"<br class="">MKL_DIR="/work/A/intel/2018_update2/compilers_and_libraries_2018.2.199/linux/mkl"<br class="">INSTL_DIR="${HOME}/local/petsc-3.14.3"<br class="">BUILD_DIR="${HOME}/tmp/petsc/build_${DATE}"<br class="">PETSC_DIR="${HOME}/tmp/petsc"<br class=""><br class="">cd ${PETSC_DIR}<br class="">./configure --force --prefix=${INSTL_DIR} --with-mpi-dir=${MPIIT_DIR}  --with-fortran-bindings=0 --with-mpiexe=${MPIIT_DIR}/bin/mpiexec --with-valgrind-dir=${HOME}/local/valgrind --with-blaslapack-dir=${MKL_DIR} --download-make --with-debugging=0 COPTFLAGS='-O3 -march=native -mtune=native' CXXOPTFLAGS='-O3 -march=native -mtune=native' FOPTFLAGS='-O3 -march=native -mtune=native'<br class=""><br class="">make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt all<br class="">make PETSC_DIR=${HOME}/tmp/petsc PETSC_ARCH=arch-linux2-c-opt install </font><br class=""></div><div class=""><br class=""><br class="">Intel MPI 2018 also complies with the MPI-3 standard.<br class=""><br class="">Are there specific settings for Intel MPI to get better performance from MPI_Iallreduce?<br 
class=""></div><div class=""><br class=""></div><div class="">Sincerely,</div><div class="">Viet.</div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 22, 2021 at 11:20 AM Barry Smith <<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class=""><div class=""><br class=""></div><span style="color:rgb(0,0,255)" class=""> </span><span class=""> ierr = VecNormBegin(u,NORM_2,&norm1);</span><br class=""><span class="">    ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax)); </span><div class=""><span class=""><br class=""></span></div><div class=""><span class="">How come you call this on Ax and not on u? For clarity, if nothing else, I think you should call it on u.</span></div><div class=""><span class=""><br class=""></span></div><div class=""><span class="">comb.c has </span></div><div class=""><span class=""><br class=""></span></div><div class=""><div class="">/*</div><div class="">      Split phase global vector reductions with support for combining the</div><div class="">   communication portion of several operations. 
Using MPI-1.1 support only</div><div class=""><br class=""></div><div class="">      The idea for this and much of the initial code is contributed by</div><div class="">   Victor Eijkhout.</div><div class=""><br class=""></div><div class="">       Usage:</div><div class="">             VecDotBegin(Vec,Vec,PetscScalar *);</div><div class="">             VecNormBegin(Vec,NormType,PetscReal *);</div><div class="">             ....</div><div class="">             VecDotEnd(Vec,Vec,PetscScalar *);</div><div class="">             VecNormEnd(Vec,NormType,PetscReal *);</div><div class=""><br class=""></div><div class="">       Limitations:</div><div class="">         - The order of the xxxEnd() functions MUST be in the same order</div><div class="">           as the xxxBegin(). There is extensive error checking to try to</div><div class="">           insure that the user calls the routines in the correct order</div><div class="">*/</div><div class=""><br class=""></div><div class="">#include <petsc/private/vecimpl.h>    /*I   "petscvec.h"    I*/</div><div class=""><br class=""></div><div class="">static PetscErrorCode MPIPetsc_Iallreduce(void *sendbuf,void *recvbuf,PetscMPIInt count,MPI_Datatype datatype,MPI_Op op,MPI_Comm comm,MPI_Request *request)</div><div class="">{</div><div class="">  PETSC_UNUSED PetscErrorCode ierr;</div><div class=""><br class=""></div><div class="">  PetscFunctionBegin;</div><div class="">#if defined(PETSC_HAVE_MPI_IALLREDUCE)</div><div class="">  ierr = MPI_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRMPI(ierr);</div><div class="">#elif defined(PETSC_HAVE_MPIX_IALLREDUCE)</div><div class="">  ierr = MPIX_Iallreduce(sendbuf,recvbuf,count,datatype,op,comm,request);CHKERRQ(ierr);</div><div class="">#else</div><div class="">  ierr = MPIU_Allreduce(sendbuf,recvbuf,count,datatype,op,comm);CHKERRQ(ierr);</div><div class="">  *request = MPI_REQUEST_NULL;</div><div class="">#endif</div><div class="">  PetscFunctionReturn(0);</div><div 
class="">}</div><div style="color:rgb(0,0,255)" class=""><br class=""></div></div><div class=""><font color="#0000ff" class=""><span class=""><br class=""></span></font><div class="">So first check if $PETSC_DIR/include/petscconf.h has </div><div class=""><br class=""></div><div class=""><span class="">PETSC_HAVE_MPI_IALLREDUCE</span></div><div class=""><span class=""><br class=""></span></div><div class=""><span class="">if it does not then the standard MPI reduce is called. </span></div><div class=""><span class=""><br class=""></span></div><div class=""><span class="">If this is set then any improvement depends on the implementation of iallreduce inside the MPI you are using. </span></div><div class=""><span class=""><br class=""></span></div><div class=""><span class="">Barry</span></div><div class=""><span class=""><br class=""></span></div><div class=""><font color="#0000ff" class=""><span class=""><br class=""></span></font><blockquote type="cite" class=""><div class="">On Jan 21, 2021, at 6:52 AM, Viet H.Q.H. 
<<a href="mailto:hqhviet@tohoku.ac.jp" target="_blank" class="">hqhviet@tohoku.ac.jp</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">Hello Petsc developers and supporters,<br class=""><br class="">I would like to confirm the performance of asynchronous computations of inner product computation overlapping with matrix-vector multiplication computation by the below code.<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><font color="#0000ff" class=""> PetscLogDouble tt1,tt2;<br class="">    KSP ksp;<br class="">    //ierr = VecSet(c,one);<br class="">    ierr = VecSet(c,one);<br class="">    ierr = VecSet(u,one);<br class="">    ierr = VecSet(b,one);<br class=""><br class="">    ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);<br class="">    ierr = KSP_MatMult(ksp,A,x,Ax); CHKERRQ(ierr);<br class=""><br class=""><br class="">    ierr = PetscTime(&tt1);CHKERRQ(ierr);<br class="">    ierr = VecNormBegin(u,NORM_2,&norm1);<br class="">    ierr = PetscCommSplitReductionBegin(PetscObjectComm((PetscObject)Ax)); <br class="">    ierr = KSP_MatMult(ksp,A,c,Ac); <br class="">    ierr = VecNormEnd(u,NORM_2,&norm1);<br class="">    ierr = PetscTime(&tt2);CHKERRQ(ierr);<br class=""><br class="">    ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the asynchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);<br class="">    ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |u| =  %g\n",(double) norm1); CHKERRQ(ierr);<br class=""><br class=""><br class="">    ierr = PetscTime(&tt1);CHKERRQ(ierr);<br class="">    ierr = VecNorm(b,NORM_2,&norm2); CHKERRQ(ierr);<br class="">    ierr = KSP_MatMult(ksp,A,c,Ac); <br class="">    ierr = PetscTime(&tt2);CHKERRQ(ierr);<br class=""><br class="">    ierr = PetscPrintf(PETSC_COMM_WORLD, "The time used for the synchronous calculation: %f\n",tt2-tt1); CHKERRQ(ierr);<br class="">    ierr = PetscPrintf(PETSC_COMM_WORLD,"+ |b| =  %g\n",(double) 
norm2); CHKERRQ(ierr);<br class=""></font><div class=""><font color="#0000ff" class=""><br class=""></font></div><div class=""><br class=""></div><div class="">On a cluster with two or four nodes, the asynchronous computation is always much slower than synchronous computation.<br class=""><br class=""><font color="#ff0000" class="">The time used for the asynchronous calculation: 0.000203<br class="">+ |u| =  100.<br class="">The time used for the synchronous calculation: 0.000006<br class="">+ |b| =  100.</font><br class=""><br class="">Are there any necessary settings on MPI or Petsc to gain performance of asynchronous computation?<br class=""><br class="">Thank you very much for anything you can provide.<br class="">Sincerely,<br class="">Viet.<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div></div>

</div></blockquote></div><br class=""></div></div></blockquote></div></div>

</div></blockquote></div><br class=""></div></body></html>