<p>
Thank you very much for your analysis and suggestions. I agree with your point about the gap between theoretical and actual performance. I'll try switching to MPICH-3.4 in place of the MVAPICH-2.3.5 I have been using.
</p>
<p>
<br>
</p>
<p>
<br>
</p>
<p>
Thanks,
</p>
<p>
Gang
</p>
<br>
<br>
<blockquote name="replyContent" class="ReferenceQuote" style="padding-left:5px;margin-left:5px;border-left:#b6b6b6 2px solid;margin-right:0;">
-----Original Message-----<br>
<b>From:</b><span id="rc_from">"Barry Smith" <bsmith@petsc.dev></span><br>
<b>Sent:</b><span id="rc_senttime">2021-02-18 13:09:43 (Thursday)</span><br>
<b>To:</b> "赵刚" <zhaog6@lsec.cc.ac.cn><br>
<b>Cc:</b> PETSc <petsc-users@mcs.anl.gov><br>
<b>Subject:</b> Re: [petsc-users] An issue about pipelined CG and Gropp's CG<br>
<br>
<div class="">
<br class="">
</div>
Here are the important operations from the -log_view output (use a fixed-width font for easy reading).
<div class="">
<br class="">
</div>
<div class="">
No pipeline<br class="">
<div class="">
<br class="">
</div>
<div class="">
<div class="">
------------------------------------------------------------------------------------------------------------------------
</div>
<div class="">
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
</div>
<div class="">
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
</div>
<div class="">
------------------------------------------------------------------------------------------------------------------------
</div>
<div class="">
<br class="">
</div>
<div class="">
<div class="">
<div class="">
MatMult 5398 1.0 9.4707e+00 12.6 1.05e+09 1.1 3.6e+07 6.9e+02 0.0e+00 3 52 100 100 0 10 52 100 100 0 124335
</div>
<div class="">
VecTDot 10796 1.0 1.4993e+01 8.3 3.23e+08 1.1 0.0e+00 0.0e+00 1.1e+04 16 16 0 0 67 55 16 0 0 67 24172
</div>
<div class="">
VecNorm 5399 1.0 6.2343e+00 4.4 1.61e+08 1.1 0.0e+00 0.0e+00 5.4e+03 10 8 0 0 33 33 8 0 0 33 29073
</div>
<div class="">
VecAXPY 10796 1.0 1.1721e-01 1.4 3.23e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 16 0 0 0 1 16 0 0 0 3092074
</div>
<div class="">
VecAYPX 5397 1.0 5.4340e-02 1.4 1.61e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 3334231
</div>
<div class="">
VecScatterBegin 5398 1.0 5.4152e-02 3.3 0.00e+00 0.0 3.6e+07 6.9e+02 0.0e+00 0 0 100 100 0 0 0 100 100 0 0
</div>
<div class="">
VecScatterEnd 5398 1.0 8.6881e+00 489.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 6 0 0 0 0 0
</div>
<div class="">
KSPSolve 1 1.0 1.7389e+01 1.0 2.02e+09 1.1 3.6e+07 6.9e+02 1.6e+04 29 100 100 100 100 100 100 100 100 100 130242
</div>
</div>
<div class="">
<br class="">
</div>
</div>
<div>
Gropp pipeline
</div>
<div>
<br class="">
</div>
<div>
<div>
MatMult 5399 1.0 9.5593e+00 11.7 1.05e+09 1.1 3.6e+07 6.9e+02 0.0e+00 3 45 100 100 0 7 45 100 100 0 123207
</div>
<div>
VecNorm 1 1.0 8.8549e-04 17.4 2.99e+04 1.1 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 4 0 0 0 0 20 37912
</div>
<div>
VecAXPY 16194 1.0 1.6522e-01 1.4 4.84e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 21 0 0 0 0 21 0 0 0 3290407
</div>
<div>
VecAYPX 10794 1.0 1.9903e-01 1.5 3.23e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 14 0 0 0 1 14 0 0 0 1820606
</div>
<div>
VecScatterBegin 5399 1.0 6.2281e-02 3.6 0.00e+00 0.0 3.6e+07 6.9e+02 0.0e+00 0 0 100 100 0 0 0 100 100 0 0
</div>
<div>
VecScatterEnd 5399 1.0 8.7194e+00 380.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 4 0 0 0 0 0
</div>
<div>
VecReduceArith 16195 1.0 2.2674e-01 3.7 4.84e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 21 0 0 0 0 21 0 0 0 2397678
</div>
<div>
VecReduceBegin 10797 1.0 3.4089e-02 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
</div>
<div>
VecReduceEnd 10797 1.0 2.6197e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 37 0 0 0 0 91 0 0 0 0 0
</div>
<div>
SFBcastOpBegin 5399 1.0 6.0051e-02 4.1 0.00e+00 0.0 3.6e+07 6.9e+02 0.0e+00 0 0 100 100 0 0 0 100 100 0 0
</div>
<div>
SFBcastOpEnd 5399 1.0 8.7167e+00 440.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 4 0 0 0 0 0
</div>
<div>
KSPSolve 1 1.0 2.7477e+01 1.0 2.34e+09 1.1 3.6e+07 6.9e+02 1.0e+00 41 100 100 100 4 100 100 100 100 20 95623
</div>
<div>
<br class="">
</div>
<div>
pipeline cg
</div>
<div>
<br class="">
</div>
<div>
<div>
MatMult 5400 1.0 1.5915e+00 1.8 1.05e+09 1.1 3.6e+07 6.9e+02 0.0e+00 2 37 100 100 0 6 37 100 100 0 740161
</div>
<div>
VecAXPY 21592 1.0 2.3194e-01 1.4 6.45e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 23 0 0 0 1 23 0 0 0 3125164
</div>
<div>
VecAYPX 21588 1.0 5.5059e-01 1.7 6.45e+08 1.1 0.0e+00 0.0e+00 0.0e+00 1 23 0 0 0 2 23 0 0 0 1316272
</div>
<div>
VecScatterBegin 5400 1.0 7.0132e-02 3.7 0.00e+00 0.0 3.6e+07 6.9e+02 0.0e+00 0 0 100 100 0 0 0 100 100 0 0
</div>
<div>
VecScatterEnd 5400 1.0 6.5329e-01 22.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
</div>
<div>
VecReduceArith 16197 1.0 3.1135e-01 4.7 4.84e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 17 0 0 0 1 17 0 0 0 1746339
</div>
<div>
VecReduceBegin 5400 1.0 3.1471e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
</div>
<div>
VecReduceEnd 5400 1.0 1.7226e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 28 0 0 0 0 90 0 0 0 0 0
</div>
<div>
SFBcastOpBegin 5400 1.0 6.6228e-02 4.1 0.00e+00 0.0 3.6e+07 6.9e+02 0.0e+00 0 0 100 100 0 0 0 100 100 0 0
</div>
<div>
SFBcastOpEnd 5400 1.0 6.5000e-01 24.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
</div>
<div>
KSPSolve 1 1.0 1.8893e+01 1.0 2.82e+09 1.1 3.6e+07 6.9e+02 0.0e+00 32 100 100 100 0 100 100 100 100 0 167860
</div>
<div>
<br class="">
</div>
<div>
With pipelined methods, VecTDot and VecNorm are replaced by VecReduceArith, VecReduceBegin, and VecReduceEnd. The important numbers are the %T values in the stage columns.
</div>
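<div>
To make the role of these split-phase reductions concrete, here is a minimal sketch of unpreconditioned pipelined CG in the Ghysels and Vanroose form, written in plain Python on a 1D Poisson matrix. It is an illustration under stated assumptions, not PETSc's implementation: the helper names (matvec, dot, axpy, pipelined_cg), the problem size and the tolerance are all made up for this example, and plain lists stand in for distributed vectors. The point is that each iteration needs a single fused reduction, which a parallel code would start before the matrix-vector product and complete after it; this matches the one VecReduceBegin per MatMult seen in the pipecg log.
</div>

```python
# Sketch of unpreconditioned pipelined CG on a 1D Poisson matrix.
# Plain Python lists stand in for distributed vectors; this is not PETSc code.

def matvec(v):
    """Apply the tridiagonal matrix (-1, 2, -1) with Dirichlet ends."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i + 1 != n else 0.0)
            for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Return y + alpha * x as a new list."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def pipelined_cg(b, rtol=1e-10, max_it=500):
    n = len(b)
    x = [0.0] * n
    r = list(b)                  # r = b - A x with x = 0
    w = matvec(r)                # w = A r
    p = s = z = [0.0] * n
    gamma_old = alpha = 1.0
    gamma0 = dot(r, r)
    reductions = 0
    for it in range(max_it):
        # One fused reduction per iteration. On a parallel machine it would
        # be started here and completed only after the matvec below.
        gamma, delta = dot(r, r), dot(w, r)
        reductions += 1
        if rtol * rtol * gamma0 >= gamma:
            break
        q = matvec(w)            # work that can hide the reduction latency
        if it == 0:
            beta, alpha = 0.0, gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        z = axpy(beta, z, q)     # z = q + beta z
        s = axpy(beta, s, w)     # s = w + beta s
        p = axpy(beta, p, r)     # p = r + beta p
        x = axpy(alpha, p, x)    # x = x + alpha p
        r = axpy(-alpha, s, r)   # r = r - alpha s
        w = axpy(-alpha, z, w)   # w = w - alpha z
        gamma_old = gamma
    return x, it, reductions

n = 50
b = [1.0] * n
x, iterations, reductions = pipelined_cg(b)
residual = max(abs(bi - ai) for bi, ai in zip(b, matvec(x)))
print(iterations, reductions, residual)
```

<div>
Whether the overlap actually pays off depends on the MPI library progressing the reduction in the background while the matrix-vector product runs; the sketch only shows where that overlap is possible.
</div>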
<div>
<br class="">
</div>
<div>
In particular, look at VecTDot and VecNorm and compare them to VecReduceEnd in the pipelined methods. Note that both pipelined methods, especially the Gropp method, spend an enormous amount of time in VecReduceEnd (about 90% of the stage time in each case) and hence end up taking more time than the non-pipelined method. So basically any advantage the pipelined methods may have is lost waiting for the previous reduction operation to arrive. I do not know whether this is due to the MPI implementation or something else.
</div>
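<div>
As a rough check of this comparison, the following minimal Python sketch recomputes the slowdown of each pipelined run relative to plain CG and the share of stage time spent in reductions. The numbers are copied from the KSPSolve, VecTDot, VecNorm and VecReduceEnd rows of the three logs; the dictionary layout and variable names are only illustrative.
</div>

```python
# Figures copied from the three -log_view excerpts (KSPSolve stage).
# time_s is the KSPSolve max time; reduce_pct sums the stage %T of the
# reduction events (VecTDot + VecNorm for cg, VecReduceEnd for the others).
runs = {
    "cg":      {"time_s": 17.389, "reduce_pct": 55 + 33},
    "groppcg": {"time_s": 27.477, "reduce_pct": 91},
    "pipecg":  {"time_s": 18.893, "reduce_pct": 90},
}

baseline = runs["cg"]["time_s"]
slowdown = {name: r["time_s"] / baseline for name, r in runs.items()}

for name, r in runs.items():
    print(f"{name:8s} {r['time_s']:7.3f} s  "
          f"{slowdown[name]:.2f}x cg  "
          f"reductions ~{r['reduce_pct']}% of stage")
```

<div>
With these figures, groppcg takes about 1.58 times as long as cg and pipecg about 1.09 times, while reductions account for roughly 90% of the stage time in all three runs.
</div>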
<div>
<br class="">
</div>
<div>
If you are serious about understanding pipelined Krylov methods, you will need to dig deep into the details of the machine hardware and MPI software. It is not a trivial subject with easy answers. I would say that the PETSc community are not experts on the topic; you will need to read the publications on pipelined methods in detail and consult with the authors on technical, machine-specific details. There is a difference between the academic "pipelining as a theoretical construct" and an actual dramatic improvement on real machines while using pipelining. One small implementation detail can dramatically change performance, so theoretical papers alone are not the complete story.
</div>
<div>
<br class="">
</div>
<div>
<br class="">
</div>
<div>
Barry
</div>
<div>
<br class="">
</div>
</div>
<blockquote type="cite" class="">
<div class="">
On Feb 17, 2021, at 10:31 PM, 赵刚 <<a href="mailto:zhaog6@lsec.cc.ac.cn" class="">zhaog6@lsec.cc.ac.cn</a>> wrote:
</div>
<br class="Apple-interchange-newline">
<div class="">
<p class="">
Dear Barry,
</p>
<p class="">
<br>
</p>
<p class="">
Thank you. My cluster uses MVAPICH-2.3.5 as its default MPI. I added PetscLogStagePush("Calling KSPSolve()...") and PetscLogStagePop(). I call PETSc only to solve linear systems, through the PETSc interface provided by another numerical software package, so I am not sure whether I added the stage correctly. I have attached the results and related information; please take a look.
</p>
<p class="">
<br>
</p>
<p class="">
<br>
</p>
<p class="">
Thanks,
</p>
<p class="">
Gang
</p>
<br class="">
<br class="">
<blockquote name="replyContent" class="ReferenceQuote" style="padding-left:5px;margin-left:5px;border-left:#b6b6b6 2px solid;margin-right:0;">
-----Original Message-----<br class="">
<b class="">From:</b><span id="rc_from" class="">"Barry Smith" <<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>></span><br class="">
<b class="">Sent:</b><span id="rc_senttime" class="">2021-02-18 10:52:11 (Thursday)</span><br class="">
<b class="">To:</b> "赵刚" <<a href="mailto:zhaog6@lsec.cc.ac.cn" class="">zhaog6@lsec.cc.ac.cn</a>><br class="">
<b class="">Cc:</b> PETSc <<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>><br class="">
<b class="">Subject:</b> Re: [petsc-users] An issue about pipelined CG and Gropp's CG<br class="">
<br class="">
<div class="">
<br class="">
</div>
<div class="">
First, please see <a href="https://www.mcs.anl.gov/petsc/documentation/faq.html#pipelined" class="">https://www.mcs.anl.gov/petsc/documentation/faq.html#pipelined</a> and verify that the MPI you are using satisfies the requirements and that you have the appropriate MPI environment variables set (if needed).
</div>
<div class="">
<br class="">
</div>
<div class="">
<br class="">
</div>
Then please add a stage around the actual computation to get a more useful summary.
<div class="">
<br class="">
</div>
<div class="">
Organize your code like so
</div>
<div class="">
<br class="">
</div>
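<div class="">
...
</div>
<div class="">
PetscLogStage stage; /* handle for the stage you create */
</div>
<div class="">
PetscLogStageRegister("KSPSolve", &amp;stage); /* the stage name is an arbitrary example */
</div>
<div class="">
KSPSetUp(ksp); /* ksp, b and x stand for your existing solver and vectors */
</div>
<div class="">
PetscLogStagePush(stage);
</div>
<div class="">
KSPSolve(ksp, b, x);
</div>
<div class="">
PetscLogStagePop();
</div>
<div class="">
...
</div>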
<div class="">
<br class="">
<div class="">
It is unclear where much of your code's time is being spent; adding the stage will give us a clear picture of the time spent in the actual solver. There are examples of using PetscLogStagePush() in the source.
</div>
<div class="">
<br class="">
</div>
<div class="">
With the new -log_view files you generate with these two changes we can get a handle on where the time is being spent and why the pipelining is or is not helping.
</div>
<div class="">
<br class="">
</div>
<div class="">
Barry
</div>
<div class="">
<br class="">
<blockquote type="cite" class="">
<div class="">
On Feb 17, 2021, at 8:31 PM, 赵刚 <<a href="mailto:zhaog6@lsec.cc.ac.cn" class="">zhaog6@lsec.cc.ac.cn</a>> wrote:
</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">
Dear Barry,<br class="">
<br class="">
Thank you for your prompt reply. I ran a problem with ~16M DOFs on 32 nodes (36 cores per node), but CG appears to be faster than both pipelined CG and Gropp's CG. I am puzzled and have not figured out why. I have attached the performance output; please take a look.<br class="">
<br class="">
<br class="">
<br class="">
Thanks,<br class="">
Gang<br class="">
<br class="">
<br class="">
> -----Original Message-----<br class="">
> From: "Barry Smith" <<a href="mailto:bsmith@petsc.dev" class="">bsmith@petsc.dev</a>><br class="">
> Sent: 2021-02-18 09:17:17 (Thursday)<br class="">
> To: "赵刚" <<a href="mailto:zhaog6@lsec.cc.ac.cn" class="">zhaog6@lsec.cc.ac.cn</a>><br class="">
> Cc: PETSc <<a href="mailto:petsc-users@mcs.anl.gov" class="">petsc-users@mcs.anl.gov</a>><br class="">
> Subject: Re: [petsc-users] An issue about pipelined CG and Gropp's CG<br class="">
> <br class="">
> <br class="">
> <br class="">
> > On Feb 17, 2021, at 6:47 PM, 赵刚 <<a href="mailto:zhaog6@lsec.cc.ac.cn" class="">zhaog6@lsec.cc.ac.cn</a>> wrote:<br class="">
> > <br class="">
> > Dear PETSc team,<br class="">
> > <br class="">
> > I am interested in pipelined CG (-ksp_type pipecg) and Gropp's CG (-ksp_type groppcg). These pipelined methods are expected to have an advantage over traditional CG when running on many processes. For a Poisson problem, how many compute nodes do I need in order to see the advantage of pipelined CG or Gropp's CG over CG (with no preconditioner)?<br class="">
> > <br class="">
> > Currently I can use at most 32 nodes (36 cores per node) on my cluster, but neither "pipecg" nor "groppcg" seems to have any advantage over "cg" when I solve the Poisson equation with homogeneous Dirichlet BC on [0, 1]^2 (keeping 20K~60K DOFs per process). I guess the reason is that there are too few compute nodes.<br class="">
> <br class="">
> 900 cores (assuming they are not memory bandwidth bound) might be enough to see some differences but the differences are likely so small compared to other parallel issues that affect performance that you see no consistently measurable difference.<br class="">
> <br class="">
> Run three cases with -log_view (no pipelining and the two pipelined methods) and send the output. By studying where the time is spent in the different regions of the code, one may be able to say something about the effect of pipelining.<br class="">
> <br class="">
> Barry<br class="">
> <br class="">
> <br class="">
> > <br class="">
> > Because I call PETSc through another numerical software package, I can, if needed, send you the relevant performance information using the command-line options PETSc suggests. Thank you.<br class="">
> > <br class="">
> > <br class="">
> > Thanks,<br class="">
> > Gang<br class="">
<span id="cid:183022BF-624E-4B80-9D40-D6D636C3C13C" class=""><cg.out></span><span id="cid:25834817-6AAF-45C3-8744-3E1014D4B3F1" class=""><groppcg.out></span><span id="cid:0E3B6B32-70D7-4500-BF88-104498A1A973" class=""><pipecg.out></span>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
<span id="cid:77AC2314-C946-417F-A80D-D73E0330E51D"><cg.out></span><span id="cid:EDD620E7-6407-4016-84C2-B216F441A397"><groppcg.out></span><span id="cid:F38CC2E5-21E1-4A24-96AB-FC59673D29CB"><pipecg.out></span>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>