<div dir="ltr"><div class="gmail_default" style="font-size:small">Hi,</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">I am trying to test the parallel scalability of an iterative solver (CG with a BJacobi preconditioner) in PETSc.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Since the number of iterations increases with the number of processors, I calculated the time per iteration by dividing the total KSPSolve time by the number of iterations in this test.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">The linear system I'm solving has 315342 unknowns. Only the KSPSolve cost is analyzed.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">The results show that the parallelization works well with a small number of processes (fewer than 32 in my case), and scaling is almost perfect for the first 10 processors. </div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">However, the parallel efficiency degrades when I use more processors. The weird thing is that with more than 100 processors, the cost of a single iteration increases slightly.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">To investigate this issue, I then looked into the composition of the KSPSolve time.</div><div class="gmail_default" style="font-size:small">It seems KSPSolve consists of MatMult, VecTDot(min), VecNorm(min), VecAXPY(max), VecAYPX(max), and ApplyPC. Please correct me if I'm wrong.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">And I found that for a small number of processors, all these components scale well. 
</div><div class="gmail_default" style="font-size:small">However, with more processors (roughly more than 40), MatMult, VecTDot(min), and VecNorm(min) scale worse, and their times even increase beyond 100 processors, while the other three parts scale well even with 1000 processors.</div><div class="gmail_default" style="font-size:small">Since MatMult accounts for the major part of the cost of a single iteration, the total cost per iteration increases as well (see the figure below).</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">My questions:</div><div class="gmail_default" style="font-size:small">1. Is this behavior reasonable? Could anyone explain why MatMult scales poorly beyond a certain number of processors? I have heard that there are different algorithms for matrix multiplication. Is that the bottleneck?</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">2. Does the parallel scalability depend on the matrix size? If I use a larger system, e.g. a million unknowns, can the solver scale better?</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">3. Do you have any ideas for improving the parallel performance?</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">Thank you very much.</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small">JInlei</div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default"><img src="cid:ii_157076e42bc6f601" alt="Inline image 1" width="544" height="408"><br></div><div class="gmail_default" style="font-size:small"><br></div><div class="gmail_default" style="font-size:small"><br></div></div>