<div dir="ltr"> MAT_NO_OFF_PROC_ENTRIES        - you know each process will only set values for its own rows, will generate an error if any process sets values for another process. This avoids all reductions in the MatAssembly routines and thus improves performance for very large process counts.<br>OK, so I am seeing the whole matrix creation, including your flux calcs and your intermediate data structure, as taking 0.1 sec (10/100). That is about 7% of the solve time (but this looks like it could use some attention) or about 25 Mat-vecs (the standard work unit of an iterative solver).<div><br></div><div>Now I see 10% of the matrix creation time going to MatAssembly end, which is annoying because there is no communication.</div><div><br></div><div>I don't see any big problems here, except for the solver maybe, if this is a nice friendly Laplacian.</div><div><br></div><div>* I would try PETSc's Set Value routines directly.</div><div>* You might as well try  <a href="https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetOption.html">https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatSetOption.html</a></div><div><br></div><div>with</div><div><br></div><div> MAT_NO_OFF_PROC_ENTRIES      - you know each process will only set values for its own rows, will generate an error if any process sets values for another process. This avoids all reductions in the MatAssembly routines and thus improves performance for very large process counts.<br></div><div><br></div><div>This should eliminate the MatAssemblyEnd cost.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 13, 2019 at 2:43 PM Mohammed Mostafa <<a href="mailto:mo7ammedmostafa@gmail.com">mo7ammedmostafa@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div dir="auto">Sorry about that </div></div><div dir="auto">I wanted to see if the assembly cost would drop with subsequent time steps but it was taking too long to run so I set it to solve only once since I was only interested in profiling the matrix builder.</div><div dir="auto">Again sorry for that</div><div dir="auto">Kamra</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 14, 2019 at 3:33 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Ok, I only see one all to KSPSolve.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 13, 2019 at 2:08 PM Mohammed Mostafa <<a href="mailto:mo7ammedmostafa@gmail.com" target="_blank">mo7ammedmostafa@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div dir="auto">This log is for 100 time-steps, not a single time step</div></div><div dir="auto"><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 14, 2019 at 3:01 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">You call the assembly stuff a lot (200). BuildTwoSidedF is a global thing and is taking a lot of time. You should just call these once per time step (it looks like you are just doing one time step).<div><br><div><br><div><pre class="gmail-m_9148722191532272093m_3678564853903919150gmail-m_-3682458186440807404m_8116036447877508053gmail-aLF-aPX-K0-aPE" style="font-family:"Courier New",Courier,monospace,arial,sans-serif;margin-top:0px;margin-bottom:0px;white-space:pre-wrap;color:rgb(0,0,0);font-size:14px">--- Event Stage 1: Matrix Construction


BuildTwoSidedF       400 1.0 6.5222e-01 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   5  0  0  0  0     0

VecSet                 1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

VecAssemblyBegin     200 1.0 6.2633e-01 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   5  0  0  0  0     0

VecAssemblyEnd       200 1.0 6.7163e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

VecScatterBegin      200 1.0 5.9373e-03 2.2 0.00e+00 0.0 3.6e+03 2.1e+03 0.0e+00  0  0 79  2  0   0  0 99100  0     0

VecScatterEnd        200 1.0 2.7236e-0223.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

MatAssemblyBegin     200 1.0 3.2747e-02 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

MatAssemblyEnd       200 1.0 9.0972e-01 1.0 0.00e+00 0.0 3.6e+01 5.3e+02 8.0e+00  4  0  1  0  6   9  0  1  0100     0

AssembleMats         200 1.0 1.5568e+00 1.2 0.00e+00 0.0 3.6e+03 2.1e+03 8.0e+00  6  0 79  2  6  14  0100100100     0

myMatSetValues       200 1.0 2.5367e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 11  0  0  0  0  25  0  0  0  0     0

setNativeMat         100 1.0 2.8223e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 12  0  0  0  0  28  0  0  0  0     0

setNativeMatII       100 1.0 3.2174e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14  0  0  0  0  31  0  0  0  0     0

callScheme           100 1.0 2.0700e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0

</pre><br class="gmail-m_9148722191532272093m_3678564853903919150gmail-m_-3682458186440807404m_8116036447877508053gmail-Apple-interchange-newline"></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 12, 2019 at 11:56 PM Mohammed Mostafa via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hello Matt,</div><div>Attached is the dumped entire log  output using -log_view and -info.</div><div><br></div><div>Thanks,</div><div>Kamra<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 12, 2019 at 9:23 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">On Fri, Jul 12, 2019 at 5:19 AM Mohammed Mostafa via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello all,<div>I have a few question regarding Petsc,</div></div></blockquote><div><br></div><div>Please send the entire output of a run with all the logging turned on, using -log_view and -info.</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Question 1:</div><div>For the profiling , is it possible to only show the user defined log events in the breakdown of each stage in Log-view.</div><div>I tried deactivating all ClassIDs, MAT,VEC, PC, KSP,PC,</div><div> PetscLogEventExcludeClass(MAT_CLASSID);<br> PetscLogEventExcludeClass(VEC_CLASSID);<br>       PetscLogEventExcludeClass(KSP_CLASSID);<br>       PetscLogEventExcludeClass(PC_CLASSID);<br></div><div><span style="color:rgb(0,0,0);font-family:"Times New Roman";font-size:medium">which should "Deactivates event logging for a PETSc object class in every stage" according to the manual.</span><br></div><div>however I still see them in the stage breakdown </div><div>--- Event Stage 1: Matrix Construction<br><br>BuildTwoSidedF         4 1.0 2.7364e-02 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  18  0  0  0  0     0<br>VecSet                 1 1.0 4.5300e-06 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>VecAssemblyBegin       2 1.0 2.7344e-02 2.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  18  0  0  0  0     0<br>VecAssemblyEnd         2 1.0 8.3447e-06 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>VecScatterBegin        2 1.0 7.5102e-05 1.7 0.00e+00 0.0 3.6e+01 2.1e+03 0.0e+00  0  0  3  0  0   0  0 50 80  0     0<br>VecScatterEnd          2 1.0 3.5286e-05 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>MatAssemblyBegin       2 1.0 8.8930e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0<br>MatAssemblyEnd         2 1.0 1.3566e-02 1.1 0.00e+00 0.0 3.6e+01 5.3e+02 8.0e+00  0  0  3  0  6  10  0 50 20100     0<br>AssembleMats           2 1.0 3.9774e-02 1.7 0.00e+00 0.0 7.2e+01 1.3e+03 8.0e+00  0  0  7  0  6  28  0100100100     0  # USER EVENT<br>myMatSetValues         2 1.0 2.6931e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  19  0  0  0  0     0   # USER EVENT<br>setNativeMat           1 1.0 3.5613e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  24  0  0  0  0     0   # USER EVENT<br>setNativeMatII         1 1.0 4.7023e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  28  0  0  0  0     0   # USER EVENT<br>callScheme             1 1.0 2.2333e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  0  0  0  0     0   # USER EVENT<br></div><div><br></div><div>Also is possible to clear the logs so that I can write a  separate profiling output file for each timestep ( since I am solving a transient problem and I want to know the change in performance as time goes by )</div><div>----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br></div><div>Question 2:</div><div>Regarding MatSetValues</div><div>Right now, I writing a finite volume code, due to algorithm requirement I have to write the matrix into local native format ( array of arrays) and then loop through rows and use MatSetValues to set the elements in "Mat A"</div><div>MatSetValues(A, 1, &row, nj, j_index, coefvalues, INSERT_VALUES);<br></div><div>but it is very slow and it is killing my performance</div><div>although the matrix was properly set using </div><div>MatCreateAIJ(PETSC_COMM_WORLD, this->local_size, this->local_size, PETSC_DETERMINE,<br>       PETSC_DETERMINE, -1, d_nnz, -1, o_nnz, &A);<br></div><div>with d_nnz,and  o_nnz properly assigned so no mallocs occur during matsetvalues and all inserted values are local so no off-processor values</div><div>So my question is it possible to set multiple rows at once hopefully all, I checked the manual and MatSetValues can only set dense matrix block because it seems that row by row is expensive<br></div><div>Or perhaps is it possible to copy all rows to the underlying matrix data, as I mentioned all values are local and no off-processor values ( stash is 0 )</div><div>[0] VecAssemblyBegin_MPI_BTS(): Stash has 0 entries, uses 0 mallocs.<br>[0] VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.<br>[0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[2] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[3] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[4] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[5] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.<br>[2] MatAssemblyEnd_SeqAIJ(): Matrix size: 186064 X 186064; storage space: 0 unneeded,743028 used<br>[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,742972 used<br>[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.<br>[2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[2] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186064) < 0.6. Do not use CompressedRow routines.<br>[4] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 186063; storage space: 0 unneeded,743093 used<br>[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,743036 used<br>[4] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[4] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[4] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186063) < 0.6. Do not use CompressedRow routines.<br>[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[5] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 186062; storage space: 0 unneeded,742938 used<br>[5] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[5] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[5] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.<br>[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186062) < 0.6. Do not use CompressedRow routines.<br>[3] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 186063; storage space: 0 unneeded,743049 used<br>[3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 4<br>[3] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 186063) < 0.6. Do not use CompressedRow routines.<br>[2] MatAssemblyEnd_SeqAIJ(): Matrix size: 186064 X 685; storage space: 0 unneeded,685 used<br>[4] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 649; storage space: 0 unneeded,649 used<br>[4] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[4] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[4] MatCheckCompressedRow(): Found the ratio (num_zerorows 185414)/(num_localrows 186063) > 0.6. Use CompressedRow routines.<br>[2] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[2] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[2] MatCheckCompressedRow(): Found the ratio (num_zerorows 185379)/(num_localrows 186064) > 0.6. Use CompressedRow routines.<br>[1] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 1011; storage space: 0 unneeded,1011 used<br>[5] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 1137; storage space: 0 unneeded,1137 used<br>[5] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[5] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[5] MatCheckCompressedRow(): Found the ratio (num_zerorows 184925)/(num_localrows 186062) > 0.6. Use CompressedRow routines.<br>[1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[3] MatAssemblyEnd_SeqAIJ(): Matrix size: 186063 X 658; storage space: 0 unneeded,658 used<br>[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 186062 X 648; storage space: 0 unneeded,648 used<br>[1] MatCheckCompressedRow(): Found the ratio (num_zerorows 185051)/(num_localrows 186062) > 0.6. Use CompressedRow routines.<br>[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[0] MatCheckCompressedRow(): Found the ratio (num_zerorows 185414)/(num_localrows 186062) > 0.6. Use CompressedRow routines.<br>[3] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0<br>[3] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 1<br>[3] MatCheckCompressedRow(): Found the ratio (num_zerorows 185405)/(num_localrows 186063) > 0.6. Use CompressedRow routines.<br></div><div><br></div><div>----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</div><div>Question 3:</div><div>If all matrix and vector inserted data are local, what part of the vec/mat assembly consumes time because matsetvalues and matassembly consume more time than matrix builder<br></div><div>Also this is not just for the first time MAT_FINAL_ASSEMBLY</div><div><br></div><div><br></div><div>For context the matrix in the above is nearly 1Mx1M partitioned over six processes and it was NOT built using DM </div><div><br></div><div>Finally the configure options are:</div><div> </div><div>Configure options:</div><div>PETSC_ARCH=release3 -with-debugging=0 COPTFLAGS="-O3 -march=native -mtune=native" CXXOPTFLAGS="-O3 -march=native -mtune=native" FOPTFLAGS="-O3 -march=native -mtune=native" --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-metis --download-hypre<br></div><div><br></div><div>Sorry for such long question and thanks in advance</div><div>Thanks </div><div>M. Kamra</div></div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail-m_9148722191532272093m_3678564853903919150gmail-m_-3682458186440807404m_8116036447877508053gmail-m_5110792644255036960gmail-m_-2909258851924680987gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div><div><br></div><div><a href="http://www.cse.buffalo.edu/~knepley/" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br></div></div></div></div></div></div></div></div>

</blockquote></div>

</blockquote></div>

</blockquote></div></div>

</blockquote></div>

</blockquote></div></div>

</blockquote></div>