<div dir="ltr">Thanks. I apologize for any stupid assumptions/mistakes. One thing that
became clear to me was to get the machinery working first, improve
later. I respond to each of your questions below them.<br><br><div class="gmail_quote"><div dir="ltr">On Wed, Jul 8, 2015 at 4:27 PM Barry Smith <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
>    Please provide a bit more detail about the operator. Is it a pressure solver for CFD? Cell centered? Does the matrix have a null space of the constant functions?

The operator: yes, it's a pressure solver for a PISO/SIMPLE type algorithm. The underlying code is OpenFOAM.

Regarding the null space of constant functions, let me verify that I understand: you mean that A applied to a constant vector gives zero, so that if Ax = b then A(x + c) = b for any constant offset c, right? The basic equation is a Poisson equation and contains only the second derivative, so if the entire solution were offset by a constant it would still satisfy the equations. However, the problem does have Dirichlet boundary conditions on one boundary, so I'm thinking there is no null space.
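(If needed, I could verify this numerically; a minimal sketch, assuming A and b are the assembled matrix and right-hand side from the code quoted at the bottom, and with ones/Aones/nrm as throwaway names:)

  /* Apply A to the constant vector; a (near-)zero norm would mean the
     constants are in the null space.                                   */
  Vec       ones, Aones;
  PetscReal nrm;
  VecDuplicate(b, &ones);
  VecDuplicate(b, &Aones);
  VecSet(ones, 1.0);
  MatMult(A, ones, Aones);
  VecNorm(Aones, NORM_2, &nrm);
  PetscPrintf(PETSC_COMM_WORLD, "||A*1|| = %g\n", (double)nrm);
  VecDestroy(&ones);
  VecDestroy(&Aones);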
>    Is it the same linear system for each "time-step" in the CFD solver or different?

Different. The structure of the matrix doesn't change every time step (for now), but the values do.
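(Given the fixed sparsity pattern, my assumed per-time-step workflow is roughly the sketch below, reusing the same Mat/KSP objects and only refreshing the values; whether keeping the preconditioner across steps with KSPSetReusePreconditioner() is acceptable here is an untested assumption on my part:)

  /* Sketch of one time step: same Mat/KSP, only the values change. */
  /* ... MatSetValues()/VecSetValues() with the new coefficients ... */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  KSPSetOperators(ksp, A, A);                  /* new values, same nonzero pattern            */
  KSPSetReusePreconditioner(ksp, PETSC_TRUE);  /* optional, untested: skip rebuilding the AMG */
  KSPSolve(ksp, b, x);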
>    How many iterations is hypre BoomerAMG typically taking?

Around 5 after I relaxed the tolerances (it used to be around 10-13 with the tighter tolerances). I solve the system of equations 3 times per "time-step", as shown here:

Mat Object: 256 MPI processes
  type: mpiaij
  rows=12254823, cols=12254823
  total: nonzeros=7.65502e+07, allocated nonzeros=1.83822e+08
  total number of mallocs used during MatSetValues calls =0
    not using I-node (on process 0) routines
simple.corrNonOrtho() = 1 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 0.000816181009378
Number of iterations = 4
simple.corrNonOrtho() = 2 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 0.00118562245298
Number of iterations = 3
simple.corrNonOrtho() = 3 simple.nNonOrthCorr() = 2
  1 KSP Residual norm 1.000000000000e+00
End residual = 1.15194716042e-06
Number of iterations = 5
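(A side note on the single "1 KSP Residual norm" line per solve above: in the code quoted at the bottom I call KSPMonitorDefault() directly with fixed arguments, which I believe just prints that one line rather than tracking the iterations. My understanding, as an untested sketch against PETSc 3.5, is that the monitor should instead be registered so the KSP invokes it every iteration:)

  /* Sketch, assuming PETSc 3.5 signatures: let the KSP call the monitor each iteration. */
  KSPMonitorSet(ksp, KSPMonitorDefault, NULL, NULL);
  /* or, since KSPSetFromOptions() is called, simply run with -ksp_monitor */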
>    This is an extremely tight tolerance, why do you set it so small? KSPSetTolerances(ksp, 1e-13, 1e-13, PETSC_DEFAULT, PETSC_DEFAULT);

Well, that was just a first start; I wanted to make sure it worked. But relaxing the required tolerance to 1e-5, 1e-5 also doesn't change the run time by much: 5 "time-steps" now take 250 s compared to 260 s earlier. As a comparison, I can use OpenFOAM's built-in PCG solver + DIC preconditioner to do the same in about 13 s (thirteen). Sure, the convergence tolerances are not quite the same, but I don't think that's the issue here. I'm trying to do one better than OpenFOAM's existing solver suite, and I think a good first step would be to match it. Technically I'm comparing apples and oranges by comparing PCG + DIC vs. FGMRES + AMG; however, I was hoping I could do better. I'm open to changing pretty much anything.
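(For reference, the relaxed setting corresponds to the call below; and since KSPSetFromOptions() is already in the code, switching the Krylov method at runtime, e.g. trying CG given the symmetric matrix, is a possible experiment I have not benchmarked:)

  /* relaxed tolerances used for the 250 s timing above */
  KSPSetTolerances(ksp, 1.0e-5, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT);
  /* possible runtime experiment (not benchmarked): -ksp_type cg */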
>    Send the output from a run with -log_summary so we can see where the time is being spent.

Thanks for this. This is below.

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

/scratch/02504/ganesh10/OpenFOAM/ganesh10-2.1.x/platforms/linux64IccDPOpt/bin/SRFSimpleFoamPETSC on a sandybridge named c415-502.stampede.tacc.utexas.edu with 256 processors, by ganesh10 Thu Jul  9 13:29:19 2015
Using Petsc Release Version 3.5.3, Jan, 31, 2015

                         Max       Max/Min        Avg      Total
Time (sec):           2.473e+02      1.00143   2.472e+02
Objects:              5.210e+02      1.00000   5.210e+02
Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
Flops/sec:            5.086e+05      1.17366   4.790e+05  1.226e+08
MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
MPI Reductions:       3.680e+02      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 2.4720e+02 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.670e+02  99.7%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector   484            482    184328584     0
      Vector Scatter     1              1         1060     0
              Matrix     3              3     10175808     0
           Index Set     2              2        38884     0
              Viewer     1              0            0     0
       Krylov Solver    15             15       284400     0
      Preconditioner    15             15        16440     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 1.64032e-05
Average time for zero size MPI_Send(): 0.000232594
#PETSc Option Table entries:
-info blah
-log_summary
-mat_view ::ascii_info
-parallel
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-x=0 -with-pic --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 --with-scalar-type=real --with-shared-libraries=1 --with-precision=double --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-spai=1 --download-spai --with-mumps=1 --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 --with-debugging=no --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
-----------------------------------------
Libraries compiled on Thu Apr  2 10:06:57 2015 on staff.stampede.tacc.utexas.edu
Machine characteristics: Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
Using PETSc arch: sandybridge
-----------------------------------------

Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc  -fPIC -wd1572 ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90  -fPIC ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------

Using include paths: -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include -I/opt/apps/intel13/mvapich2/1.9/include

ganesh

>
>   Barry
>
>
> > On Jul 8, 2015, at 11:11 AM, Ganesh Vijayakumar <ganesh.iitm@gmail.com> wrote:
> >
> > Hello,
> >
> > First of all, thanks to the PETSc developers and everyone else contributing supporting material on the web.
> >
> > I need to solve a system of equations Ax = b, where A is symmetric, sparse, and unstructured, in parallel, as part of a finite volume solver for CFD. The domain decomposition is already done. I initialize the matrix as MPIAIJ even though it's symmetric, as that's the only thing I could get to work.
> >
> > MatCreate(PETSC_COMM_WORLD,&A);
> > MatSetType(A,MATMPIAIJ);
> > MatSetSizes(A,nCellsCurProc,nCellsCurProc,nTotalCells,nTotalCells);
> > MatMPIAIJSetPreallocation(A,10,PETSC_NULL,5,PETSC_NULL);
> > MatSetUp(A);
> > // Assemble matrix using MatSetValue.. assemble full matrix even though it's symmetric.
> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> > //Create solver
> > KSPCreate(PETSC_COMM_WORLD,&ksp);
> > KSPSetType(ksp,KSPFGMRES);
> > KSPGetPC(ksp, &pc);
> > PCFactorSetShiftType(pc, MAT_SHIFT_POSITIVE_DEFINITE);
> >
> > //PCSetType(pc, PCML); //Trilinos - Multilevel
> >
> > //PCSetType(pc, PCBJACOBI); //Block Jacobi with incomplete LU on the blocks
> > //PetscOptionsSetValue("-sub_pc_type", "ilu");
> > KSPSetFromOptions(ksp);
> >
> > // PCSetType(pc, PCASM); //Additive Schwarz Methods
> > // PCASMSetOverlap(pc,overlap);
> >
> > PCSetType(pc, PCHYPRE); //HYPRE
> > PCHYPRESetType(pc, "boomeramg");
> >
> > //PCHYPRESetType(pc, "parasails");
> >
> > KSPSetPC(ksp, pc);
> > KSPSetTolerances(ksp, 1e-13, 1e-13, PETSC_DEFAULT, PETSC_DEFAULT);
> > KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
> > KSPSetOperators(ksp, A, A);
> > KSPMonitorDefault(ksp, 1, 1, pVoid);
> >
> > KSPSolve(ksp,b,x);
> > VecGetArray(x, &getResultArray);
> > KSPGetResidualNorm(ksp, &endNorm);
> > KSPGetIterationNumber(ksp, &nIterations);
> >
> > KSPDestroy(&ksp);
> >
> >
> > I know that this works and gives me the correct result. However, BoomerAMG is proving to be too slow, at least the way I use it at 256 processors with around 12 million unknowns. I need your advice on the preconditioner. Am I using it right? Is there anything else I could be doing better?
> >
> > ganesh