<div class="gmail_quote">On Wed, Jun 1, 2011 at 22:43, John Fettig <span dir="ltr">&lt;<a href="mailto:john.fettig@gmail.com">john.fettig@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div id=":3gw">Unless I&#39;m reading it wrong, there are 89M insertions.  This is from<br>

10 timesteps with nonlinear iteration on each timestep,</div></blockquote><div><br></div><div>Reducing the 27 linear solves per time step would normally be a better place to direct your effort, but strangely, very little of the total run time is currently in the solve.</div>

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":3gw"> with ~1.5M<br>

elements. </div></blockquote><div><br></div><div>Is this 1.5M finite elements, dofs, or nonzeros in the matrix? Doing more than 1000 SOR sweeps per second would be weird if the matrix is big.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div id=":3gw">

<br>

We had thought about constructing the matrix outside of PETSc and then<br>

passing PETSc pointers to the matrix, but maybe matrix-free would be<br>

better for some of the equations where jacobi preconditioning is<br>

sufficient.</div></blockquote></div><br><div>Have you considered solving the coupled system instead of a bunch of separate systems? Solving the coupled system lets you use block matrix formats, which gives better data locality. It also reduces the number of synchronization points (parallel reductions and scatters), and the coupled solve should converge faster. 27 separate solves per time step is really high, and you currently have over 800 reductions per time step. That is going to hurt a lot in parallel, especially for strong scaling.</div>
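<div><br></div><div>To make the block-format point concrete, here is a rough sketch in plain Python of a block-CSR matrix-vector product, the general kind of layout that block formats such as PETSc's BAIJ are built around. This is only an illustration, not PETSc's implementation: the function names, the 2-node, 2-field example, and the interlaced ordering (unknown f of node i stored at index i*bs + f) are all assumptions made for the example.</div>

```python
def bcsr_matvec(bs, row_ptr, col_idx, blocks, x):
    """Compute y = A x for a block-CSR matrix with dense bs-by-bs blocks.

    Illustrative only: names and layout are assumptions, not PETSc's API.
    """
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * bs)
    for i in range(n_block_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]        # one column index per block, not per scalar
            block = blocks[k]     # the bs*bs values of this block sit together
            for r in range(bs):
                acc = 0.0
                for c in range(bs):
                    # x[j*bs : j*bs + bs] is contiguous thanks to interlacing
                    acc += block[r][c] * x[j * bs + c]
                y[i * bs + r] += acc
    return y


def index_counts(bs, col_idx):
    """Column indices stored by block CSR vs. scalar CSR for the same entries."""
    n_blocks = len(col_idx)
    return n_blocks, n_blocks * bs * bs


# Hypothetical coupled system: 2 nodes, 2 fields per node (bs = 2).
row_ptr = [0, 2, 3]
col_idx = [0, 1, 1]
blocks = [
    [[4.0, 1.0], [1.0, 3.0]],    # block (0, 0)
    [[-1.0, 0.0], [0.0, -1.0]],  # block (0, 1)
    [[2.0, 0.5], [0.5, 2.0]],    # block (1, 1)
]
x = [1.0, 2.0, 3.0, 4.0]

y = bcsr_matvec(2, row_ptr, col_idx, blocks, x)
counts = index_counts(2, col_idx)
print(y)       # [3.0, 3.0, 8.0, 9.5]
print(counts)  # (3, 12)
```

<div>With bs fields per node, the block layout stores one column index per bs*bs values instead of one per value, and each block multiplies a contiguous piece of x. That is where the locality benefit comes from. The same coupling means each Krylov iteration does its reductions once for all fields rather than once per separate solve.</div>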