<p dir="ltr">Hi Thomas,</p>
<p dir="ltr">Network topology is important. Since most machines are not fully connected, random subsets of four processes will become more scattered about the cluster as you increase your total number of processes. </p>
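<p dir="ltr">To make the scattering concrete, here is a minimal sketch and not a measurement of your machine. It assumes a hypothetical layout in which ranks are placed on nodes in consecutive blocks of <code>cores_per_node</code>, and it uses an evenly strided group as a stand-in for a scattered assignment. The helper names <code>node_of</code>, <code>node_span</code>, <code>contiguous_group</code> and <code>strided_group</code> are made up for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// ASSUMPTION: ranks are placed on nodes in consecutive blocks of
// cores_per_node. Real launchers may map ranks differently.
int node_of(int rank, int cores_per_node) {
    assert(cores_per_node > 0);
    return rank / cores_per_node;
}

// Distance between the lowest and highest node index used by a group.
// 0 means every rank of the group sits on the same node.
int node_span(const std::vector<int>& ranks, int cores_per_node) {
    assert(!ranks.empty());
    int lo = node_of(ranks.front(), cores_per_node);
    int hi = lo;
    for (int r : ranks) {
        const int n = node_of(r, cores_per_node);
        lo = std::min(lo, n);
        hi = std::max(hi, n);
    }
    return hi - lo;
}

// Group g takes consecutive ranks: g*size, g*size + 1, ...
std::vector<int> contiguous_group(int g, int group_size) {
    assert(group_size > 0);
    std::vector<int> ranks;
    for (int i = 0; i < group_size; ++i) {
        ranks.push_back(g * group_size + i);
    }
    return ranks;
}

// Group g takes ranks spread evenly over the whole job: g, g + G, g + 2G, ...
// where G is the number of groups. Hypothetical stand-in for scattered groups.
std::vector<int> strided_group(int g, int group_size, int total_ranks) {
    assert(group_size > 0 && total_ranks % group_size == 0);
    const int num_groups = total_ranks / group_size;
    std::vector<int> ranks;
    for (int i = 0; i < group_size; ++i) {
        ranks.push_back(g + i * num_groups);
    }
    return ranks;
}
```

<p dir="ltr">With an assumed 16 cores per node, <code>node_span(strided_group(0, 4, total), 16)</code> grows with <code>total</code>, while <code>node_span(contiguous_group(g, 4), 16)</code> stays at 0. In MPI terms, the contiguous grouping would correspond to passing <code>rank / 4</code> as the color to <code>MPI_Comm_split</code>, provided the launcher really does place consecutive ranks on the same node.</p>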
<p dir="ltr">Jack</p>
<div class="gmail_quote">On Dec 20, 2012 12:39 PM, "Thomas Witkowski" <<a href="mailto:Thomas.Witkowski@tu-dresden.de">Thomas.Witkowski@tu-dresden.de</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I cannot use the information from log_summary, as I have three different LU factorizations and solves (local matrices and two hierarchies of coarse grids). Therefore, I use the following workaround to get the timing of the solve I'm interested in:<br>
<br>
MPI::COMM_WORLD.Barrier();<br>
wtime = MPI::Wtime();<br>
KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);<br>
FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);<br>
<br>
The factorization is done explicitly beforehand with "KSPSetUp", so I can measure the time for the LU factorization. It also does not scale! For 64 cores, it takes 0.05 seconds; for 1024 cores, 1.2 seconds. In all calculations, the local coarse space matrices defined on four cores have exactly the same number of rows and exactly the same number of nonzero entries. So, from my point of view, the time should be absolutely constant.<br>
<br>
Thomas<br>
<br>
Quoting Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Are you timing ONLY the time to factor and solve the subproblems? Or also the time to get the data to the collection of 4 cores at a time?<br>
<br>
If you are only using LU for these problems and not elsewhere in the code, you can get the factorization and solve times from MatLUFactor() and MatSolve(), or you can use stages to put this calculation in its own stage and use the MatLUFactor() and MatSolve() times from that stage.<br>
Also look at the load balancing column for the factorization and solve stage: is it well balanced?<br>
<br>
Barry<br>
<br>
On Dec 20, 2012, at 2:16 PM, Thomas Witkowski <<a href="mailto:thomas.witkowski@tu-dresden.de" target="_blank">thomas.witkowski@tu-dresden.de</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
In my multilevel FETI-DP code, I have localized coarse matrices, which are defined on only a subset of all MPI tasks, typically between 4 and 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI communicator that is a subset of MPI::COMM_WORLD. The LU factorization of the matrices is computed with either MUMPS or superlu_dist, but both show a scaling behavior that puzzles me: when the overall problem size is increased, the solve with the LU factorization of the local matrices does not scale! But why not? I only increase the number of local matrices, and all of them are independent of each other. An example: I use 64 cores, and each coarse matrix is spanned by 4 cores, so there are 16 MPI communicators with 16 coarse space matrices. The problem requires 192 solves with the coarse space systems, which together take 0.09 seconds. Now I increase the number of cores to 256, but the local coarse spaces are again defined on only 4 cores each. Again, 192 solves with these coarse spaces are required, but now they take 0.24 seconds. The same for 1024 cores, and we are at 1.7 seconds for the local coarse space solver!<br>
<br>
For me, this is a total mystery! Any idea how to explain, debug, and possibly resolve this problem?<br>
<br>
Thomas<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
<br>
</blockquote></div>