<div dir="ltr">Can you reproduce this in a simpler environment so that we can report it? As I understand your statement, it sounds like you could reproduce it by changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm of size 4 and then using that everywhere, then comparing the log_summary output from a run on 4 cores to runs on more cores (despite everything really being independent).<br>
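<br>
A minimal sketch of the rank grouping such a modified example could use, assuming world ranks are assigned to clusters in contiguous blocks of four. The helper names are hypothetical and not part of ex10.c; the values they return are intended as the color and key arguments of MPI_Comm_split(comm, color, key, &amp;newcomm), which is the standard MPI call for building a subcomm:<br>

```c
/* Hypothetical helpers, not part of PETSc or ex10.c.
 * Assumption: ranks are grouped into clusters in contiguous blocks. */

/* Cluster index of a world rank; intended as the `color` argument of
 * MPI_Comm_split, so all ranks of one cluster land in one subcomm. */
int cluster_color(int world_rank, int cluster_size)
{
    return world_rank / cluster_size;
}

/* Position of a world rank inside its cluster; intended as the `key`
 * argument of MPI_Comm_split, which orders ranks in the subcomm. */
int cluster_key(int world_rank, int cluster_size)
{
    return world_rank % cluster_size;
}

/* Number of clusters (and therefore subcomms) for a given world size,
 * rounding up if the world size is not a multiple of the cluster size. */
int cluster_count(int world_size, int cluster_size)
{
    return (world_size + cluster_size - 1) / cluster_size;
}
```

With a cluster size of 4, 64 ranks give 16 independent subcomms, so each run does the same per-cluster work and only the number of clusters changes.<br>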

<div class="gmail_extra"><br></div><div class="gmail_extra">It would also be worth using an MPI profiler to see if it's really spending a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use MPI_Iprobe, it may be something else.<br>

<br><div class="gmail_quote">On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski <span dir="ltr"><<a href="mailto:Thomas.Witkowski@tu-dresden.de" target="_blank">Thomas.Witkowski@tu-dresden.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I use a modified MPICH version. On the system I use for these benchmarks I cannot use another MPI library.<br>

<br>

I'm not fixed on MUMPS. SuperLU_DIST, for example, also works perfectly for this. But there is still the following problem I cannot solve: when I increase the number of coarse space matrices, no direct solver seems to scale. Just to summarize:<br>


- each coarse space matrix is always created by one "cluster" consisting of four subdomains/MPI tasks<br>

- the four tasks are always local to one node, thus inter-node network communication is not required for computing the factorization and the solves<br>

- independent of the number of clusters, the coarse space matrices are the same: they have the same number of rows and the same nnz structure, but possibly different values<br>

- there is NO load imbalance<br>

- the matrices must be factorized and there are a lot of solves (> 100) with them<br>

<br>

It should be pretty clear that computing the LU factorization and solving with it should scale perfectly. But at the moment, none of the direct solvers I tried (mumps, superlu_dist, pastix) is able to scale. The loss of scaling is really bad, as you can see from the numbers I sent before.<br>


<br>

Any ideas? Suggestions? Without a scaling solver method for this kind of system, my multilevel FETI-DP code is just more or less a joke, only some orders of magnitude slower than the standard FETI-DP method :)<br>

<br>

Thomas<br>

<br>

Quoting Jed Brown <<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>>:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">

MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI<br>

implementation have you been using? Is the behavior different with a<br>

different implementation?<br>

<br>

<br>

On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <<br>

<a href="mailto:thomas.witkowski@tu-dresden.de" target="_blank">thomas.witkowski@tu-dresden.de</a><u></u>> wrote:<br>

<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">

Okay, I did a similar benchmark now with PETSc's event logging:<br>

<br>

UMFPACK<br>

 16p: Local solve          350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0<br>

 64p: Local solve          350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0<br>

256p: Local solve          350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1<br>

<br>

MUMPS<br>

 16p: Local solve          350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0<br>

 64p: Local solve          350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0<br>

256p: Local solve          350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0<br>

<br>

<br>

As you see, the local solves with UMFPACK take nearly constant time with an increasing number of subdomains. This is what I expect. Then I replace UMFPACK by MUMPS and I see increasing time for the local solves. In the last columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's column increases here from 75 to 82. What does this mean?<br>
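<br>
For reference, the slowdown implied by the "Local solve" times above can be computed directly. This is only a sketch: the function and array names are hypothetical, and the times are copied from the log rows for 16, 64 and 256 processes. From 16 to 256 processes the ratio comes out at roughly 1.02 for UMFPACK and roughly 5.5 for MUMPS:<br>

```c
/* "Local solve" wall-clock times in seconds for 16, 64 and 256 processes,
 * copied from the log rows above. The array names are hypothetical. */
static const double umfpack_time[3] = { 2.3025e+01, 2.3208e+01, 2.3373e+01 };
static const double mumps_time[3]   = { 4.7183e+01, 7.1409e+01, 2.6079e+02 };

/* Ratio of the time at the larger process count to the time at the base
 * count. With a fixed amount of work per cluster, perfect scaling gives
 * a ratio of 1.0. */
double slowdown(double t_base, double t_large)
{
    return t_large / t_base;
}

/* Average time per call, given the total event time and the call count
 * (350 calls in every row above). */
double time_per_call(double t_total, int calls)
{
    return t_total / (double)calls;
}
```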

<br>

Thomas<br>

<br>

On 21.12.2012 02:19, Matthew Knepley wrote:<br>

<br>

 On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<<a href="mailto:Thomas.Witkowski@tu-dresden.de" target="_blank">Thomas.Witkowski@tu-dresden.de</a>><div class="im"><br>

wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I cannot use the information from log_summary, as I have three different LU factorizations and solves (local matrices and two hierarchies of coarse grids). Therefore, I use the following workaround to get the timing of the solve I'm interested in:<br>

<br>

</blockquote>

You misunderstand how to use logging. You just put these things in separate stages. Stages represent parts of the code over which events are aggregated.<br>

<br>

    Matt<br>

<br>

      MPI::COMM_WORLD.Barrier();<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

     wtime = MPI::Wtime();<br>

     KSPSolve(*(data->ksp_schur_primal_local), tmp_primal,<div><div class="h5"><br>

tmp_primal);<br>

     FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);<br>

<br>

The factorization is done explicitly beforehand with "KSPSetUp", so I can measure the time for the LU factorization. It also does not scale! For 64 cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all calculations, the local coarse space matrices defined on four cores have exactly the same number of rows and exactly the same number of nonzero entries. So, from my point of view, the time should be absolutely constant.<br>

<br>

Thomas<br>

<br>

Quoting Barry Smith <<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>>:<br>

<br>

<br>

    Are you timing ONLY the time to factor and solve the subproblems?  Or<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">

also the time to get the data to the collection of 4 cores at a  time?<br>

<br>

    If you are only using LU for these problems and not elsewhere in the code, you can get the factorization and solve time from MatLUFactor() and MatSolve(), or you can use stages to put this calculation in its own stage and use the MatLUFactor() and MatSolve() time from that stage. Also look at the load balancing column for the factorization and solve stage: is it well balanced?<br>

<br>

    Barry<br>

<br>

On Dec 20, 2012, at 2:16 PM, Thomas Witkowski<br></div></div>

<<a href="mailto:thomas.witkowski@tu-dresden.de" target="_blank">thomas.witkowski@tu-dresden.de</a>><div><div class="h5"><br>

wrote:<br>

<br>

 In my multilevel FETI-DP code, I have localized coarse matrices, which<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

are defined on only a subset of all MPI tasks, typically between 4 and 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI communicator, which is a subset of MPI::COMM_WORLD. The LU factorization of the matrices is computed with either MUMPS or superlu_dist, but both show scaling behavior that puzzles me: when the overall problem size is increased, the solve with the LU factorization of the local matrices does not scale! But why not? I just increase the number of local matrices, but all of them are independent of each other. An example: I use 64 cores, and each coarse matrix is spanned by 4 cores, so there are 16 MPI communicators with 16 coarse space matrices. The problem needs to solve 192 times with the coarse space systems, and this takes 0.09 seconds in total. Now I increase the number of cores to 256, but let the local coarse space be defined again on only 4 cores. Again, 192 solutions with these coarse spaces are required, but now this takes 0.24 seconds. The same for 1024 cores, and we are at 1.7 seconds for the local coarse space solver!<br>

<br>

For me, this is a total mystery! Any idea how to explain, debug and<br>

eventually how to resolve this problem?<br>

<br>

Thomas<br>

<br>

</blockquote>

<br>

<br>

<br>

</div></div></blockquote>

<br>

</blockquote><div><div class="h5">

<br>

--<br>

What most experimenters take for granted before they begin their<br>

experiments is infinitely more interesting than any results to which<br>

their experiments lead.<br>

-- Norbert Wiener<br>

<br>

</div></div></blockquote>

<br>

<br>

</blockquote>

<br>

</blockquote>

<br>

<br>

</blockquote></div><br></div></div>