[petsc-users] LU factorization and solution of independent matrices does not scale, why?

Fri Dec 21 15:05:21 CST 2012

So, here it is. Just compile and run with

mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu \
    -pc_factor_mat_solver_package superlu_dist -log_summary

64 cores: 0.09 seconds for solving
1024 cores: 2.6 seconds for solving
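
To put the two numbers in perspective: every group of four ranks solves an independent system of the same size, so under ideal weak scaling the time would stay constant. The following is only a small sketch of that arithmetic in Python. It is not part of the attached ex10.c, and the helper name `slowdown` is made up for illustration.

```python
# Solve times reported above for the attached test (seconds per core count).
solve_time = {64: 0.09, 1024: 2.6}

def slowdown(times, base):
    """Return time(p) / time(base) for every core count p."""
    return {p: t / times[base] for p, t in times.items()}

# Under ideal weak scaling every ratio would be 1.0.
ratios = slowdown(solve_time, base=64)
for p in sorted(ratios):
    print(f"{p:5d} cores: {solve_time[p]:.2f} s, "
          f"{ratios[p]:.1f}x the 64-core time")
```

A ratio well above 1 on 1024 cores means the independent solves are being slowed down by something shared between the groups.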

Thomas

Quoting Jed Brown <jedbrown at mcs.anl.gov>:

> Can you reproduce this in a simpler environment so that we can report it?
> As I understand your statement, it sounds like you could reproduce by
> changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm of size
> 4 and then using that everywhere, then compare log_summary running on 4
> cores to running on more (despite everything really being independent).
>
> It would also be worth using an MPI profiler to see if it's really spending
> a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use MPI_Iprobe, it
> may be something else.
>
> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski <
> Thomas.Witkowski at tu-dresden.de> wrote:
>
>> I use a modified MPICH version. On the system I use for these benchmarks I
>> cannot use another MPI library.
>>
>> I'm not fixed to MUMPS. Superlu_dist, for example, also works perfectly
>> for this. But there is still the following problem I cannot solve: when I
>> increase the number of coarse space matrices, there seems to be no direct
>> solver that scales for this. Just to summarize:
>> - each coarse space matrix is always created by one "cluster" consisting of
>> four subdomains/MPI tasks
>> - the four tasks are always local to one node, thus inter-node network
>> communication is not required for computing factorization and solve
>> - independent of the number of clusters, the coarse space matrices are the
>> same: they have the same number of rows and the same nnz structure, but
>> possibly different values
>> - there is NO load imbalance
>> - the matrices must be factorized and there are a lot of solves (> 100)
>> with them
>>
>> It should be pretty clear that computing the LU factorization and solving
>> with it should scale perfectly. But at the moment, none of the direct solvers
>> I tried (mumps, superlu_dist, pastix) is able to scale. The loss of scaling
>> is really bad, as you can see from the numbers I sent before.
>>
>> Any ideas? Suggestions? Without a scaling solver method for this kind of
>> system, my multilevel FETI-DP code is just more or less a joke, only some
>> orders of magnitude slower than the standard FETI-DP method :)
>>
>> Thomas
>>
>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>
>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>>> implementation have you been using? Is the behavior different with a
>>> different implementation?
>>>
>>>
>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <
>>> thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>
>>>> UMFPACK
>>>>  16p: Local solve          350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>>>>  64p: Local solve          350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
>>>> 256p: Local solve          350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>>>>
>>>> MUMPS
>>>>  16p: Local solve          350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>>>>  64p: Local solve          350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
>>>> 256p: Local solve          350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>>>>
>>>>
>>>> As you see, the local solves with UMFPACK have nearly constant time with
>>>> increasing number of subdomains. This is what I expect. Then I replace
>>>> UMFPACK by MUMPS and I see increasing time for local solves. In the last
>>>> columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's
>>>> column increases here from 75 to 82. What does this mean?
>>>>
>>>> Thomas
>>>>
>>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>>
>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>
>>>>>> I cannot use the information from log_summary, as I have three different
>>>>>> LU factorizations and solves (local matrices and two hierarchies of coarse
>>>>>> grids). Therefore, I use the following workaround to get the timing of the
>>>>>> solve I'm interested in:
>>>>>>
>>>>> You misunderstand how to use logging. You just put these things in
>>>>> separate stages. Stages represent parts of the code over which events
>>>>> are aggregated.
>>>>>
>>>>>     Matt
>>>>>
>>>>>>      MPI::COMM_WORLD.Barrier();
>>>>>>      wtime = MPI::Wtime();
>>>>>>      KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>      FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>
>>>>>> The factorization is done explicitly beforehand with "KSPSetUp", so I can
>>>>>> measure the time for the LU factorization. It also does not scale! For 64
>>>>>> cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>>>>>> calculations, the local coarse space matrices defined on four cores have
>>>>>> exactly the same number of rows and exactly the same number of nonzero
>>>>>> entries. So, from my point of view, the time should be absolutely constant.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>
>>>>>>
>>>>>>>    Are you timing ONLY the time to factor and solve the subproblems? Or
>>>>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>>>>
>>>>>>>    If you are only using LU for these problems and not elsewhere in the
>>>>>>> code you can get the factorization and time from MatLUFactor() and
>>>>>>> MatSolve(), or you can use stages to put this calculation in its own
>>>>>>> stage and use the MatLUFactor() and MatSolve() time from that stage.
>>>>>>> Also look at the load balancing column for the factorization and solve
>>>>>>> stage; is it well balanced?
>>>>>>>
>>>>>>>     Barry
>>>>>>>
>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>
>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>>>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>>>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>>>>>> communicator, which is a subset of MPI::COMM_WORLD. The LU factorization
>>>>>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>>>>>> show a scaling behavior I really wonder about: when the overall problem
>>>>>>>> size is increased, the solve with the LU factorization of the local
>>>>>>>> matrices does not scale! But why not? I just increase the number of
>>>>>>>> local matrices, but all of them are independent of each other. An
>>>>>>>> example: I use 64 cores, each coarse matrix is spanned by 4 cores, so
>>>>>>>> there are 16 MPI communicators with 16 coarse space matrices. The
>>>>>>>> problem needs 192 solves with the coarse space systems, and together
>>>>>>>> these take 0.09 seconds. Now I increase the number of cores to 256, but
>>>>>>>> let the local coarse space again be defined on only 4 cores. Again, 192
>>>>>>>> solves with these coarse spaces are required, but now this takes 0.24
>>>>>>>> seconds. The same for 1024 cores, and we are at 1.7 seconds for the
>>>>>>>> local coarse space solver!
>>>>>>>>
>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>>>>>> eventually how to resolve this problem?
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
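
A note on the question in the quoted benchmark about the columns that go from 63 to 49 (UMFPACK) and from 75 to 82 (MUMPS). As far as I understand the log_summary layout, the five numbers after the reduction count are the global percentages %T %f %M %L %R, so the first of them is the share of the total run time spent in that event. The Python sketch below parses the quoted "Local solve" lines under that assumption. The field names in `FIELDS` are my own labels, not PETSc identifiers, and the total time it prints is only a rough estimate because the percentages are rounded.

```python
# Labels (my own, assumed layout) for the 20 numeric fields of one event line.
FIELDS = [
    "count", "count_ratio", "time_max", "time_ratio",
    "flops_max", "flops_ratio", "mess", "avg_len", "reduct",
    "global_T", "global_f", "global_M", "global_L", "global_R",
    "stage_T", "stage_f", "stage_M", "stage_L", "stage_R",
    "mflops",
]

def parse_event(line):
    """Split one event line into its name and named numeric fields."""
    tokens = line.split()
    n = len(FIELDS)
    row = dict(zip(FIELDS, (float(t) for t in tokens[-n:])))
    row["event"] = " ".join(tokens[:-n])  # the name may contain spaces
    return row

# Event lines copied from the benchmark quoted in this thread.
LINES = {
    ("UMFPACK", 16): "Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 "
                     "0.0e+00 0.0e+00 7.0e+02 63 0 0 0 52 63 0 0 0 51 0",
    ("UMFPACK", 256): "Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 "
                      "0.0e+00 0.0e+00 7.0e+02 49 0 0 0 52 49 0 0 0 51 1",
    ("MUMPS", 16): "Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 "
                   "0.0e+00 0.0e+00 7.0e+02 75 0 0 0 52 75 0 0 0 51 0",
    ("MUMPS", 256): "Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 "
                    "0.0e+00 0.0e+00 7.0e+02 82 0 0 0 52 82 0 0 0 51 0",
}

rows = {key: parse_event(line) for key, line in LINES.items()}
for (solver, ranks), row in rows.items():
    # Rough total run time implied by the rounded percentage.
    total = row["time_max"] / (row["global_T"] / 100.0)
    print(f"{solver:8s} {ranks:4d}p: solve {row['time_max']:7.1f} s, "
          f"{row['global_T']:.0f}% of roughly {total:.1f} s total")
```

If that reading is right, the UMFPACK share falls because the solve stays near 23 s while the rest of the run grows, and the MUMPS share rises because the solve itself grows from about 47 s to about 261 s.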

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex10.c
Type: text/x-c++src
Size: 3496 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20121221/77093f71/attachment.c>