<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">Faraz :</div><div class="gmail_quote">The results look reasonable to me. I guess you collect strong speedup, i.e., fixed problem size while increasing cpus. How large is your matrix?</div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Thanks, the solve times are faster after I tried sequential symbolic factorization instead of parallel. However, they are still slower than Pardiso with 24 cpus ( 120 seconds ). I am not sure if it a configuration issue on my end or a limitation of mumps?<br>
<br></blockquote><div>How do you run Pardiso with 24 cpus? </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
Is it possible for someone else to solve my matrix to verify they get the same times? If not, I will contact mumps developers to see if I can send them my matrix to benchmark.<br></blockquote><div>mumps developers would give you better suggestions.</div><div><br></div><div>Hong </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<br>
--------------------------------------------<br>
<span class="">On Mon, 6/27/16, Hong <<a href="mailto:hzhang@mcs.anl.gov">hzhang@mcs.anl.gov</a>> wrote:<br>
<br>
Subject: Re: [petsc-users] Performance of mumps vs. Intel Pardiso<br>
To: "Faraz Hussain" <<a href="mailto:faraz_hussain@yahoo.com">faraz_hussain@yahoo.com</a>><br>
</span> Cc: "Barry Smith" <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>>, "<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>" <<a href="mailto:petsc-users@mcs.anl.gov">petsc-users@mcs.anl.gov</a>><br>
Date: Monday, June 27, 2016, 8:40 PM<br>
<span class="im"><br>
Faraz :Direct sparse solvers are<br>
generally not scalable -- they are used for ill-conditioned<br>
problems which cannot be solved by iterative<br>
methods. <br>
Can<br>
you try sequential symbolic factorization instead of<br>
parallel, i.e., use mumps default '-mat_mumps_icntl_28<br>
1'?<br>
Hong<br>
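
[A minimal sketch, not from the original thread, of fixing the same
ICNTL(28) setting in code through PETSc's MUMPS interface. The names
below follow current PETSc releases; 2016-era versions spelled the
solver-selection calls PCFactorSetMatSolverPackage and
PCFactorSetUpMatSolverPackage.]

#include <petscksp.h>

/* Select MUMPS Cholesky and force sequential analysis; equivalent to
   -ksp_type preonly -pc_type cholesky -pc_factor_mat_solver_type mumps
   -mat_mumps_icntl_28 1 on the command line.                           */
static PetscErrorCode UseMumpsSequentialAnalysis(KSP ksp)
{
  PC  pc;
  Mat F;

  PetscFunctionBeginUser;
  PetscCall(KSPSetType(ksp, KSPPREONLY));      /* direct solve only        */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCCHOLESKY));
  PetscCall(PCFactorSetMatSolverType(pc, MATSOLVERMUMPS));
  PetscCall(PCFactorSetUpMatSolverType(pc));   /* create the MUMPS factor  */
  PetscCall(PCFactorGetMatrix(pc, &F));
  PetscCall(MatMumpsSetIcntl(F, 28, 1));       /* ICNTL(28)=1: sequential analysis */
  PetscFunctionReturn(PETSC_SUCCESS);
}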
</span><div class=""><div class="h5"> Thanks<br>
for the quick response. Here are the log_summary for 24, 48<br>
and 72 cpus:<br>
<br>
<br>
<br>
24 cpus<br>
<br>
======<br>
<br>
MatSolve 1 1.0 1.8100e+00 1.0 0.00e+00<br>
0.0 7.0e+02 7.4e+04 3.0e+00 0 0 68 3 9 0 0<br>
68 3 9 0<br>
<br>
MatCholFctrSym 1 1.0 4.6683e+01 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 5.0e+00 6 0 0 0 15 6 0 <br>
0 0 15 0<br>
<br>
MatCholFctrNum 1 1.0 5.8129e+02 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 0.0e+00 78 0 0 0 0 78 0 <br>
0 0 0 0<br>
<br>
<br>
<br>
48 cpus<br>
<br>
======<br>
<br>
MatSolve 1 1.0 1.4915e+00 1.0 0.00e+00<br>
0.0 1.6e+03 3.3e+04 3.0e+00 0 0 68 3 9 0 0<br>
68 3 9 0<br>
<br>
MatCholFctrSym 1 1.0 5.3486e+01 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 5.0e+00 9 0 0 0 15 9 0 <br>
0 0 15 0<br>
<br>
MatCholFctrNum 1 1.0 4.0803e+02 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 0.0e+00 71 0 0 0 0 71 0 <br>
0 0 0 0<br>
<br>
<br>
<br>
72 cpus<br>
<br>
======<br>
<br>
MatSolve 1<br>
1.0 7.7200e+00 1.1 0.00e+00 0.0 2.6e+03 2.0e+04 3.0e+00 <br>
1 0 68 2 9 1 0 68 2 9 0<br>
<br>
MatCholFctrSym 1 1.0 1.8439e+02 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 5.0e+00 29 0 0 0 15 29 0 0 <br>
0 15 0<br>
<br>
MatCholFctrNum 1 1.0 3.3969e+02 1.0 0.00e+00<br>
0.0 0.0e+00 0.0e+00 0.0e+00 53 0 0 0 0 53 0 <br>
0 0 0 0<br>
<br>
<br>
<br>
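
[Reading the numbers above: the numeric factorization (MatCholFctrNum)
does scale, 5.8e+02 -> 4.1e+02 -> 3.4e+02 seconds, while the symbolic
factorization (MatCholFctrSym) grows from 4.7e+01 to 5.3e+01 to
1.8e+02 seconds as the cpu count rises, so the parallel analysis is
the part that stops scaling.]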
Does this look normal, or is something off here?

Regarding the reordering algorithm of Pardiso: at this time I do not
know much about it. I will do some research and see what I can learn.
However, I believe Mumps only has two options:

-mat_mumps_icntl_29 - ICNTL(29): parallel ordering 1 = ptscotch, 2 = parmetis

I have tried both and do not see any speed difference. Or are you
referring to some other kind of reordering?

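[An aside, not from the original message: ICNTL(29) only takes effect
when the analysis itself runs in parallel (ICNTL(28) = 2). With the
sequential analysis suggested above, the ordering is chosen by
ICNTL(7) instead, exposed in PETSc as

-mat_mumps_icntl_7 - ICNTL(7): sequential ordering, e.g. 3 = scotch, 4 = pord, 5 = metis (values per the MUMPS manual)

so there are more orderings to compare than the two parallel ones.]
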
--------------------------------------------
On Mon, 6/27/16, Barry Smith <bsmith@mcs.anl.gov> wrote:

Subject: Re: [petsc-users] Performance of mumps vs. Intel Pardiso
To: "Faraz Hussain" <faraz_hussain@yahoo.com>
Cc: "petsc-users@mcs.anl.gov" <petsc-users@mcs.anl.gov>
Date: Monday, June 27, 2016, 5:50 PM

These are the only lines that matter:

MatSolve            1 1.0 7.7200e+00 1.1 0.00e+00 0.0 2.6e+03 2.0e+04 3.0e+00  1  0 68  2  9   1  0 68  2  9     0
MatCholFctrSym      1 1.0 1.8439e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00 29  0  0  0 15  29  0  0  0 15     0
MatCholFctrNum      1 1.0 3.3969e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 53  0  0  0  0  53  0  0  0  0     0

Look at the log summary for 24 and 48 processes. How are the symbolic
and numeric parts scaling with the number of processes?

Things that could affect the performance a lot: Is the symbolic
factorization done in parallel? What reordering is used? If Pardiso is
using a reordering that is better for this matrix and has (much) lower
fill, that could explain why it is so much faster.

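[An aside, not in Barry's original message: one way to see the fill
MUMPS actually produces is to raise its print level through PETSc's
pass-through option '-mat_mumps_icntl_4 2'; ICNTL(4) is MUMPS's output
verbosity, and at level 2 the analysis phase prints main statistics
such as estimated factor nonzeros and operation counts, which can be
compared against Pardiso's reordering report.]
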
Perhaps correspond with the MUMPS developers on what MUMPS options
might make it faster.

Barry

> On Jun 27, 2016, at 5:39 PM, Faraz Hussain <faraz_hussain@yahoo.com> wrote:
>
> I am struggling to understand why mumps is so much slower than the
> Intel Pardiso solver for my simple test matrix (a 3 million x 3 million
> sparse symmetric matrix with ~1000 non-zero entries per row).
>
> My compute nodes have 24 cpus each. Intel Pardiso solves it in 120
> seconds using all 24 cpus of one node. With Mumps I get:
>
> 24 cpus - 765 seconds
> 48 cpus - 401 seconds
> 72 cpus - 344 seconds
> Beyond 72 cpus there is no speed improvement.
>
> I am attaching the -log_summary to see if there is something wrong in
> how I am solving the problem. I am really hoping mumps will be faster
> when using more cpus.. Otherwise I will have to abort my exploration
> of mumps! <log_summary.o265103>
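
[A minimal sketch, assuming the setup described above is a PETSc
application: a 3,000,000 x 3,000,000 symmetric AIJ matrix with ~1000
nonzeros per row, factored by MUMPS Cholesky. The function name and
preallocation numbers are illustrative; the option spelling follows
current PETSc (older releases used -pc_factor_mat_solver_package).]

#include <petscksp.h>

/* Create and flag the test matrix so Cholesky can be dispatched to
   MUMPS; solve with, e.g.:
     mpiexec -n 24 ./app -ksp_type preonly -pc_type cholesky \
             -pc_factor_mat_solver_type mumps -log_view              */
static PetscErrorCode CreateTestMatrix(Mat *A)
{
  const PetscInt n = 3000000, row_nnz = 1000;  /* sizes quoted in the message */

  PetscFunctionBeginUser;
  PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                         n, n, row_nnz, NULL, row_nnz, NULL, A));
  PetscCall(MatSetOption(*A, MAT_SYMMETRIC, PETSC_TRUE)); /* allows Cholesky on AIJ */
  /* ... MatSetValues() for the entries, then MatAssemblyBegin/End ... */
  PetscFunctionReturn(PETSC_SUCCESS);
}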