[petsc-dev] Wrapper for WSMP

Jack Poulson jack.poulson at gmail.com
Thu Aug 18 17:32:23 CDT 2011


Hello,

On Thu, Aug 18, 2011 at 5:11 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:

> I can clarify a couple of questions re. SuperLU_DIST.
>
> 1) SuperLU does support multiple right-hand sides.  That is, the B matrix
> on the right can be a dense matrix of size n-by-nrhs.  Also, the B matrix
> is distributed among all the processors; each processor takes one block row
> of B.  There is no need to have the entire B on every processor.
>
>
I certainly agree that this is an improvement over MUMPS.
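
For concreteness, my mental model of that block-row layout is something like
the following sketch of how each rank's local slab of B might be sized and
allocated (column-major, nrhs columns). The names fst_row, m_loc, and ldb are
my own shorthand, not SuperLU_DIST's actual interface:

    #include <stdlib.h>

    /* Rough sketch of an even block-row partition of a dense n-by-nrhs
     * right-hand-side matrix B over nprocs ranks, stored column-major on
     * each rank.  This is only my mental model of the distribution, not
     * SuperLU_DIST API code. */
    void local_rhs_block(int n, int nrhs, int nprocs, int rank,
                         int *fst_row, int *m_loc, int *ldb, double **B_loc)
    {
        int chunk = n / nprocs, rem = n % nprocs;
        /* The first 'rem' ranks hold one extra row so the rows divide evenly. */
        *m_loc   = chunk + (rank < rem ? 1 : 0);
        *fst_row = rank * chunk + (rank < rem ? rank : rem);
        *ldb     = *m_loc;  /* leading dimension of the local column-major slab */
        *B_loc   = calloc((size_t)(*m_loc) * (size_t)nrhs, sizeof(double));
    }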


> 2) We are preparing to upgrade to a newer version.  The parallel
> factorization is improved with a better scheduling algorithm.  This is
> particularly effective at larger core counts, say in the 100s.
>
>
I will be looking forward to the new version. I took a several-month hiatus
from trying to use sparse-direct solvers for my subproblems since I wasn't
seeing substantial speedups over the serial algorithm; it was taking a few
hundred cores to solve 3D Helmholtz over 256^3 domains in an hour.


> 3) Regarding memory usage, the factorization algorithm used in SuperLU
> mostly performs "update in-place" and requires just a little bit of extra
> working storage.  So, if the load balance is not too bad, the memory per
> core should go down consistently.  It would be helpful if you could give
> some concrete numbers showing how much memory usage increases with
> increasing core count.
>
>
As for specifics, factorizations of 256 x 256 x 10 grids with 7-point
finite-difference stencils required more memory per process when I increased
past ~200 processes. I was storing roughly 50 of these factorizations before
running out of memory, no matter how many more processes I threw at
SuperLU_Dist and MUMPS. I would think that the load balance would be decent
since it is such a regular grid.

Also, if I'm recalling correctly, on 256 cores I was only seeing ~5 GFlops
in each triangular solve against a subdomain, which isn't much of an
improvement over my laptop. Does this sound pathological to you, or is it to
be expected from SuperLU_Dist and MUMPS? The linked WSMP paper showed some
impressive triangular-solve scalability. Will the new scheduling algorithm
affect the solves as well?
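
For reference, the PETSc calling sequence I have in mind for the multi-RHS
case is roughly the following. This is a simplified sketch from memory with
error checking omitted, and the solver-package string may need adjusting for
a given PETSc build; as Barry notes in the exchange quoted below, MatMatSolve()
currently falls back to one solve per column when MUMPS is the factor package:

    #include <petscmat.h>

    /* Sketch only: A is the assembled sparse operator, B and X are dense
     * n-by-nrhs matrices (MATDENSE).  Error checking is omitted. */
    PetscErrorCode SolveManyRHS(Mat A, Mat B, Mat X)
    {
      Mat           F;
      IS            rowperm, colperm;
      MatFactorInfo info;

      MatGetFactor(A, "mumps", MAT_FACTOR_LU, &F);
      MatGetOrdering(A, MATORDERINGND, &rowperm, &colperm);
      MatFactorInfoInitialize(&info);
      MatLUFactorSymbolic(F, A, rowperm, colperm, &info);
      MatLUFactorNumeric(F, A, &info);
      /* One call covers all nrhs columns; internally this is currently a
         loop of single right-hand-side solves when MUMPS is the backend. */
      MatMatSolve(F, B, X);
      ISDestroy(&rowperm);
      ISDestroy(&colperm);
      MatDestroy(&F);
      return 0;
    }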

Best Regards,
Jack


>
> On Tue, Aug 16, 2011 at 8:34 PM, Rebecca Yuan <rebeccayxf at gmail.com>wrote:
>
>>
>>
>>
>> Begin forwarded message:
>>
>> *From:* Jack Poulson <jack.poulson at gmail.com>
>> *Date:* August 16, 2011 10:18:16 PM CDT
>>
>> *To:* For users of the development version of PETSc <
>> petsc-dev at mcs.anl.gov>
>> *Subject:* *Re: [petsc-dev] Wrapper for WSMP*
>> *Reply-To:* For users of the development version of PETSc <
>> petsc-dev at mcs.anl.gov>
>>
>> On Tue, Aug 16, 2011 at 9:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>> On Aug 16, 2011, at 5:14 PM, Jack Poulson wrote:
>>>
>>> > Hello all,
>>> >
>>> > I am working on a project that requires very fast sparse direct solves
>>> and MUMPS and SuperLU_Dist haven't been cutting it. From what I've read,
>>> when properly tuned, WSMP is significantly faster, particularly with
>>> multiple right-hand sides on large machines. The obvious drawback is that
>>> it's not open source, but the binaries seem to be readily available for most
>>> platforms.
>>> >
>>> > Before I reinvent the wheel, I would like to check if anyone has
>>> already done some work on adding it into PETSc. If not, its interface is
>>> quite similar to MUMPS and I should be able to mirror most of that code. On
>>> the other hand, there are a large number of platform-specific details that
>>> need to be handled, so keeping things both portable and fast might be a
>>> challenge. It seems that the CSC storage format should be used since it is
>>> required for Hermitian matrices.
>>> >
>>> > Thanks,
>>> > Jack
>>>
>>>   Jack,
>>>
>>>   By all means do it. That would be a nice thing to have. But be aware
>>> that the WSMP folks have a reputation for exaggerating how much better
>>> their software is, so don't be surprised if, after all that work, it is
>>> not much better.
>>>
>>>
>> Good to know. I was somewhat worried about that, but perhaps it is a
>> matter of getting all of the tuning parameters right. The manual does
>> mention that performance is significantly degraded without tuning. I would
>> sincerely hope no one would outright lie in their publications, e.g., this
>> one: http://portal.acm.org/citation.cfm?id=1654061
>>
>>
>>>   BTW: are you solving with many right hand sides? Maybe before you muck
>>> with WSMP we should figure out how to get you access to the multiple right
>>> hand side support of MUMPS (I don't know if SuperLU_Dist has it) so you can
>>> speed up your current computations a good amount? Currently PETSc's
>>> MatMatSolve() calls a separate solve for each right hand side with MUMPS.
>>>
>>>   Barry
>>>
>>>
>> I will eventually need to solve against many right-hand sides, but for now
>> I am solving against one and it is still taking too long; in fact, not only
>> does it take too long, but the memory per core also increased for fixed
>> problem sizes as I increased the number of MPI processes (for both
>> SuperLU_Dist and MUMPS). This was occurring for quasi-2D Helmholtz problems
>> over a couple hundred cores. My only logical explanation for this behavior
>> is that each process's communication buffers grow in proportion to the
>> total number of processes, but I stress that this is just a guess. I tried
>> reading through the MUMPS code and quickly gave up.
>>
>> Another problem with MUMPS is that it requires the entire set of right-hand
>> sides to reside on the root process...that will clearly not work for a
>> billion degrees of freedom with several hundred RHSs. WSMP gets this part
>> right and actually distributes those vectors.
>>
>> Jack
>>
>>
>

