[petsc-dev] Wrapper for WSMP

Hong Zhang hzhang at mcs.anl.gov
Thu Aug 18 19:07:00 CDT 2011


MatMatSolve() is supported by the SuperLU_DIST interface in petsc-dev.
Hong
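
For concreteness, a rough sketch of driving that path through PETSc's factored-matrix interface (a sketch only: it assumes an assembled parallel AIJ matrix A, the nrhs value and the fill of B are illustrative, the dense-matrix constructor name varies across PETSc versions, and error checking is omitted):

    /* Factor A with SuperLU_DIST and solve against nrhs right-hand
       sides in a single MatMatSolve() call. */
    Mat           F, B, X;
    IS            rowperm, colperm;
    MatFactorInfo info;
    PetscInt      m, n, M, N, nrhs = 8;            /* nrhs is illustrative */

    MatGetLocalSize(A, &m, &n);
    MatGetSize(A, &M, &N);
    MatCreateDense(PETSC_COMM_WORLD, m, PETSC_DECIDE, M, nrhs, NULL, &B);
    MatCreateDense(PETSC_COMM_WORLD, m, PETSC_DECIDE, M, nrhs, NULL, &X);
    /* ... fill the locally owned rows of B, then assemble ... */
    MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);

    MatGetOrdering(A, MATORDERINGNATURAL, &rowperm, &colperm);
    MatFactorInfoInitialize(&info);
    MatGetFactor(A, "superlu_dist", MAT_FACTOR_LU, &F);
    MatLUFactorSymbolic(F, A, rowperm, colperm, &info);
    MatLUFactorNumeric(F, A, &info);
    MatMatSolve(F, B, X);                          /* all columns at once */

The same thing can be driven through a KSP/PCLU setup from the options database; the Mat-level calls are shown only because they make the MatMatSolve() step explicit.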

On Thu, Aug 18, 2011 at 5:32 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> Hello,
>
> On Thu, Aug 18, 2011 at 5:11 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>
>> I can clarify a couple of questions re. SuperLU_DIST.
>>
>> 1) SuperLU does support multiple right-hand sides.  That is, the B matrix
>> on the right can be a dense matrix of size n-by-nrhs.  Also, the B matrix
>> is distributed among all the processors, with each processor holding one
>> block row of B.  There is no need to have the entire B on every processor.
>>
>
> I certainly agree that this is an improvement over MUMPS.
>
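
As a point of reference, the block-row layout described above amounts to each rank owning a contiguous slice of the n-by-nrhs matrix B, stored column-major with leading dimension equal to its local row count. A tiny illustrative fragment (variable names are made up; see the pdgssvx examples shipped with SuperLU_DIST for the real driver call):

    /* Each of nprocs ranks owns m_loc contiguous rows of the n-by-nrhs
       right-hand-side matrix B; nothing is replicated. */
    int m_loc   = n / nprocs + (rank < n % nprocs ? 1 : 0);
    int fst_row = rank * (n / nprocs) + (rank < n % nprocs ? rank : n % nprocs);
    double *B_loc = (double *) malloc((size_t) m_loc * nrhs * sizeof(double));
    /* B_loc[i + j*m_loc] holds global row (fst_row + i) of right-hand side j */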
>>
>> 2) We are preparing to upgrade to a newer version.  The parallel
>> factorization is improved with a better scheduling algorithm.  This is
>> particularly effective at larger core counts, say in the hundreds.
>>
>
> I will be looking forward to the new version. I took a several-month hiatus
> from trying to use sparse-direct solvers for my subproblems since I wasn't
> seeing substantial speedups over the serial algorithm; it was taking a few
> hundred cores to solve 3D Helmholtz over 256^3 domains in an hour.
>
>>
>> 3) Regarding memory usage, the factorization algorithm used in SuperLU
>> mostly performs "update in-place", and requires just a little extra
>> working storage.  So, if the load balance is not too bad, the memory per
>> core should go down consistently.  It will be helpful if you can give some
>> concrete numbers showing how much memory usage increases with increasing
>> core count.
>>
>
> As for specifics, factorizations of 256 x 256 x 10 grids with 7-point
> finite-difference stencils required more memory per process when I increased
> past ~200 processes. I was storing roughly 50 of these factorizations before
> running out of memory, no matter how many more processes I threw at
> SuperLU_Dist and MUMPS. I would think that the load balance would be decent
> since it is such a regular grid.
> Also, if I'm recalling correctly, on 256 cores I was only seeing ~5 GFlop/s
> in each triangular solve against a subdomain, which isn't much of an
> improvement over my laptop. Does this sound pathological to you, or is it to
> be expected from SuperLU_Dist and MUMPS? The linked WSMP paper showed some
> impressive triangular-solve scalability. Will the new scheduling algorithm
> affect the solves as well?
> Best Regards,
> Jack
>>
>>
>> On Tue, Aug 16, 2011 at 8:34 PM, Rebecca Yuan <rebeccayxf at gmail.com>
>> wrote:
>>>
>>>
>>>
>>> Begin forwarded message:
>>>
>>> From: Jack Poulson <jack.poulson at gmail.com>
>>> Date: August 16, 2011 10:18:16 PM CDT
>>> To: For users of the development version of PETSc <petsc-dev at mcs.anl.gov>
>>> Subject: Re: [petsc-dev] Wrapper for WSMP
>>> Reply-To: For users of the development version of PETSc
>>> <petsc-dev at mcs.anl.gov>
>>>
>>> On Tue, Aug 16, 2011 at 9:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> On Aug 16, 2011, at 5:14 PM, Jack Poulson wrote:
>>>>
>>>> > Hello all,
>>>> >
>>>> > I am working on a project that requires very fast sparse direct solves,
>>>> > and MUMPS and SuperLU_Dist haven't been cutting it. From what I've read,
>>>> > when properly tuned, WSMP is significantly faster, particularly with
>>>> > multiple right-hand sides on large machines. The obvious drawback is that
>>>> > it's not open source, but the binaries seem to be readily available for
>>>> > most platforms.
>>>> >
>>>> > Before I reinvent the wheel, I would like to check whether anyone has
>>>> > already done some work on adding it to PETSc. If not, its interface is
>>>> > quite similar to that of MUMPS, and I should be able to mirror most of
>>>> > that code. On the other hand, there are a large number of
>>>> > platform-specific details that need to be handled, so keeping things both
>>>> > portable and fast might be a challenge. It seems that the CSC storage
>>>> > format should be used, since it is required for Hermitian matrices.
>>>> >
>>>> > Thanks,
>>>> > Jack
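
For reference, the compressed sparse column (CSC) layout mentioned above stores, for each column, the row indices and values of its nonzeros, so for a Hermitian matrix only one triangle needs to be supplied. A tiny illustrative fragment (0-based indices here; the indexing convention an actual solver expects is set by its own parameters):

    #include <complex.h>

    /* CSC storage of the lower triangle of the 3x3 Hermitian matrix
           [ 4    1-i  0 ]
           [ 1+i  3    2 ]
           [ 0    2    5 ]
       Column j occupies entries colptr[j] .. colptr[j+1]-1. */
    int            colptr[4] = {0, 2, 4, 5};
    int            rowind[5] = {0, 1, 1, 2, 2};
    double complex values[5] = {4.0, 1.0 + 1.0*I, 3.0, 2.0, 5.0};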
>>>>
>>>>  Jack,
>>>>
>>>>   By all means do it. That would be a nice thing to have. But be aware
>>>> that the WSMP folks have a reputation for exaggerating how much better their
>>>> software is, so don't be surprised if, after all that work, it is not much
>>>> better.
>>>>
>>>
>>> Good to know. I was somewhat worried about that, but perhaps it is a
>>> matter of getting all of the tuning parameters right. The manual does
>>> mention that performance is significantly degraded without tuning. I would
>>> sincerely hope no one would outright lie in their publications, e.g., this
>>> one:
>>> http://portal.acm.org/citation.cfm?id=1654061
>>>
>>>>
>>>>   BTW: are you solving with many right-hand sides? Maybe before you muck
>>>> with WSMP we should figure out how to get you access to the multiple
>>>> right-hand-side support of MUMPS (I don't know if SuperLU_Dist has it) so
>>>> you can speed up your current computations a good amount. Currently PETSc's
>>>> MatMatSolve() calls a separate solve for each right-hand side with MUMPS.
>>>>
>>>>   Barry
>>>>
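
For reference, that per-column behavior amounts to something like the sketch below (illustrative only, not PETSc's literal internal code; it assumes a square factored matrix F and dense B, X with matching row layout and default leading dimension, and omits error checking):

    /* One MatSolve() per right-hand side, wrapping columns of the dense
       B and X in work vectors via VecPlaceArray(). */
    PetscErrorCode MatMatSolveByColumns(Mat F, Mat B, Mat X)
    {
      PetscInt     j, nrhs, N, mloc, nloc;
      PetscScalar *barr, *xarr;
      Vec          b, x;

      MatGetSize(B, &N, &nrhs);
      MatGetLocalSize(B, &mloc, &nloc);
      MatDenseGetArray(B, &barr);            /* column-major local storage */
      MatDenseGetArray(X, &xarr);
      MatGetVecs(F, &x, &b);                 /* MatCreateVecs() in newer PETSc */
      for (j = 0; j < nrhs; j++) {
        VecPlaceArray(b, barr + j*mloc);     /* view of column j of B */
        VecPlaceArray(x, xarr + j*mloc);     /* view of column j of X */
        MatSolve(F, b, x);                   /* one triangular solve per RHS */
        VecResetArray(b);
        VecResetArray(x);
      }
      MatDenseRestoreArray(B, &barr);
      MatDenseRestoreArray(X, &xarr);
      VecDestroy(&b);
      VecDestroy(&x);
      return 0;
    }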
>>>
>>> I will eventually need to solve against many right-hand sides, but for
>>> now I am solving against one and it is still taking too long; in fact, not
>>> only does it take too long, but memory per core increases for a fixed
>>> problem size as I increase the number of MPI processes (for both
>>> SuperLU_Dist and MUMPS). This was occurring for quasi-2D Helmholtz problems
>>> over a couple hundred cores. My only explanation for this behavior is that
>>> the communication buffers on each process grow in proportion to the number
>>> of processes, but I stress that this is just a guess. I tried reading
>>> through the MUMPS code and quickly gave up.
>>> Another problem with MUMPS is that it requires the entire set of right-hand
>>> sides to reside on the root process... that will clearly not work for a
>>> billion degrees of freedom with several hundred RHSs. WSMP gets this part
>>> right and actually distributes those vectors.
>>> Jack
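
For reference, the centralized convention being described looks roughly like the fragment below in the MUMPS C interface (a sketch only; the struct and field names follow the MUMPS c_example, exact fields and control settings vary between MUMPS versions, and the matrix input is elided):

    #include <mpi.h>
    #include "dmumps_c.h"
    #define ICNTL(I) icntl[(I)-1]      /* MUMPS uses 1-based Fortran numbering */

    /* The full n-by-nrhs dense right-hand-side block lives only on the host. */
    DMUMPS_STRUC_C id;
    id.job = -1; id.par = 1; id.sym = 0;
    id.comm_fortran = -987654;         /* "use MPI_COMM_WORLD" magic value */
    dmumps_c(&id);                     /* initialize */
    /* ... set id.n and the matrix entries (centralized or distributed) ... */
    if (rank == 0) {
      id.ICNTL(20) = 0;                /* dense right-hand sides */
      id.rhs  = rhs_on_host;           /* n*nrhs entries, column-major */
      id.nrhs = nrhs;
      id.lrhs = n;                     /* leading dimension */
    }
    id.job = 6;                        /* analysis + factorization + solve */
    dmumps_c(&id);
    id.job = -2; dmumps_c(&id);        /* clean up */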
>
>


