[petsc-users] Scalability issue

Matthew Knepley knepley at gmail.com
Thu Aug 20 10:17:26 CDT 2015


On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <
nelsonflsilva at ist.utl.pt> wrote:

> Hello.
>
> I am sorry for the long delay in responding. I decided to rewrite my
> application in a different way and will send the log_summary output when
> I am done reimplementing.
>
> As for the machine, I am using mpirun to run jobs on an 8-node cluster. I
> modified the makefile in the streams folder so it would run using my
> hostfile.
> The output is attached to this email. It seems reasonable for a cluster
> with 8 machines. From "lscpu", each machine's CPU has 4 cores and 1 socket.
>

1) Your launcher is placing processes haphazardly. I would figure out how to
assign them to specific nodes.

2) Each node has enough memory bandwidth for only 1 core, so it does not make
much sense to use more than 1 core per node.

  Thanks,

    Matt


> Cheers,
> Nelson
>
>
> On 2015-07-24 16:50, Barry Smith wrote:
>
>> It would be very helpful if you ran the code on, say, 1, 2, 4, 8, 16,
>> ... processes with the option -log_summary and sent (as attachments)
>> the log summary information.
>>
>>    Also on the same machine run the streams benchmark; with recent
>> releases of PETSc you only need to do
>>
>> cd $PETSC_DIR
>> make streams NPMAX=16 (or whatever your largest process count is)
>>
>> and send the output.
>>
>> I suspect that you are doing everything fine and it is more an issue
>> with the configuration of your machine. Also read the information at
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on
>> "binding"
>>
>>   Barry
>>
>> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt> wrote:
>>>
>>> Hello,
>>>
>>> I have been using PETSc for a few months now, and it truly is a fantastic
>>> piece of software.
>>>
>>> In my particular example I am working with a large, sparse, distributed
>>> (MPIAIJ) matrix we can refer to as 'G'.
>>> G is a horizontally rectangular matrix (for example, 1.1 million rows by
>>> 2.1 million columns). This matrix is typically very sparse and not diagonally
>>> 'heavy' (for example, 5.2 million nonzeros, of which ~50% are in the diagonal
>>> block of the MPIAIJ representation).
>>> To work with this matrix, I also have a few parallel vectors (created
>>> using MatCreateVecs), which we can refer to as 'm' and 'k'.
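
For reference, a minimal sketch of this kind of setup with the PETSc C API;
the global sizes and preallocation values below are illustrative assumptions,
not taken from the code described above:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            G;
  Vec            m, b;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

  /* G: a wide, sparse MPIAIJ matrix; 1.1M x 2.1M used as an illustrative size */
  ierr = MatCreate(PETSC_COMM_WORLD, &G);CHKERRQ(ierr);
  ierr = MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, 1100000, 2100000);CHKERRQ(ierr);
  ierr = MatSetType(G, MATMPIAIJ);CHKERRQ(ierr);
  /* rough preallocation: a few nonzeros per row, split between the diagonal
     and off-diagonal blocks (illustrative values) */
  ierr = MatMPIAIJSetPreallocation(G, 3, NULL, 3, NULL);CHKERRQ(ierr);
  /* ... MatSetValues() loop, then MatAssemblyBegin()/MatAssemblyEnd() ... */

  /* vectors whose parallel layouts match the columns (m) and rows (b) of G */
  ierr = MatCreateVecs(G, &m, &b);CHKERRQ(ierr);

  ierr = VecDestroy(&m);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&G);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}
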
>>> I am trying to parallelize an iterative algorithm in which the most
>>> computationally heavy operations are:
>>>
>>> ->Matrix-Vector Multiplication, more precisely G * m + k = b
>>> (MatMultAdd). From what I have been reading, to achieve a good speedup in
>>> this operation, G should have as many of its nonzeros as possible in the
>>> diagonal block, so that communication can be overlapped with computation.
>>> But even when using a G matrix in which the diagonal block has ~95% of the
>>> nonzeros, I cannot get a decent speedup. Most of the time, the performance
>>> even gets worse.
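
As a point of reference, this operation is a single call; a minimal sketch,
assuming m, k, and b were created with layouts compatible with G (e.g. via
MatCreateVecs):

  /* b = k + G*m; the off-diagonal block of the MPIAIJ format drives the
     VecScatter communication that PETSc overlaps with the local multiply */
  ierr = MatMultAdd(G, m, k, b);CHKERRQ(ierr);
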
>>>
>>> ->Matrix-Matrix Multiplication, in this case I need to compute G * G' =
>>> A, where A is later used in the linear solver and G' is the transpose of G.
>>> The speedup in this operation does not get worse, although it is not very
>>> good.
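
One way to form A = G*G' for an MPIAIJ matrix is to build the transpose
explicitly and use MatMatMult; a sketch, assuming the nonzero pattern of G
does not change between iterations so the symbolic product can be reused:

  Mat Gt, A;

  /* first iteration: create the transpose and the product */
  ierr = MatTranspose(G, MAT_INITIAL_MATRIX, &Gt);CHKERRQ(ierr);
  ierr = MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);CHKERRQ(ierr);

  /* later iterations: reuse the existing nonzero structures */
  ierr = MatTranspose(G, MAT_REUSE_MATRIX, &Gt);CHKERRQ(ierr);
  ierr = MatMatMult(G, Gt, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);CHKERRQ(ierr);
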
>>>
>>> ->Linear problem solving. Lastly, in this operation I solve "Ax=b" using
>>> the results of the last two operations. I tried to apply an RCM permutation
>>> to A to make it more diagonal, for better performance. However, the problem
>>> I faced was that the permutation is performed locally on each processor, and
>>> thus the final result differs with different numbers of processors. I assume
>>> this was intended to reduce communication. The solution I found was:
>>> 1-calculate A
>>> 2-calculate, locally on 1 machine, the RCM permutation IS using A
>>> 3-apply this permutation to the rows of G.
>>> This works well, and A is generated as if it had been RCM permuted. It is
>>> fine to do this step on one machine because it is only done once, while
>>> reading the input. However, the nonzeros of G become more spread out and
>>> less diagonal, causing problems when calculating G * m + k = b.
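
For reference, a minimal sketch of the ordering step as described above,
assuming 'Aseq' is a sequential copy of A gathered onto one process (the
gathering and the redistribution of the resulting index set are elided):

  IS  rperm, cperm;
  Mat Aperm;

  /* RCM ordering computed from the square, symmetric matrix A = G*G' */
  ierr = MatGetOrdering(Aseq, MATORDERINGRCM, &rperm, &cperm);CHKERRQ(ierr);

  /* applying rperm to the rows of G (step 3 above) and re-forming A = G*G'
     gives the same result as this symmetric permutation of A */
  ierr = MatPermute(Aseq, rperm, cperm, &Aperm);CHKERRQ(ierr);
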
>>>
>>> These 3 operations (except the permutation) are performed in each
>>> iteration of my algorithm.
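
For completeness, the solve that consumes A would be a standard KSP sequence;
a minimal sketch, with the right-hand side b and solution vector x assumed to
exist and the solver and preconditioner left to runtime options:

  KSP ksp;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* e.g. -ksp_type cg -pc_type jacobi */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
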
>>>
>>> So, my questions are:
>>> -What are the characteristics of G that lead to a good speedup in the
>>> operations I described? Am I missing something, or am I too obsessed with
>>> the diagonal block?
>>>
>>> -Is there a better way to permute A without permuting G and still get the
>>> same result using 1 or N machines?
>>>
>>>
>>> I have been avoiding asking for help for a while. I'm very sorry for the
>>> long email.
>>> Thank you very much for your time.
>>> Best Regards,
>>> Nelson
>>>
>>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener