[petsc-users] Scalability issue

Nelson Filipe Lopes da Silva nelsonflsilva at ist.utl.pt
Sat Aug 22 16:17:21 CDT 2015


 

Hi. 

I managed to finish the re-implementation. I ran the program with 1, 2, 3, 4, 5, and 6 machines and saved the summaries. I am sending each of them with this email.
In these executions, the program performs Matrix-Vector products (MatMult, MatMultAdd) and Vector-Vector operations. From what I understand of the logs, the program spends most of its time in "VecScatterEnd".
In this example, the matrix taking part in the Matrix-Vector products is not very "diagonal heavy".
The following numbers are the percentage of nnz values in the diagonal block of the MPI AIJ matrix on each machine, together with each execution time.
NMachines  Machine   %NNZ (diag block)  ExecTime
1          machine0  100%               16min08sec

2          machine0   91.1%             24min58sec
           machine1   69.2%

3          machine0   90.9%             25min42sec
           machine1   82.8%
           machine2   51.6%

4          machine0   91.9%             26min27sec
           machine1   82.4%
           machine2   73.1%
           machine3   39.9%

5          machine0   93.2%             39min23sec
           machine1   82.8%
           machine2   74.4%
           machine3   64.6%
           machine4   31.6%

6          machine0   94.2%             54min54sec
           machine1   82.6%
           machine2   73.1%
           machine3   65.2%
           machine4   55.9%
           machine5   25.4%
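
For illustration, here is a minimal sketch of how such per-rank percentages can be computed for an MPIAIJ matrix (this is not necessarily how the numbers above were produced, and the helper name is made up):

  #include <petscmat.h>

  /* Report what fraction of each rank's nonzeros lies in the diagonal
     block of an MPIAIJ matrix. */
  static PetscErrorCode ReportDiagBlockFraction(Mat G)
  {
    Mat            Ad, Ao;   /* local diagonal and off-diagonal blocks */
    MatInfo        infod, infoo;
    PetscMPIInt    rank;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = MPI_Comm_rank(PetscObjectComm((PetscObject)G), &rank);CHKERRQ(ierr);
    ierr = MatMPIAIJGetSeqAIJ(G, &Ad, &Ao, NULL);CHKERRQ(ierr);
    ierr = MatGetInfo(Ad, MAT_LOCAL, &infod);CHKERRQ(ierr);
    ierr = MatGetInfo(Ao, MAT_LOCAL, &infoo);CHKERRQ(ierr);
    ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] %.1f%% of local nnz in the diagonal block\n",
                                   rank, 100.0 * infod.nz_used / (infod.nz_used + infoo.nz_used));CHKERRQ(ierr);
    ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }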


In this implementation I'm using MatCreate and VecCreate, and I'm leaving the partition (local) sizes as PETSC_DECIDE.
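
A minimal sketch of those creation calls, using the illustrative 1.1M x 2.1M sizes mentioned in the earlier email (error checking omitted):

  Mat G;
  Vec m, k, b;

  MatCreate(PETSC_COMM_WORLD, &G);
  MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, 1100000, 2100000); /* global rows x cols */
  MatSetType(G, MATMPIAIJ);
  MatSetUp(G);
  /* ... MatSetValues(...), MatAssemblyBegin/End(G, MAT_FINAL_ASSEMBLY) ... */

  VecCreate(PETSC_COMM_WORLD, &m);
  VecSetSizes(m, PETSC_DECIDE, 2100000);  /* must match the column layout of G */
  VecSetFromOptions(m);

  VecCreate(PETSC_COMM_WORLD, &k);
  VecSetSizes(k, PETSC_DECIDE, 1100000);  /* must match the row layout of G */
  VecSetFromOptions(k);
  VecDuplicate(k, &b);

With PETSC_DECIDE everywhere, the matrix and vectors get the same ownership split as long as the global sizes agree; MatCreateVecs(G, &m, &b) is another way to obtain vectors whose layouts are guaranteed to be compatible with G.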

Finally, to run the application I'm using mpirun.hydra from MPICH, which was downloaded by the PETSc configure script.
I'm checking the process assignment as suggested in the last email.

Am I missing anything?

Regards,
Nelson 

On 2015-08-20 16:17, Matthew Knepley wrote:

> On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [3]> wrote:
> 
>> Hello.
>> 
>> I am sorry for the long time without response. I decided to rewrite my application in a different way and will send the log_summary output when done reimplementing.
>> 
>> As for the machine, I am using mpirun to run jobs on an 8-node cluster. I modified the makefile in the streams folder so it would run using my hostfile.
>> The output is attached to this email. It seems reasonable for a cluster with 8 machines. From "lscpu", each machine's CPU has 4 cores and 1 socket.
>

> 1) Your launcher is placing processes haphazardly. I would figure out how to assign them to certain nodes.
> 2) Each node has enough bandwidth for 1 core, so it does not make much sense to use more than 1.
>
Thanks, 
> Matt 
> 
>> Cheers,
>> Nelson
>> 
>> On 2015-07-24 16:50, Barry Smith wrote:
>> 
>>> It would be very helpful if you ran the code on say 1, 2, 4, 8, 16, ... processes with the option -log_summary and send (as attachments) the log summary information.
>>> 
>>> Also, on the same machine run the streams benchmark; with recent releases of PETSc you only need to do
>>> 
>>> cd $PETSC_DIR
>>> make streams NPMAX=16 (or whatever your largest process count is)
>>> 
>>> and send the output.
>>> 
>>> I suspect that you are doing everything fine and it is more an issue with the configuration of your machine. Also read the information at
>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers [2] on "binding".
>>> 
>>> Barry
>>> 
>>>> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [1]> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I have been using PETSc for a few months now, and it truly is a fantastic piece of software.
>>>> 
>>>> In my particular example I am working with a large, sparse, distributed (MPI AIJ) matrix we can refer to as 'G'.
>>>> G is a wide, rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is typically very sparse and not 'diagonal heavy' (for example, 5.2 million nnz, of which ~50% are in the diagonal block of the MPI AIJ representation).
>>>> To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.
>>>> I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:
>>>> 
>>>> -> Matrix-Vector multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achieve a good speedup in this operation G should have as much of its nnz in the diagonal block as possible, so that communication can be overlapped with computation. But even when using a G matrix whose diagonal block holds ~95% of the nnz, I cannot get a decent speedup. Most of the time the performance even gets worse.
>>>> 
>>>> -> Matrix-Matrix multiplication; in this case I need to perform G * G' = A, where G' is the transpose of G and A is later used in the linear solver. The speedup of this operation does not get worse, although it is not very good either.
>>>> 
>>>> -> Linear problem solving. Lastly, in this operation I solve "Ax=b" using the results of the previous two operations. I tried to apply an RCM permutation to A to make its nonzeros more concentrated around the diagonal, for better performance. However, the problem I faced was that the permutation is performed locally on each processor, and thus the final result differs for different numbers of processors. I assume this was intended to reduce communication. The solution I found was:
>>>> 1 - calculate A;
>>>> 2 - calculate, locally on one machine, the RCM permutation IS using A;
>>>> 3 - apply this permutation to the rows of G (a rough sketch of steps 2-3 is given below).
>>>> This works well, and A is generated as if it had been RCM-permuted. It is fine to do this on one machine because it is only done once, while reading the input. However, the nnz of G become more spread out and less concentrated in the diagonal block, which causes problems when calculating G * m + k = b.
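
A rough sketch of steps 2-3 above (heavily simplified; "Aseq" stands for a sequential copy of A gathered on one machine, and redistributing the resulting index set to match G's parallel row layout is not shown):

  IS rperm, cperm;

  /* RCM ordering computed on the sequential (gathered) copy of A */
  MatGetOrdering(Aseq, MATORDERINGRCM, &rperm, &cperm);

  /* rperm would then be distributed so that each rank owns its slice of the
     permutation and applied to the rows of G, e.g. with MatPermute and an
     identity IS for the columns; that redistribution step is omitted here. */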
>>>>

>>>> These 3 operations (the permutation excluded, since it is done only once) are performed in each iteration of my algorithm; a rough sketch of the corresponding calls follows below.
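
For reference, a rough sketch of those three per-iteration operations expressed as standard PETSc calls (the helper name is made up; the KSP is assumed to be created and configured elsewhere, and G' is formed explicitly here, which may or may not be the cheapest option):

  #include <petscksp.h>

  /* Per-iteration step: b = G*m + k, A = G*G', then solve A x = b. */
  static PetscErrorCode IterationStep(Mat G, Vec m, Vec k, Vec b, KSP ksp, Vec x)
  {
    Mat            Gt, A;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = MatMultAdd(G, m, k, b);CHKERRQ(ierr);                                   /* b = G*m + k */
    ierr = MatTranspose(G, MAT_INITIAL_MATRIX, &Gt);CHKERRQ(ierr);
    ierr = MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);CHKERRQ(ierr); /* A = G*G'    */
    ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);                                      /* A x = b     */
    ierr = MatDestroy(&Gt);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

If the sparsity pattern of A does not change between iterations, reusing the product (MAT_REUSE_MATRIX after an initial MAT_INITIAL_MATRIX call) avoids rebuilding its structure every time.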
>>>> 
>>>> So, my questions are:
>>>> - What characteristics of G lead to a good speedup in the operations I described? Am I missing something, or am I too obsessed with the diagonal block?
>>>> 
>>>> - Is there a better way to permute A without permuting G and still get the same result using 1 or N machines?
>>>> 
>>>> I have been avoiding asking for help for a while. I'm very sorry for the long email.
>>>> Thank you very much for your time.
>>>> Best Regards,
>>>> Nelson
> 
> -- 
> 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener

 

Links:
------
[1] mailto:nelsonflsilva at ist.utl.pt
[2] http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
[3] mailto:nelsonflsilva at ist.utl.pt
-------------- next part --------------
Attached log files (plain-text attachments, charset unspecified):
Log01P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0006.ksh>
Log02P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0007.ksh>
Log03P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0008.ksh>
Log04P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0009.ksh>
Log05P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0010.ksh>
Log06P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0011.ksh>

