[petsc-users] Scalability issue

Sun Aug 23 10:12:22 CDT 2015

Thank you for the fast response!

Yes. The last rows of the matrix are indeed denser than the 
remaining ones.
For this example, concerning load balance between machines, the last 
process had 46% of the matrix nonzero entries. A few weeks ago I 
suspected this problem and wrote a small function that permutes 
the matrix rows based on their number of nonzeros. However, the permuted 
matrix had a worse "diagonal block weight", so I stopped 
using it because I thought it was making things worse.
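
An alternative to permuting rows, sketched here only as an illustration, is to keep the row order and pick uneven contiguous row blocks so that each process holds roughly the same number of nonzeros. The function name balanced_row_counts and the toy input are hypothetical; the resulting counts would be supplied as local row sizes (for example through MatSetSizes) instead of PETSC_DECIDE. This balances nnz only and does nothing for the diagonal block weight.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_row_counts(row_nnz, nprocs):
    """Split rows into nprocs contiguous blocks with roughly equal nnz.

    row_nnz: non-empty list with the number of nonzeros of each row.
    Returns the number of rows assigned to each process.
    """
    prefix = list(accumulate(row_nnz))  # prefix[i] = nnz in rows 0..i
    total = prefix[-1]
    bounds = [0]
    for p in range(1, nprocs):
        target = total * p / nprocs
        # first row whose cumulative nnz reaches the target, as an exclusive end
        cut = bisect_left(prefix, target) + 1
        bounds.append(min(max(cut, bounds[-1]), len(row_nnz)))
    bounds.append(len(row_nnz))
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


# Toy example: the last two rows are much denser than the others.
toy_nnz = [1, 1, 1, 1, 1, 1, 3, 3]
counts = balanced_row_counts(toy_nnz, 2)
print(counts)  # an even split would give 4 rows each (4 vs 8 nonzeros)
```

On the toy input the first process gets more rows and the second gets the dense tail, so both end up with 6 nonzeros.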

Also, because of this problem, I thought I could keep a complete copy 
of the vector on each process instead of a distributed vector. I tried to 
implement this idea, but had no luck with the results. Even if 
this approach had worked, communication to update the vector would still 
be unavoidable once per iteration of my algorithm.
Since this is a rectangular matrix, I cannot apply RCM or similar 
permutations; I can still permute its rows and columns, though.

More specifically, the problem I'm trying to solve is to balance 
the best guess and uncertainty estimates of a set of Input-Output 
subject to linear constraints and ancillary information. The matrix is 
called an aggregation matrix, and each entry can be 1, 0 or -1. I don't 
know the cause of its nonzero structure. I'm addressing this problem 
using a weighted least-squares algorithm.

I ran the code with a different, friendlier problem topology, 
logging the share of nonzero entries and the "diagonal load" per 
process.
I'm sending images of the nonzero structure of both matrices. The example 
in my last email used matrix1; the example in this email uses matrix2.
Matrix1 (last email's example) has 1,098,939 rows x 2,039,681 columns and 
5,171,901 nnz.
Matrix2 (this email's example) has 800,000 rows x 8,800,000 columns 
and 16,800,000 nnz.
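
For readers who want to reproduce this kind of logging, here is a minimal sketch of how the local nnz and the diagonal-block nnz per process could be counted. It assumes rows and columns are both split into even contiguous ranges (the first n % nprocs ranks get one extra item), which is my understanding of what PETSC_DECIDE produces; the names ownership_ranges and nnz_per_rank are hypothetical and this is not the code that produced the logs.

```python
def ownership_ranges(n, nprocs):
    """Even contiguous split; the first n % nprocs ranks get one extra item."""
    ranges, start = [], 0
    for rank in range(nprocs):
        size = n // nprocs + (1 if rank < n % nprocs else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def nnz_per_rank(entries, nrows, ncols, nprocs):
    """Count (local_nnz, diagonal_block_nnz) for each rank.

    entries: iterable of (row, col) positions of the nonzeros.
    A nonzero is in the diagonal block of a rank when the rank owns both
    its row and its column (assumed layout, see the text above).
    """
    rows = ownership_ranges(nrows, nprocs)
    cols = ownership_ranges(ncols, nprocs)
    stats = [[0, 0] for _ in range(nprocs)]
    for r, c in entries:
        # linear search for the row owner; fine for a small sketch
        rank = next(p for p, (lo, hi) in enumerate(rows) if lo <= r < hi)
        stats[rank][0] += 1
        col_lo, col_hi = cols[rank]
        if col_lo <= c < col_hi:
            stats[rank][1] += 1
    return [tuple(s) for s in stats]


# Toy 4 x 6 matrix on 2 processes.
toy_entries = [(0, 0), (0, 4), (1, 2), (2, 1), (3, 5), (3, 3)]
for rank, (local, diag) in enumerate(nnz_per_rank(toy_entries, 4, 6, 2)):
    print(f"[{rank}] local nnz: {local}, diagonal nnz: {diag}")
```

With 800,000 rows on 3 processes, ownership_ranges gives 266667, 266667 and 266666 local rows, which matches the local row counts in the logs.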

With 1, 2 and 3 machines, I get the following distributions of nonzeros 
(using matrix2). The logs are attached to this email.
1 machine
[0] Matrix diagonal_nnz:16800000 (100.00 %)
[0] Matrix local nnz: 16800000 (100.00 %), local rows: 800000 (100.00 %)
ExecTime: 4min47sec

2 machines
[0] Matrix diagonal_nnz:4400000 (52.38 %)
[1] Matrix diagonal_nnz:4000000 (47.62 %)

[0] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
[1] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
ExecTime: 13min23sec

3 machines
[0] Matrix diagonal_nnz:2933334 (52.38 %)
[1] Matrix diagonal_nnz:533327 (9.52 %)
[2] Matrix diagonal_nnz:2399999 (42.86 %)

[0] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[1] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[2] Matrix local nnz: 5599986 (33.33 %), local rows: 266666 (33.33 %)
ExecTime: 20min26sec
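
To put the execution times above in perspective, the usual speedup T1/Tp and parallel efficiency T1/(p*Tp) can be computed directly from them. The short sketch below does only that arithmetic; the helper names are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XminYYsec' reading into seconds."""
    return 60 * minutes + seconds


# Execution times reported above for matrix2, keyed by number of machines.
times = {
    1: to_seconds(4, 47),
    2: to_seconds(13, 23),
    3: to_seconds(20, 26),
}


def speedup_table(times):
    """Return {p: (speedup, efficiency)} relative to the 1-machine run."""
    t1 = times[1]
    return {p: (t1 / t, t1 / (p * t)) for p, t in times.items()}


for p, (speedup, efficiency) in sorted(speedup_table(times).items()):
    print(f"{p} machine(s): speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The speedup is about 0.36 with 2 machines and about 0.23 with 3, so the runs become roughly 2.8 and 4.3 times slower than the single-machine run, even though the nonzeros are split evenly.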

As for the network, I ran "make streams NPMAX=3" again. Its output is 
also attached to this email.

I also think that these bad results are caused by a combination of bad 
matrix structure, especially the "diagonal weight", and possibly the network.

I really need to find a way to permute these matrices into a friendlier 
structure.

Thank you very much for the help.
Nelson

On 2015-08-22 22:49, Barry Smith wrote:
>> On Aug 22, 2015, at 4:17 PM, Nelson Filipe Lopes da Silva 
>> <nelsonflsilva at ist.utl.pt> wrote:
>>
>> Hi.
>>
>>
>> I managed to finish the re-implementation. I ran the program with 
>> 1, 2, 3, 4, 5 and 6 machines and saved the summary. I am sending each 
>> of them in this email.
>> In these executions, the program performs Matrix-Vector (MatMult, 
>> MatMultAdd) products and Vector-Vector operations. From what I 
>> understand from reading the logs, the program spends most of its time 
>> in "VecScatterEnd".
>> In this example, the matrix taking part in the Matrix-Vector 
>> products is not very "diagonal heavy".
>> The following numbers are the percentages of nnz values in the 
>> matrix diagonal block for each machine, and the execution time of each run.
>> NMachines   Machine    %NNZ     ExecTime
>> 1           machine0   100%     16min08sec
>>
>> 2           machine0   91.1%    24min58sec
>>             machine1   69.2%
>>
>> 3           machine0   90.9%    25min42sec
>>             machine1   82.8%
>>             machine2   51.6%
>>
>> 4           machine0   91.9%    26min27sec
>>             machine1   82.4%
>>             machine2   73.1%
>>             machine3   39.9%
>>
>> 5           machine0   93.2%    39min23sec
>>             machine1   82.8%
>>             machine2   74.4%
>>             machine3   64.6%
>>             machine4   31.6%
>>
>> 6           machine0   94.2%    54min54sec
>>             machine1   82.6%
>>             machine2   73.1%
>>             machine3   65.2%
>>             machine4   55.9%
>>             machine5   25.4%
>
>    Based on this I am guessing the last rows of the matrix have a lot
> of nonzeros away from the diagonal?
>
>    There is a big load imbalance in something: for example with 2
> processes you have
>
> VecMax             10509 1.0 2.0602e+02 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+04  9  0  0  0 72   9  0  0  0 72     0
> VecScatterEnd      18128 1.0 8.9404e+02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 53  0  0  0  0  53  0  0  0  0     0
> MatMult            10505 1.0 6.5591e+02 1.4 3.16e+10 1.4 2.1e+04 1.2e+06 0.0e+00 37 33 58 38  0  37 33 58 38  0    83
> MatMultAdd          7624 1.0 7.0028e+02 2.3 3.26e+10 2.1 1.5e+04 2.8e+06 0.0e+00 34 29 42 62  0  34 29 42 62  0    69
>
>   The 5th column shows the time imbalance between the slowest and
> fastest process. It is 4.2 for VecMax, 1.4 for MatMult and 2.3 for
> MatMultAdd; to get good speedups these need to be much closer to 1.
>
>   How many nonzeros in the matrix are there per process? Is it very
> different for different processes? You really need each process to
> have a similar number of matrix nonzeros. Do you have a picture of
> the nonzero structure of the matrix? Where does the matrix come
> from, and why does it have this structure?
>
>   Also, likely there are just too many vector entries that need to be
> scattered to the last process for the MatMults.
>>
>> In this implementation I'm using MatCreate and VecCreate. I'm also 
>> leaving the partition sizes in PETSC_DECIDE.
>>
>> Finally, to run the application, I'm using mpirun.hydra from mpich, 
>> downloaded by PETSc configure script.
>> I'm checking the process assignment as suggested on the last email.
>>
>> Am I missing anything?
>
>   Your network is very poor; likely Ethernet. It is hard to get much
> speedup with such slow reductions and sends and receives.
>
> Average time to get PetscTime(): 1.19209e-07
> Average time for MPI_Barrier(): 0.000215769
> Average time for zero size MPI_Send(): 5.94854e-05
>
>   I think you are seeing such bad results because of an unkind matrix
> nonzero structure, which gives poor load balance and too much
> communication, and a very poor computer network that makes all the
> needed communication totally dominate.
>
>
>>
>> Regards,
>> Nelson
>>
>> On 2015-08-20 16:17, Matthew Knepley wrote:
>>
>>> On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva 
>>> <nelsonflsilva at ist.utl.pt> wrote:
>>> Hello.
>>>
>>> I am sorry for taking so long to respond. I decided to rewrite 
>>> my application in a different way and will send the log_summary 
>>> output when I am done reimplementing.
>>>
>>> As for the machine, I am using mpirun to run jobs on an 8-node 
>>> cluster. I modified the makefile in the streams folder so it would run 
>>> using my hostfile.
>>> The output is attached to this email. It seems reasonable for a 
>>> cluster with 8 machines. According to "lscpu", each machine has 1 
>>> socket with 4 cores.
>>> 1) Your launcher is placing processes haphazardly. I would figure 
>>> out how to assign them to certain nodes
>>> 2) Each node has enough bandwidth for 1 core, so it does not make 
>>> much sense to use more than 1.
>>>   Thanks,
>>>     Matt
>>>
>>> Cheers,
>>> Nelson
>>>
>>>
>>> On 2015-07-24 16:50, Barry Smith wrote:
>>> It would be very helpful if you ran the code on say 1, 2, 4, 8, 16
>>> ... processes with the option -log_summary and sent (as attachments)
>>> the log summary information.
>>>
>>>    Also on the same machine run the streams benchmark; with recent
>>> releases of PETSc you only need to do
>>>
>>> cd $PETSC_DIR
>>> make streams NPMAX=16 (or whatever your largest process count is)
>>>
>>> and send the output.
>>>
>>> I suspect that you are doing everything fine and it is more an issue
>>> with the configuration of your machine. Also read the information at
>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on
>>> "binding"
>>>
>>>   Barry
>>>
>>> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva 
>>> <nelsonflsilva at ist.utl.pt> wrote:
>>>
>>> Hello,
>>>
>>> I have been using PETSc for a few months now, and it truly is a 
>>> fantastic piece of software.
>>>
>>> In my particular example I am working with a large, sparse 
>>> distributed (MPI AIJ) matrix which we can refer to as 'G'.
>>> G is a horizontal rectangular matrix (for example, 1.1 million 
>>> rows by 2.1 million columns). This matrix is commonly very sparse 
>>> and not diagonal 'heavy' (for example 5.2 million nnz, of which ~50% 
>>> are in the diagonal block of the MPI AIJ representation).
>>> To work with this matrix, I also have a few parallel vectors 
>>> (created using MatCreate Vec), which we can refer to as 'm' and 'k'.
>>> I am trying to parallelize an iterative algorithm in which the most 
>>> computationally heavy operations are:
>>>
>>> ->Matrix-Vector Multiplication, more precisely G * m + k = b 
>>> (MatMultAdd). From what I have been reading, to achieve a good speedup 
>>> in this operation, G should be as diagonal as possible, so that 
>>> communication and computation can overlap. But even when using a G 
>>> matrix in which the diagonal block has ~95% of the nnz, I cannot get 
>>> a decent speedup. Most of the time, the performance even gets worse.
>>>
>>> ->Matrix-Matrix Multiplication. In this case I need to compute G * 
>>> G' = A, where A is later used in the linear solver and G' is the 
>>> transpose of G. The speedup in this operation does not get worse, 
>>> although it is not very good.
>>>
>>> ->Linear problem solving. Lastly, in this operation I solve 
>>> "Ax=b" using the results of the last two operations. I tried to apply 
>>> an RCM permutation to A to make it more diagonal, for better 
>>> performance. However, the problem I faced was that the permutation is 
>>> performed locally on each processor and thus the final result differs 
>>> with the number of processors. I assume this was intended to 
>>> reduce communication. The solution I found was
>>> 1-calculate A
>>> 2-calculate, locally on 1 machine, the RCM permutation IS using A
>>> 3-apply this permutation to the rows of G.
>>> This works well, and A is generated as if RCM permuted. It is fine 
>>> to do this operation on one machine because it is only done once 
>>> while reading the input. However, the nnz of G become more spread out 
>>> and less diagonal, causing problems when calculating G * m + k = b.
>>>
>>> These 3 operations (except the permutation) are performed in each 
>>> iteration of my algorithm.
>>>
>>> So, my questions are:
>>> -What are the characteristics of G that lead to a good speedup in 
>>> the operations I described? Am I missing something and too 
>>> obsessed with the diagonal block?
>>>
>>> -Is there a better way to permute A without permuting G and still get 
>>> the same result using 1 or N machines?
>>>
>>>
>>> I have been avoiding asking for help for a while. I'm very sorry 
>>> for the long email.
>>> Thank you very much for your time.
>>> Best Regards,
>>> Nelson
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their 
>>> experiments is infinitely more interesting than any results to which 
>>> their experiments lead.
>>> -- Norbert Wiener
>>
>> 
>> <Log01P.txt><Log02P.txt><Log03P.txt><Log04P.txt><Log05P.txt><Log06P.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Log01P
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0004.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Log02P
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0005.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Log03P
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0006.ksh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix1.png
Type: application/octet-stream
Size: 1936 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix2.png
Type: application/octet-stream
Size: 2058 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0003.obj>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: streams.out
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150823/bb91eb4b/attachment-0007.ksh>