[petsc-users] Scalability issue

Nelson Filipe Lopes da Silva nelsonflsilva at ist.utl.pt
Mon Aug 24 11:08:48 CDT 2015


 

I understand.

That was indeed the case. I have been experimenting with different values and thresholds; the program was indeed oversolving due to severely low threshold values. Now all executions run for the same number of iterations.

The computational part of the program seems to be showing some speedup! The program was suffering from the poor matrix structure, and the problem was solved with the suggested permutation.

I'll keep experimenting with different matrices to figure out the best permutation for each case.

Thank you very much for your time!

Best regards,

Nelson

On 2015-08-24 15:28, Matthew Knepley wrote:

> On Mon, Aug 24, 2015 at 9:24 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [5]> wrote:
> 
>> Hello. Thank you very much for your time.
>> 
>> I understood the idea, and it works very well.
>> I also noticed that my algorithm performs a different number of iterations with different numbers of machines. The stop conditions are calculated using PETSc MatMultAdd. I strongly suspect there may be a bug in my own code, but could it be something in PETSc?
> 
> In parallel, a total order on summation is not guaranteed, and thus you will have jitter in the result. However, your iteration seems extremely sensitive to this (10s of iterations difference). Thus it seems that either your iterative tolerance is down around round-off error, which is usually oversolving, or you have an incredibly ill-conditioned system.
> Thanks, 
> Matt 
> 
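As a rough sketch of the kind of stopping test under discussion (not the code from this thread), a relative-residual check built on MatMultAdd could look like the following; the rtol value is an assumed placeholder chosen well above round-off:

#include <petscmat.h>

/* Sketch only: convergence test r = A*x - b with a relative tolerance.
   Keeping rtol well above machine precision avoids oversolving; 1e-8 is an
   assumed placeholder, not a value taken from this thread. */
PetscErrorCode CheckConverged(Mat A, Vec x, Vec b, Vec r, PetscBool *done)
{
  PetscReal       rnorm, bnorm;
  const PetscReal rtol = 1.0e-8;
  PetscErrorCode  ierr;

  PetscFunctionBegin;
  ierr  = VecNorm(b, NORM_2, &bnorm);CHKERRQ(ierr);
  ierr  = VecCopy(b, r);CHKERRQ(ierr);
  ierr  = VecScale(r, -1.0);CHKERRQ(ierr);        /* r = -b      */
  ierr  = MatMultAdd(A, x, r, r);CHKERRQ(ierr);   /* r = A*x - b */
  ierr  = VecNorm(r, NORM_2, &rnorm);CHKERRQ(ierr);
  *done = (PetscBool)(rnorm <= rtol * bnorm);
  PetscFunctionReturn(0);
}
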
>> I also need to figure out why those VecMax ratios are so high. The VecSet is understandable, as I'm distributing the initial information from the root machine sequentially.
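
One way to keep that initial distribution from running through a single rank is to have every process fill only its own locally owned slice. This is only a sketch, assuming the initial values can be computed or read on every rank; compute_entry() is a hypothetical stand-in for however the data is obtained:

#include <petscvec.h>

/* Sketch: each rank sets only its locally owned entries, so the
   VecAssemblyBegin/End pair moves no data off-process. */
extern PetscScalar compute_entry(PetscInt i);   /* hypothetical data source */

PetscErrorCode FillInitialVector(Vec x)
{
  PetscInt       rstart, rend, i;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecGetOwnershipRange(x, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    PetscScalar v = compute_entry(i);
    ierr = VecSetValues(x, 1, &i, &v, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
  ierr = VecAssemblyEnd(x);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
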
>> 
>> These are the new values:
>> 1 machine
>> [0] Matrix diagonal_nnz: 16800000 (100.00 %)
>> [0] Matrix local nnz: 16800000 (100.00 %), local rows: 800000 (100.00 %)
>> ExecTime: 4min47sec
>> Iterations: 236
>> 
>> 2 machines
>> [0] Matrix diagonal_nnz: 8000000 (95.24 %)
>> [1] Matrix diagonal_nnz: 7600000 (90.48 %)
>> 
>> [0] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
>> [1] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
>> ExecTime: 5min26sec
>> Iterations: 330
>> 
>> 3 machines
>> [0] Matrix diagonal_nnz: 5333340 (95.24 %)
>> [1] Matrix diagonal_nnz: 4800012 (85.71 %)
>> [2] Matrix diagonal_nnz: 4533332 (80.95 %)
>> 
>> [0] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
>> [1] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
>> [2] Matrix local nnz: 5599986 (33.33 %), local rows: 266666 (33.33 %)
>> ExecTime: 5min25sec
>> Iterations: 346
>> 
>> The suggested permutation worked very well in comparison with the original matrix structure. The lack of speedup may be related to the different numbers of iterations.
>>

>> Once again, thank you very much for the time.
>> Cheers,
>> Nelson

>> 
>> On 2015-08-23 20:19, Barry Smith wrote:
>> 
>>> A suggestion: take your second ordering and now interlace the second
>>> half of the rows with the first half of the rows (keeping the same
>>> column ordering). That is, order the rows 0, n/2, 1, n/2+1, 2, n/2+2,
>>> etc. This will take the two separate "diagonal" bands and form a
>>> single "diagonal" band. This will increase the "diagonal block
>>> weight" to be pretty high, and the only scatters needed will be for
>>> the final rows of the input vector that all processes need to do their
>>> part of the multiply. Generate the image to make sure what I suggest
>>> makes sense and then run this ordering with 1, 2, and 3 processes. Send
>>> the logs.
>>> 
>>> Barry
>>> 
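A possible way to apply this interlacing is through MatPermute. The sketch below assumes the global row count n is even, that the permuted matrix keeps the same row layout, and that the row index set lists, for each new row, the old row it takes; the exact convention should be checked against the MatPermute man page before relying on this:

#include <petscmat.h>

/* Sketch of the interlacing above: new row i takes old row i/2 when i is
   even and old row n/2 + (i-1)/2 when i is odd, i.e. the order
   0, n/2, 1, n/2+1, ...  The column ordering is left unchanged. */
PetscErrorCode InterlaceRows(Mat A, Mat *B)
{
  PetscInt       n, N, rstart, rend, nloc, cstart, cend, i, k;
  PetscInt      *rows, *cols;
  IS             isrow, iscol;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatGetSize(A, &n, &N);CHKERRQ(ierr);            /* global rows, cols */
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  nloc = rend - rstart;
  ierr = PetscMalloc1(nloc, &rows);CHKERRQ(ierr);
  for (i = rstart, k = 0; i < rend; i++, k++) {
    rows[k] = (i % 2 == 0) ? i/2 : n/2 + (i - 1)/2;      /* old row feeding new row i */
  }
  ierr = ISCreateGeneral(PETSC_COMM_WORLD, nloc, rows, PETSC_OWN_POINTER, &isrow);CHKERRQ(ierr);

  /* identity permutation for the columns (local slice of 0..N-1) */
  ierr = MatGetOwnershipRangeColumn(A, &cstart, &cend);CHKERRQ(ierr);
  ierr = PetscMalloc1(cend - cstart, &cols);CHKERRQ(ierr);
  for (i = cstart; i < cend; i++) cols[i - cstart] = i;
  ierr = ISCreateGeneral(PETSC_COMM_WORLD, cend - cstart, cols, PETSC_OWN_POINTER, &iscol);CHKERRQ(ierr);

  ierr = MatPermute(A, isrow, iscol, B);CHKERRQ(ierr);
  ierr = ISDestroy(&isrow);CHKERRQ(ierr);
  ierr = ISDestroy(&iscol);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
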
>>>> On Aug 23, 2015, at 10:12 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [4]> wrote:
>>>> 
>>>> Thank you for the fast response!
>>>> 
>>>> Yes. The last rows of the matrix are indeed more dense compared with the remaining ones.
>>>> For this example, concerning load balance between machines, the last process had 46% of the matrix nonzero entries. A few weeks ago I suspected this problem and wrote a little function that could permute the matrix rows based on their number of nonzeros. However, the matrix would become less pleasant regarding "diagonal block weight", and I stopped using it as I thought it was making things worse.
>>>> 
>>>> Also, due to this problem, I thought I could keep a complete copy of the vector on each processor instead of a distributed vector. I tried to implement this idea, but had no luck with the results. However, even if this solution worked, the communication to update the vector would still be needed once per iteration of my algorithm.
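
For reference, PETSc can build a full sequential copy of a distributed vector with VecScatterCreateToAll. This is only a sketch of that idea, not the attempt described above, and the per-iteration broadcast it implies can easily dominate:

#include <petscvec.h>

/* Sketch: gather the whole distributed vector into a sequential copy on
   every rank.  The caller keeps ctx and xlocal and destroys them at the end;
   the scatter itself is repeated whenever x changes. */
PetscErrorCode GatherFullCopy(Vec x, VecScatter *ctx, Vec *xlocal)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecScatterCreateToAll(x, ctx, xlocal);CHKERRQ(ierr);   /* create once */
  ierr = VecScatterBegin(*ctx, x, *xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(*ctx, x, *xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
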
>>>> Since this is a rectangular matrix I cannot apply RCM or similar permutations; I can, however, still permute rows and columns.
>>>> 
>>>> More specifically, the problem I'm trying to solve is one of balancing the best-guess and uncertainty estimates of a set of Input-Output data subject to linear constraints and ancillary information. The matrix is called an aggregation matrix, and each entry can be 1, 0, or -1. I don't know the cause of its nonzero structure. I'm addressing this problem using a weighted least-squares algorithm.
>>>> 
>>>> I ran the code with a different, more friendly problem topology, logging the number of nonzero entries and the "diagonal load" per processor.
>>>> I'm sending images of both matrices' nonzero structures. The last email's example used matrix1; the example in this email uses matrix2.
>>>> Matrix1 (last email's example) is 1,098,939 rows x 2,039,681 columns with 5,171,901 nnz.
>>>> Matrix2 (this email's example) is 800,000 rows x 8,800,000 columns with 16,800,000 nnz.
>>>> 
>>>> With 1, 2, and 3 machines, I get these distributions of nonzeros (using matrix2). I'm sending the logs in this email.
>>>> 1 machine
>>>> [0] Matrix diagonal_nnz: 16800000 (100.00 %)
>>>> [0] Matrix local nnz: 16800000 (100.00 %), local rows: 800000 (100.00 %)
>>>> ExecTime: 4min47sec
>>>> 
>>>> 2 machines
>>>> [0] Matrix diagonal_nnz: 4400000 (52.38 %)
>>>> [1] Matrix diagonal_nnz: 4000000 (47.62 %)
>>>> 
>>>> [0] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
>>>> [1] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
>>>> ExecTime: 13min23sec
>>>> 
>>>> 3 machines
>>>> [0] Matrix diagonal_nnz: 2933334 (52.38 %)
>>>> [1] Matrix diagonal_nnz: 533327 (9.52 %)
>>>> [2] Matrix diagonal_nnz: 2399999 (42.86 %)
>>>> 
>>>> [0] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
>>>> [1] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
>>>> [2] Matrix local nnz: 5599986 (33.33 %), local rows: 266666 (33.33 %)
>>>> ExecTime: 20min26sec
>>>> 
>>>> As for the network, I ran "make streams NPMAX=3" again. I'm also sending it in this email.
>>>> 
>>>> I too think that these bad results are caused by a combination of bad matrix structure, especially the "diagonal weight", and maybe the network.
>>>> 
>>>> I really should find a way to permute these matrices into a more friendly structure.
>>>> 
>>>> Thank you very much for the help.
>>>> Nelson
>>>>

>>>> On 2015-08-22 22:49, Barry Smith wrote:
>>>> 
>>>>>> On Aug 22, 2015, at 4:17 PM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [1]> wrote:
>>>>>> 
>>>>>> Hi.
>>>>>> 
>>>>>> I managed to finish the re-implementation. I ran the program with 1, 2, 3, 4, 5, and 6 machines and saved the summaries; I send each of them in this email.
>>>>>> In these executions, the program performs matrix-vector (MatMult, MatMultAdd) products and vector-vector operations. From what I understand while reading the logs, the program spends most of its time in VecScatterEnd.
>>>>>> In this example, the matrix taking part in the matrix-vector products is not very "diagonal heavy".
>>>>>> The following numbers are the percentages of nnz values in the matrix diagonal block for each machine, together with each execution time.
>>>>>> NMachines   %NNZ    ExecTime
>>>>>> 1  machine0  100%   16min08sec
>>>>>> 
>>>>>> 2  machine0  91.1%  24min58sec
>>>>>>    machine1  69.2%
>>>>>> 
>>>>>> 3  machine0  90.9%  25min42sec
>>>>>>    machine1  82.8%
>>>>>>    machine2  51.6%
>>>>>> 
>>>>>> 4  machine0  91.9%  26min27sec
>>>>>>    machine1  82.4%
>>>>>>    machine2  73.1%
>>>>>>    machine3  39.9%
>>>>>> 
>>>>>> 5  machine0  93.2%  39min23sec
>>>>>>    machine1  82.8%
>>>>>>    machine2  74.4%
>>>>>>    machine3  64.6%
>>>>>>    machine4  31.6%
>>>>>> 
>>>>>> 6  machine0  94.2%  54min54sec
>>>>>>    machine1  82.6%
>>>>>>    machine2  73.1%
>>>>>>    machine3  65.2%
>>>>>>    machine4  55.9%
>>>>>>    machine5  25.4%
>>>>> 
>>>>> Based on this I am guessing the last rows of the matrix have a lot of nonzeros away from the diagonal?
>>>>>

>>>>> There is a big load imbalance in something: for example with 2 processes you have
>>>>> 
>>>>> VecMax        10509 1.0 2.0602e+02 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+04  9  0  0  0 72   9  0  0  0 72     0
>>>>> VecScatterEnd 18128 1.0 8.9404e+02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 53  0  0  0  0  53  0  0  0  0     0
>>>>> MatMult       10505 1.0 6.5591e+02 1.4 3.16e+10 1.4 2.1e+04 1.2e+06 0.0e+00 37 33 58 38  0  37 33 58 38  0    83
>>>>> MatMultAdd     7624 1.0 7.0028e+02 2.3 3.26e+10 2.1 1.5e+04 2.8e+06 0.0e+00 34 29 42 62  0  34 29 42 62  0    69
>>>>> 
>>>>> The 5th column has the imbalance between the slowest and fastest process. It is 4.2 for VecMax, 1.4 for MatMult and 2.3 for MatMultAdd; to get good speedups these need to be much closer to 1.
>>>>> 
>>>>> How many nonzeros in the matrix are there per process? Is it very different for different processes? You really need to have each process have a similar number of matrix nonzeros. Do you have a picture of the nonzero structure of the matrix? Where does the matrix come from, and why does it have this structure?
>>>>> 
>>>>> Also it is likely there are just too many vector entries that need to be scattered to the last process for the matmults.
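
For reference, a sketch of how such per-process counts might be gathered, assuming an MPIAIJ matrix; MatGetInfo reports the local nonzeros and MatMPIAIJGetSeqAIJ exposes the diagonal block:

#include <petscmat.h>

/* Sketch: print the local nnz and the nnz of the local diagonal block
   (the "diagonal_nnz" figures quoted in this thread).  nz_used is a
   PetscLogDouble, hence the %g format. */
PetscErrorCode ReportLocalNonzeros(Mat A)
{
  MatInfo         local, dinfo;
  Mat             Ad, Ao;
  const PetscInt *colmap;
  PetscMPIInt     rank;
  PetscErrorCode  ierr;

  PetscFunctionBegin;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MatGetInfo(A, MAT_LOCAL, &local);CHKERRQ(ierr);
  ierr = MatMPIAIJGetSeqAIJ(A, &Ad, &Ao, &colmap);CHKERRQ(ierr);  /* A must be MPIAIJ */
  ierr = MatGetInfo(Ad, MAT_LOCAL, &dinfo);CHKERRQ(ierr);
  ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] local nnz %g, diagonal-block nnz %g\n",
                                 rank, (double)local.nz_used, (double)dinfo.nz_used);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
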
>>>>> 
>>>>>> 
>>>>>> In this implementation I'm using MatCreate and VecCreate. I'm also leaving the partition sizes as PETSC_DECIDE.
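
If the rows differ widely in nonzero count, one alternative to PETSC_DECIDE is to choose the local row counts so that nonzeros, rather than rows, are split evenly. The sketch below assumes the per-row nonzero counts (nnz_per_row, a hypothetical array) are available on every rank:

#include <petscmat.h>

/* Sketch: compute this rank's local row count by walking the rows and
   cutting a new owner every time roughly total/size nonzeros have been
   assigned, then pass that count to MatSetSizes instead of PETSC_DECIDE. */
PetscErrorCode CreateBalancedMatrix(PetscInt M, PetscInt N, const PetscInt nnz_per_row[], Mat *A)
{
  PetscMPIInt    size, rank;
  PetscInt       i, mlocal = 0, total = 0, target, assigned = 0, owner = 0;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  for (i = 0; i < M; i++) total += nnz_per_row[i];
  target = total / size + 1;
  for (i = 0; i < M; i++) {
    if (owner == rank) mlocal++;                 /* row i belongs to 'owner'  */
    assigned += nnz_per_row[i];
    if (assigned >= target && owner < size - 1) { assigned = 0; owner++; }
  }
  ierr = MatCreate(PETSC_COMM_WORLD, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, mlocal, PETSC_DECIDE, M, N);CHKERRQ(ierr);
  ierr = MatSetType(*A, MATMPIAIJ);CHKERRQ(ierr);
  ierr = MatSetUp(*A);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
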
>>>>>> 
>>>>>> Finally, to run the application I'm using mpirun.hydra from MPICH, downloaded by the PETSc configure script.
>>>>>> I'm checking the process assignment as suggested in the last email.
>>>>>> 
>>>>>> Am I missing anything?
>>>>>

>>>>> Your network is very poor; likely ethernet. It is hard to get much speedup with such slow reductions and sends and receives.
>>>>> 
>>>>> Average time to get PetscTime(): 1.19209e-07
>>>>> Average time for MPI_Barrier(): 0.000215769
>>>>> Average time for zero size MPI_Send(): 5.94854e-05
>>>>> 
>>>>> I think you are seeing such bad results due to an unkind matrix nonzero structure giving poor load balance and too much communication, and a very poor computer network that just makes all the needed communication totally dominate.
>>>>> 
>>>>> Regards,
>>>>> Nelson
>>>>>

>>>>> On 2015-08-20 16:17, Matthew Knepley wrote:
>>>>> 
>>>>> On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [3]> wrote:
>>>>> Hello.
>>>>> 
>>>>> I am sorry for the long time without a response. I decided to rewrite my application in a different way and will send the log_summary output when I am done reimplementing it.
>>>>> 
>>>>> As for the machine, I am using mpirun to run jobs on an 8-node cluster. I modified the makefile in the streams folder so it would run using my hostfile.
>>>>> The output is attached to this email. It seems reasonable for a cluster with 8 machines. From "lscpu", each machine's CPU has 4 cores and 1 socket.
>>>>> 1) Your launcher is placing processes haphazardly. I would figure out how to assign them to certain nodes.
>>>>> 2) Each node has enough bandwidth for 1 core, so it does not make much sense to use more than 1.
>>>>> 
>>>>> Thanks,
>>>>> Matt
>>>>> 
>>>>> Cheers,
>>>>> Nelson
>>>>> 
>>>>> On 2015-07-24 16:50, Barry Smith wrote:
>>>>> It would be very helpful if you ran the code on say 1, 2, 4, 8, 16, ... processes with the option -log_summary and send (as attachments) the log summary information.
>>>>> 
>>>>> Also on the same machine run the streams benchmark; with recent releases of PETSc you only need to do
>>>>> 
>>>>> cd $PETSC_DIR
>>>>> make streams NPMAX=16 (or whatever your largest process count is)
>>>>> 
>>>>> and send the output.
>>>>> 
>>>>> I suspect that you are doing everything fine and it is more an issue with the configuration of your machine. Also read the information at http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
> 
> -- 
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener

 

Links:
------
[1] mailto:nelsonflsilva at ist.utl.pt
[2] mailto:nelsonflsilva at ist.utl.pt
[3] mailto:nelsonflsilva at ist.utl.pt
[4] mailto:nelsonflsilva at ist.utl.pt
[5] mailto:nelsonflsilva at ist.utl.pt