[petsc-users] Scalability issue
Nelson Filipe Lopes da Silva
nelsonflsilva at ist.utl.pt
Sat Aug 22 16:17:21 CDT 2015
Hi.
I managed to finish the re-implementation. I ran the program with 1, 2, 3, 4, 5, and 6 machines, saved the log summary of each run, and attached them to this email.
In these executions, the program performs matrix-vector products (MatMult, MatMultAdd) and vector-vector operations. From what I understand of the logs, the program spends most of its time in "VecScatterEnd".
In this example, the matrix taking part in the matrix-vector products is not very "diagonal heavy". The numbers below give, for each execution, the percentage of nonzeros in the diagonal block of each machine's local part, together with the execution time.
NMachines   Machine     %NNZ (diag block)   ExecTime
1           machine0    100.0%              16min08sec
2           machine0     91.1%              24min58sec
            machine1     69.2%
3           machine0     90.9%              25min42sec
            machine1     82.8%
            machine2     51.6%
4           machine0     91.9%              26min27sec
            machine1     82.4%
            machine2     73.1%
            machine3     39.9%
5           machine0     93.2%              39min23sec
            machine1     82.8%
            machine2     74.4%
            machine3     64.6%
            machine4     31.6%
6           machine0     94.2%              54min54sec
            machine1     82.6%
            machine2     73.1%
            machine3     65.2%
            machine4     55.9%
            machine5     25.4%
In this implementation I'm using MatCreate and VecCreate, and I'm leaving the partition sizes as PETSC_DECIDE (roughly the pattern sketched below).
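For illustration, a minimal sketch of that creation pattern (the sizes are the 1.1M x 2.1M example quoted later in the thread, used here only as placeholders; this is not the actual application code):

    Mat      G;
    Vec      m;
    PetscInt M = 1100000;   /* global rows    (placeholder) */
    PetscInt N = 2100000;   /* global columns (placeholder) */

    MatCreate(PETSC_COMM_WORLD, &G);
    MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, M, N);  /* local sizes left to PETSc */
    MatSetType(G, MATMPIAIJ);
    MatSetFromOptions(G);
    MatSetUp(G);

    VecCreate(PETSC_COMM_WORLD, &m);
    VecSetSizes(m, PETSC_DECIDE, N);   /* m is multiplied by G, so global length N */
    VecSetFromOptions(m);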
Finally, to run the application I'm using mpirun.hydra from MPICH, downloaded by the PETSc configure script. I'm checking the process assignment as suggested in the last email.
Am I missing anything?
Regards,
Nelson
On 2015-08-20 16:17, Matthew Knepley wrote:
> On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [3]> wrote:
>
>> Hello.
>>
>> I am sorry for the long time without response. I decided to rewrite my application in a different way and will send the log_summary output when done reimplementing.
>>
>> As for the machine, I am using mpirun to run jobs in an 8-node cluster. I modified the makefile in the streams folder so it would run using my hostfile. The output is attached to this email. It seems reasonable for a cluster with 8 machines. From "lscpu", each machine's CPU has 4 cores and 1 socket.
>
> 1) Your launcher is placing processes haphazardly. I would figure out how to assign them to certain nodes.
> 2) Each node has enough bandwidth for 1 core, so it does not make much sense to use more than 1.
>
> Thanks,
>
> Matt
>
>> Cheers,
>> Nelson
>>
>> On 2015-07-24 16:50, Barry Smith wrote:
>>
>>> It would be very helpful if you ran the code on say 1, 2, 4, 8, 16
>>> ... processes with the option -log_summary and send (as attachments)
>>> the log summary information.
>>>
>>> Also on the same machine run the streams benchmark; with recent
>>> releases of PETSc you only need to do
>>>
>>> cd $PETSC_DIR
>>> make streams NPMAX=16 (or whatever your largest process count is)
>>>
>>> and send the output.
>>>
>>> I suspect that you are doing everything fine and it is more an issue
>>> with the configuration of your machine. Also read the information at
>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers [2] on
>>> "binding".
>>>
>>> Barry
>>>
>>>> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <nelsonflsilva at ist.utl.pt [1]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have been using PETSc for a few months now, and it truly is a fantastic piece of software.
>>>>
>>>> In my particular example I am working with a large, sparse, distributed (MPI AIJ) matrix we can refer to as 'G'.
>>>> G is a horizontal, rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is commonly very sparse and not diagonal 'heavy' (for example, 5.2 million nnz of which ~50% are in the diagonal block of the MPI AIJ representation).
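(For context: the "diagonal block" of the MPI AIJ format is, for each process, the set of nonzeros whose columns are also owned locally; the remaining "off-diagonal" nonzeros are what VecScatter has to communicate during MatMult. A minimal sketch of how this split appears in the API, with placeholder sizes and assuming the per-row counts are known; this is not code from the thread:)

    PetscInt  m_local = 1000;   /* placeholder: number of locally owned rows of G */
    PetscInt *d_nnz, *o_nnz;

    PetscMalloc1(m_local, &d_nnz);   /* per-row nonzeros in the local (diagonal) block */
    PetscMalloc1(m_local, &o_nnz);   /* per-row nonzeros in the off-diagonal block     */
    /* ... fill d_nnz[] and o_nnz[] from the known sparsity pattern of G ... */
    MatMPIAIJSetPreallocation(G, 0, d_nnz, 0, o_nnz);
    PetscFree(d_nnz);
    PetscFree(o_nnz);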
>>>> To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.
>>>> I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:
>>>>
>>>> ->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achieve a good speedup in this operation, G should have as much of its nnz in the diagonal block as possible, due to the overlapping of communication and computation. But even when using a G matrix whose diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the time, the performance even gets worse.
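A minimal sketch of that product, with the vectors created from G so their parallel layouts match; variable names are placeholders rather than the application's own:

    Vec m, k, b;

    MatCreateVecs(G, &m, &b);   /* m matches the columns of G, b matches its rows */
    VecDuplicate(b, &k);        /* k has the same layout as b */
    /* ... fill m and k ... */
    MatMultAdd(G, m, k, b);     /* b = G*m + k */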
>>>>
>>>> ->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used in the linear solver and G' is the transpose of G. The speedup in this operation is not worse, although it is not very good.
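A sketch of one way to form that product; whether MatMatTransposeMult handles the parallel AIJ case may depend on the PETSc version, so an explicit-transpose variant is shown as an alternative. Again, this is an illustration, not the code from the thread:

    Mat A;

    /* Direct product, A = G * G^T (support may depend on matrix type / version) */
    MatMatTransposeMult(G, G, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);

    /* Alternative: form G^T explicitly and multiply:
         Mat Gt;
         MatTranspose(G, MAT_INITIAL_MATRIX, &Gt);
         MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);
    */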
>>>>
>>>>
>>>> ->Linear problem solving. Lastly, in this operation I solve "Ax=b", built from the results of the last two operations. I tried to apply an RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that the permutation is performed locally in each processor and thus the final result differs with the number of processors. I assume this was intended to reduce communication. The solution I found was:
>>>> 1. calculate A;
>>>> 2. calculate, locally on one machine, the RCM permutation IS using A;
>>>> 3. apply this permutation to the rows of G (sketched below).
>>>> This works well, and A is generated as if RCM-permuted. It is fine to do this operation on one machine because it is only done once, while reading the input. The nnz of G, however, become more spread out and less diagonal, causing problems when calculating G * m + k = b.
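A sketch of that ordering step, assuming (as described) that it is done on one machine with sequential copies of A and G; an identity column IS is used so that only the rows of G are permuted. Names are placeholders:

    IS       rperm, cperm, allcols;
    Mat      Aperm, Gperm;
    PetscInt Mg, Ng;

    /* RCM ordering computed from the sparsity pattern of A */
    MatGetOrdering(A, MATORDERINGRCM, &rperm, &cperm);

    /* A "as if RCM-permuted" */
    MatPermute(A, rperm, cperm, &Aperm);

    /* Apply the same permutation to the rows of G only */
    MatGetSize(G, &Mg, &Ng);
    ISCreateStride(PETSC_COMM_SELF, Ng, 0, 1, &allcols);   /* identity column order */
    ISSetPermutation(allcols);
    MatPermute(G, rperm, allcols, &Gperm);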
>>>>
>>>> These 3 operations (except the permutation) are performed in each iteration of my algorithm.
>>>>
>>>> So, my questions are:
>>>> - What characteristics of G lead to a good speedup in the operations I described? Am I missing something and too obsessed with the diagonal block?
>>>>
>>>> - Is there a better way to permute A, without permuting G, and still get the same result using 1 or N machines?
>>>>
>>>> I have been avoiding asking for help for a while. I'm very sorry for the long email.
>>>> Thank you very much for your time.
>>>> Best Regards,
>>>> Nelson
>
> --
>
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>
> -- Norbert Wiener
Links:
------
[1] mailto:nelsonflsilva at ist.utl.pt
[2] http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
[3] mailto:nelsonflsilva at ist.utl.pt
Attachments (text, charset unspecified):
Log01P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0006.ksh>
Log02P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0007.ksh>
Log03P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0008.ksh>
Log04P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0009.ksh>
Log05P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0010.ksh>
Log06P: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150822/b1b107e9/attachment-0011.ksh>