<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <span dir="ltr"><<a href="mailto:nelsonflsilva@ist.utl.pt" target="_blank">nelsonflsilva@ist.utl.pt</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello.<br>

<br>

I am sorry for the long delay in responding. I decided to restructure my application and will send the -log_summary output once the reimplementation is done.<br>

<br>

As for the machine, I am using mpirun to run jobs on an 8-node cluster. I modified the makefile in the streams folder so that it would run using my hostfile.<br>

The output is attached to this email. It seems reasonable for a cluster with 8 machines. According to "lscpu", each machine has 1 socket with 4 cores.<br></blockquote><div><br></div><div>1) Your launcher is placing processes haphazardly. I would figure out how to assign them to specific nodes.</div><div><br></div><div>2) Each node has enough memory bandwidth for 1 core, so it does not make much sense to use more than 1 core per node.</div><div><br></div><div>  Thanks,</div><div><br></div><div>    Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Cheers,<br>

Nelson<br>

<br>

<br>

On 2015-07-24 16:50, Barry Smith wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

It would be very helpful if you ran the code on say 1, 2, 4, 8, 16<br>

... processes with the option -log_summary and send (as attachments)<br>

the log summary information.<br>

<br>

   Also on the same machine run the streams benchmark; with recent<br>

releases of PETSc you only need to do<br>

<br>

cd $PETSC_DIR<br>

make streams NPMAX=16 (or whatever your largest process count is)<br>

<br>

and send the output.<br>

<br>

I suspect that you are doing everything fine and it is more an issue<br>

with the configuration of your machine. Also read the information at<br>

<a href="http://www.mcs.anl.gov/petsc/documentation/faq.html#computers" rel="noreferrer" target="_blank">http://www.mcs.anl.gov/petsc/documentation/faq.html#computers</a> on<br>

"binding"<br>

<br>

  Barry<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <<a href="mailto:nelsonflsilva@ist.utl.pt" target="_blank">nelsonflsilva@ist.utl.pt</a>> wrote:<br>

<br>

Hello,<br>

<br>

I have been using PETSc for a few months now, and it truly is a fantastic piece of software.<br>

<br>

In my particular example I am working with a large, sparse, distributed (MPI AIJ) matrix, which we can refer to as 'G'.<br>

G is a wide rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is typically very sparse and not diagonally 'heavy' (for example, 5.2 million nnz, of which ~50% are in the diagonal block of the MPI AIJ representation).<br>

To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.<br>

I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:<br>

<br>

->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achieve a good speedup in this operation, G should have as many of its nnz as possible in the diagonal block, so that communication can overlap with computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the time, the performance even gets worse.<br>
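
A minimal sketch (plain Python, not PETSc code) of how the share of nnz that falls in the diagonal block can be estimated for a given process count. It assumes the default contiguous split of rows and columns across processes; the names ownership_starts and diagonal_block_fraction and the toy entries are hypothetical and only for illustration:<br>

<pre>
```python
from bisect import bisect_right


def ownership_starts(n, nprocs):
    """First index owned by each rank when n indices are split contiguously,
    with the first (n % nprocs) ranks receiving one extra index."""
    base, extra = divmod(n, nprocs)
    return [rank * base + min(rank, extra) for rank in range(nprocs)]


def diagonal_block_fraction(entries, n_rows, n_cols, nprocs):
    """Fraction of nonzeros (i, j) whose row owner equals the owner of column j.

    The column split follows the layout of the input vector 'm', the row
    split follows the layout of the output vector 'b'.
    """
    row_starts = ownership_starts(n_rows, nprocs)
    col_starts = ownership_starts(n_cols, nprocs)
    in_diag = 0
    for i, j in entries:
        row_owner = bisect_right(row_starts, i) - 1
        col_owner = bisect_right(col_starts, j) - 1
        if row_owner == col_owner:
            in_diag += 1
    return in_diag / len(entries)


# Toy 4 x 8 sparsity pattern (made-up data), listed as (row, column) pairs.
entries = [(0, 0), (0, 1), (1, 2), (1, 7), (2, 4), (3, 6), (3, 0), (2, 5)]

for nprocs in (1, 2, 4):
    print(nprocs, diagonal_block_fraction(entries, 4, 8, nprocs))
```
</pre>

Every nonzero outside the diagonal block needs an entry of 'm' owned by another process, so evaluating this fraction for each process count of interest gives a rough idea of how much of G * m depends on communication.<br>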

<br>

->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used in the linear solver and G' is the transpose of G. The speedup for this operation is not as bad, although it is still not very good.<br>

<br>

->Linear problem solving. Lastly, in this operation I solve "Ax=b" using the results of the previous two operations. I tried applying an RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that the permutation is computed locally on each processor, so the final result differs with the number of processors. I assume this was intended to reduce communication. The solution I found was:<br>

1-calculate A<br>

2-calculate, locally on 1 machine, the RCM permutation IS using A<br>

3-apply this permutation to the rows of G.<br>

This works well, and A is generated as if it had been RCM permuted. It is fine to do this operation on one machine because it is only done once, while reading the input. However, the nnz of G become more spread out and less concentrated in the diagonal block, which causes problems when calculating G * m + k = b.<br>
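
The steps above rely on the fact that permuting the rows of G permutes A = G * G' symmetrically. A small sketch in plain Python with dense lists; the matrix and the ordering 'perm' are made up for illustration and merely stand in for an RCM result:<br>

<pre>
```python
def transpose(m):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Dense matrix product a * b."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def permute_rows(m, perm):
    """Row i of the result is row perm[i] of m."""
    return [m[p] for p in perm]


def permute_symmetric(a, perm):
    """Apply the same permutation to the rows and columns of a square matrix."""
    return [[a[p][q] for q in perm] for p in perm]


# Toy 3 x 5 matrix standing in for G (made-up values).
G = [
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 1],
    [4, 0, 0, 5, 0],
]
perm = [2, 0, 1]  # hypothetical ordering, e.g. what an RCM routine might return

A = matmul(G, transpose(G))
G_perm = permute_rows(G, perm)
A_from_G = matmul(G_perm, transpose(G_perm))

# Building A from the row-permuted G matches permuting A itself.
assert A_from_G == permute_symmetric(A, perm)
print("row permutation of G reproduces the permuted A")
```
</pre>

Because the identity holds for any ordering, the result does not depend on how many processes later share G, as long as the same 'perm' is used.<br>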

<br>

These 3 operations (except the permutation) are performed in each iteration of my algorithm.<br>

<br>

So, my questions are:<br>

-What are the characteristics of G that lead to a good speedup in the operations I described? Am I missing something and focusing too much on the diagonal block?<br>

<br>

-Is there a better way to permute A without permuting G and still get the same result whether using 1 or N machines?<br>

<br>

<br>

I have been avoiding asking for help for a while. I'm very sorry for the long email.<br>

Thank you very much for your time.<br>

Best Regards,<br>

Nelson<br>

</blockquote></blockquote>

</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div>