Slow speed after changing from serial to parallel (with ex2f.F)

Satish Balay balay at mcs.anl.gov
Sat Apr 19 13:19:34 CDT 2008


Ben,

This conversation is getting long and winding, and we are getting
into your cluster administration - which is not PETSc related.

I suggest you work out with your system admin how to use the cluster
and how to use bsub.

http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html

However, I'll point out the following things.

- I suggest learning how to schedule an interactive job on your
  cluster. This will help you with running multiple jobs on the same
  machine.

- When making comparisons, keep the differences between the runs you
compare to a minimum.

 * For example: you are comparing runs between different queues, '-q
 linux64' and '-q mcore_parallel'. There might be differences here
 that can result in different performance.

 * If you are getting only part of a machine [for -n 1 jobs] - verify
 whether you are sharing the rest of it with some other job. Without
 this verification your numbers are not meaningful. [Depending upon
 how the queue is configured, it can allocate either part of a node
 or a full node.]

 * You should be able to request 4 procs [i.e. 1 complete machine]
 but run -np 1, 2, or 4 within that allocation. [This is easier to do
 in interactive mode - see the example commands after this list.]
 This ensures nobody else is using the machine, and you can run your
 code multiple times to see if you get consistent results.
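
For example, an interactive session might look like this [a sketch -
the queue name, the '-I' interactive flag, and the 'span[hosts=1]'
resource string are assumptions you should verify with your system
admin]:

  bsub -I -q mcore_parallel -n 4 -R "span[hosts=1]" /bin/bash
  [then, from the shell you get on the allocated node:]
  mpirun -np 1 ./a.out
  mpirun -np 2 ./a.out
  mpirun -np 4 ./a.out

[Use whichever launcher matches the MPI your a.out is built with -
e.g. mpirun.lsf for your current mvapich, or mpiexec for the MPICH2
build suggested below.]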

Regarding the primary issue you've had - performance debugging your
PETSc application in *SMP mode* - we've observed performance
anomalies in your log_summary output for both your code and ex2f.F.
This could be due to one or more of the following:

- issues in your code
- issues with the MPI you are using
- issues with the cluster you are using.

To narrow this down, the comparisons I suggest are:

- Compare my ex2f.F runs with the *exact* same runs on your machine
[you've said that you also have access to a 2-quad-core Intel Xeon
X5355 machine]. You should be able to reproduce exactly the same
experiment as me and compare the results. This keeps the application
software the same and exposes differences in system software etc.

>>>>>
  No of Nodes | Processors                 | Qty per node | Total cores per node | Memory per node
  4           | Quad-Core Intel Xeon X5355 | 2            | 8                    | 16 GB   <== [2 quad-cores = 8 cores/node]
  60          | Dual-Core Intel Xeon 5160  | 2            | 4                    | 8 GB
<<<<<

i.e. configure the latest mpich2 with [default compilers gcc/gfortran]:
./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker
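
A full build might look like this [a sketch - the install prefix is
just an example path]:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker \
      --prefix=$HOME/soft/mpich2-nemesis
  make
  make install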

Build PETSc with this MPI [and same compilers]
./config/configure.py --with-mpi-dir= --with-debugging=0

And run ex2f.F with a 600x600 grid on 1, 2, 4, and 8 procs on a
*single* X5355 machine. [It might be in a different queue on your
cluster.]
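
i.e. something like [a sketch - the mpi-dir path matches the example
install prefix above; ex2f.F is under src/ksp/ksp/examples/tutorials
in the PETSc tree, and PETSC_DIR/PETSC_ARCH should be set as
configure.py instructs]:

  ./config/configure.py --with-mpi-dir=$HOME/soft/mpich2-nemesis --with-debugging=0
  make all
  cd src/ksp/ksp/examples/tutorials
  make ex2f
  mpiexec -n 1 ./ex2f -m 600 -n 600 -log_summary
  mpiexec -n 2 ./ex2f -m 600 -n 600 -log_summary
  mpiexec -n 4 ./ex2f -m 600 -n 600 -log_summary
  mpiexec -n 8 ./ex2f -m 600 -n 600 -log_summary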

- Now compare ex2f.F performance with MPICH [as built above] and with
the current MPI you are using. This should identify the performance
differences between MPI implementations within the box [within the
SMP box].
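
i.e. the same problem on the same node, launched once with your
current mvapich [via LSF, as in your earlier mail] and once with the
MPICH2 mpiexec directly on the node - for example [a sketch; note
that you need one ex2f binary built against each MPI, and you should
confirm the mvapich run actually lands on a single node]:

  bsub -o log -q mcore_parallel -n 4 -m quadcore -a mvapich mpirun.lsf ./ex2f -m 600 -n 600 -log_summary
  mpiexec -n 4 ./ex2f -m 600 -n 600 -log_summary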

- Now compare runs between ex2f.F and your application.

At each of the above comparison steps, we are hoping to identify the
reason for the differences and rectify it. Perhaps this is not
possible on your cluster, and you can't improve on what you already
have.

If you can't debug the SMP performance issues, you can avoid SMP
completely and use 1 MPI task per machine [or 1 MPI task per memory
bank => 2 per machine]. But you'll still have to do a similar
analysis to make sure there are no performance anomalies in the tool
chain.

[i.e. hardware, system software, MPI, application]
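
With LSF this kind of task placement is usually requested with a
'span' resource string - for example [a sketch; verify the exact
syntax with your admin]:

  bsub -o log -q mcore_parallel -n 4 -R "span[ptile=1]" -a mvapich mpirun.lsf ./a.out
      [1 MPI task per node, spread over 4 nodes]
  bsub -o log -q mcore_parallel -n 4 -R "span[ptile=2]" -a mvapich mpirun.lsf ./a.out
      [2 MPI tasks per node, spread over 2 nodes]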

If you are willing to do the above steps, we can help with the
comparisons. As mentioned, this is getting long and winding. If you
have further questions in this regard, we should continue at
petsc-maint at mcs.anl.gov

Satish

On Sat, 19 Apr 2008, Ben Tay wrote:

> Hi Satish,
> 
> First of all, I forgot to inform you that I've changed m and n to 800. I would
> like to see if the larger value makes the scaling better. If required, I can
> redo the test with m,n=600.
> 
> I can install MPICH but I don't think I can choose to run on a single machine
> using 1 to 8 procs. In order to run the code, I usually have to use the
> command
> 
> bsub -o log -q linux64 ./a.out       for single procs
> 
> bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no.
> of procs.       for multiple procs
> 
> After that, when the job is running, I'll be given the server which my job
> runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or
> 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told
> that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu.
> 
> Btw, are you saying that I should first install the latest MPICH2 build with the
> option:
> 
> ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker And then install
> PETSc with the MPICH2?
> 
> So after that, do you know how to do what you've suggested for my servers? I
> don't really understand what you mean. Am I supposed to run 4 jobs on 1
> quadcore? Or 1 job using 4 cores on 1 quadcore? Well, I do know that
> atlas3-c00 to c03 are the locations of the quad cores. I can force the job to
> use them with
> 
> bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out
> 
> Lastly, I made a mistake regarding the different times reported by the same
> compiler. Sorry about that.
> 
> Thank you very much.

