[petsc-users] log_summary time ratio and flops ratio

Barry Smith bsmith at mcs.anl.gov
Mon Feb 8 17:19:33 CST 2016


  The following routines are all embarrassingly parallel. 

VecAXPY          1001160 1.0 2.0483e+01 2.7 1.85e+10 1.1 0.0e+00 0.0e+00 0.0e+00  3  4  0  0  0   3  4  0  0  0 219358
VecAYPX           600696 1.0 6.6270e+00 2.0 1.11e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0 406161
VecAXPBYCZ           194 1.0 4.9155e-03 1.7 7.17e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 354257
VecWAXPY             954 1.0 4.3450e-02 1.5 8.81e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 49269
VecMAXPY          600696 1.0 1.2831e+01 2.0 2.22e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0 420212

Note that the time ratio (max over min across processes) for these is between 1.5 and 2.7; since these routines involve no communication, this indicates that there is very likely an imbalance in the amount of work assigned to different processes. So your load balancing is highly suspect.
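
  A quick first check is whether the partitioning itself is balanced, that is, whether every process owns roughly the same number of matrix rows and vector entries, since for embarrassingly parallel kernels like VecAXPY the time on each process is essentially proportional to the local vector length. A minimal sketch you could drop into your code after assembly (A and x are placeholders for your matrix and a representative vector; error checking omitted):

  /* Report how much of the problem each rank owns; uneven counts would
     explain the max/min time ratios above. */
  PetscInt    rstart, rend, nlocal;
  PetscMPIInt rank;

  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MatGetOwnershipRange(A, &rstart, &rend);   /* rows of A owned by this rank */
  VecGetLocalSize(x, &nlocal);               /* local length of x            */
  PetscSynchronizedPrintf(PETSC_COMM_WORLD,
      "[%d] owned matrix rows %D, local vector length %D\n",
      rank, rend - rstart, nlocal);
  PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);

If those counts are nearly equal, the imbalance is coming from the machine rather than from the partitioning, for example nodes that are slower, oversubscribed, or shared with other jobs.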

  Barry
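
  Regarding the IPM profile quoted below, where MPI_Allreduce accounts for 74% of the MPI time: for dot products much of that is usually processes waiting at the reduction rather than the reduction itself, so it is often another symptom of the same load imbalance. One way to separate the two effects in your own timing experiments (a standalone MPI sketch, not PETSc-specific) is to time an explicit barrier immediately before the reduction:

  #include <mpi.h>
  #include <stdio.h>

  /* Minimal sketch: split the apparent "Allreduce time" into the time a
     rank spends waiting for the others (load imbalance) and the time of
     the collective itself. */
  int main(int argc, char **argv)
  {
    double local = 1.0, global, t0, t1, t2;
    int    rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... the per-rank work being measured would go here ... */

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* slow ranks make the others wait here */
    t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t2 = MPI_Wtime();              /* t2 - t1 is the collective itself     */

    printf("[%d] wait before reduction: %g s, allreduce: %g s\n",
           rank, t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
  }

If the wait time dominates, improving the work distribution will shrink the MPI_Allreduce percentage as well; only if the collective itself is slow is the network the limiting factor.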



> On Feb 8, 2016, at 4:21 PM, Xiangdong <epscodes at gmail.com> wrote:
> 
> Based on what you suggested, I have done the following:
> 
> i) I reran the same problem without output. The ratios are still roughly the same, so the problem is not IO.
> 
> ii) I reran the program on a supercomputer (Stampede) instead of our group cluster. The MPI_Barrier time got better:
> 
> Average time to get PetscTime(): 0
> Average time for MPI_Barrier(): 1.27792e-05
> Average time for zero size MPI_Send(): 3.94508e-06
> 
> the full petsc logsummary is here: https://googledrive.com/host/0BxEfb1tasJxhTjNTVXh4bmJmWlk
> 
> iii) Since the time ratios of VecDot (2.5) and MatMult (1.5) are still high, I reran the program with the IPM module. The IPM summary is here: https://drive.google.com/file/d/0BxEfb1tasJxhYXI0VkV0cjlLWUU/view?usp=sharing. From these IPM results, MPI_Allreduce takes 74% of the MPI time. The communication-by-task figure (the first figure on p. 4 in the link above) shows that it is not well balanced. Is this related to the hardware and network (which users cannot control), or can I do something in my code to improve it?
> 
> Thank you.
> 
> Best,
> Xiangdong
> 
> On Fri, Feb 5, 2016 at 10:34 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>   Make the same run with no IO and see if the numbers are much better and if the load balance is better.
> 
> > On Feb 5, 2016, at 8:59 PM, Xiangdong <epscodes at gmail.com> wrote:
> >
> > If I want to know whether only rank 0 is slow (since it may have more IO) or whether a portion of the cores are actually slow, what tools can I start with?
> >
> > Thanks.
> >
> > Xiangdong
> >
> > On Fri, Feb 5, 2016 at 5:27 PM, Jed Brown <jed at jedbrown.org> wrote:
> > Matthew Knepley <knepley at gmail.com> writes:
> > >> I attached the full summary. At the end, it has
> > >>
> > >> Average time to get PetscTime(): 0
> > >> Average time for MPI_Barrier(): 8.3971e-05
> > >> Average time for zero size MPI_Send(): 7.16746e-06
> > >>
> > >> Is it an indication of slow network?
> > >>
> > >
> > > I think so. It takes nearly 100 microseconds to synchronize processes.
> >
> > Edison with 65536 processes:
> > Average time for MPI_Barrier(): 4.23908e-05
> > Average time for zero size MPI_Send(): 2.46466e-06
> >
> > Mira with 16384 processes:
> > Average time for MPI_Barrier(): 5.7075e-06
> > Average time for zero size MPI_Send(): 1.33179e-05
> >
> > Titan with 131072 processes:
> > Average time for MPI_Barrier(): 0.000368595
> > Average time for zero size MPI_Send(): 1.71567e-05
> >
> 
> 


