Unsure about some entries in log_summary for 2B DoF problem

Mon May 4 14:00:56 CDT 2009

On May 4, 2009, at 1:51 PM, Richard Tran Mills wrote:

> Barry,
>
> Is there a way to dump the log data for all processes so that I can  
> do things like look at the time spent in each stage by each  
> process?  I thought you or someone else had mentioned adding a  
> capability to do this in PETSc to a file format that would be easy  
> to manipulate using something like Python... but perhaps I recall  
> incorrectly.

   Hong has worked a little on this; you are welcome to mess with it  
if you want (Hong, have you pushed this?).

    Barry

>
>
> It would be very nice to have this to calculate some more detailed  
> statistics.
>
> --Richard
>
> Barry Smith wrote:
>>   Matt is completely correct. What this means is that though some  
>> processes wait a long time for the dots, MOST processes don't wait  
>> much at all.
>> In other words, the dot causes very little idle time integrated  
>> over the whole machine.
>> Meanwhile for flow (where the percentage is large) the dots cause a  
>> LARGE amount of idle time integrated over the machine.
>> Why it is high for one and not the other I do not know.
>>   Barry
>> On May 4, 2009, at 1:09 PM, Matthew Knepley wrote:
>>> I believe that the time reported there is the collective sum of the  
>>> event times divided by the collective sum of the stage times. If  
>>> you look at the time imbalance, it is a staggering 9.7, which  
>>> either means
>>>
>>>  1) The partition is really crap (which we know isn't true)
>>>
>>>  2) Some procs spend a lot of time waiting
>>>
>>> We can get at this waiting time with the split VecDot() events.
>>>
>>>  Matt
>>>
>>> On Mon, May 4, 2009 at 12:58 PM, Richard Tran Mills <rmills at climate.ornl.gov> wrote:
>>> PETSc folks,
>>>
>>> I was looking over the log summary data for the 2 billion degrees  
>>> of freedom transport problem, and I'm a bit puzzled by some of the  
>>> things I'm seeing.  (I sent a tarball of this to the pflotran-dev  
>>> list on April 30.)  For instance, looking at the run at 32768  
>>> cores, I see that the total time for the "transport" phase is  
>>> 3.2139e+02 seconds.  But if I look at the VecDot line for the  
>>> transport stage, I see
>>>
>>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> VecDot              1306 1.0 4.1529e+01 9.7 1.76e+08 1.1 0.0e+00 0.0e+00 1.3e+03  1  0  0  0  1   3  0  0  0 24 128305
>>>
>>> This may be hard to read if an email client wraps it, but it's  
>>> saying that 3% of the time in the stage was spent on VecDot()s.  
>>> But the max time in VecDot is 4.1529e+01 seconds, close to  
>>> thirteen percent of the stage time.  Does the "%T" for the stage  
>>> mean something other than what I think it does?
>>>
>>> --Richard
>>>
>>> -- 
>>> Richard Tran Mills, Ph.D.            |   E-mail: rmills at climate.ornl.gov
>>> Computational Scientist              |   Phone:  (865) 241-3198
>>> Computational Earth Sciences Group   |   Fax:    (865) 574-0405
>>> Oak Ridge National Laboratory        |   http://climate.ornl.gov/~rmills
>>>
>>>
>>>
>>> -- 
>>> What most experimenters take for granted before they begin their  
>>> experiments is infinitely more interesting than any results to  
>>> which their experiments lead.
>>> -- Norbert Wiener
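
The two readings of "%T" discussed in this thread can be compared with a few lines of Python. This is only a sketch: the first part uses the figures quoted from the VecDot line for the 32768-core run, and the second part uses a made-up 8-process timing set (hypothetical, not from the actual log) to show how a large max/min ratio can coexist with a small summed percentage. The helper names stage_percent and imbalance_ratio are illustrative and are not PETSc functions. The summed formula follows Matt's description above, which he offers as a belief rather than a confirmed definition.

```python
# Figures copied from the quoted log summary (32768-core run).
stage_time = 3.2139e+02   # total time of the "transport" stage, seconds
vecdot_max = 4.1529e+01   # max VecDot time over all processes, seconds
vecdot_ratio = 9.7        # max/min ratio of VecDot time (rounded in the log)
reported_pct = 3          # stage %T printed for VecDot (rounded in the log)


def stage_percent(event_times, stage_times):
    """Summed event time over summed stage time, as a percentage."""
    return 100.0 * sum(event_times) / sum(stage_times)


def imbalance_ratio(event_times):
    """Max/min ratio of per-process times, like the Ratio column."""
    return max(event_times) / min(event_times)


# Reading 1: the slowest process's VecDot time as a share of the stage.
max_based_pct = 100.0 * vecdot_max / stage_time

# The ratio column implies the fastest process spent far less time in VecDot.
vecdot_min = vecdot_max / vecdot_ratio

# Reading 2: if every process has about the same stage time (an assumption),
# the reported percentage implies this mean VecDot time per process.
implied_mean = reported_pct / 100.0 * stage_time

print(f"max-based share of stage : {max_based_pct:.1f}%")
print(f"implied min VecDot time  : {vecdot_min:.2f} s")
print(f"implied mean VecDot time : {implied_mean:.2f} s")

# Hypothetical 8-process example: one process waits a long time, the rest
# do not. These numbers are invented purely for illustration.
toy_event = [4.0] * 7 + [40.0]
toy_stage = [320.0] * 8
print(f"toy summed %T  : {stage_percent(toy_event, toy_stage):.1f}%")
print(f"toy max/min    : {imbalance_ratio(toy_event):.1f}")
```

Under these assumptions the max-based share comes out near thirteen percent, while the implied mean time per process is much closer to the minimum than to the maximum. That matches the explanation given above: only a small fraction of processes spend a long time in the dot products.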