[mpich-discuss] thread MPI calls

William Gropp wgropp at illinois.edu
Fri Jul 31 16:13:18 CDT 2009


Even without the specifics, it would be useful to have a test program  
that could run these tests and present the differences in  
performance.  And a version that, like "stress", could run for a  
specified length of time, would be helpful in testing for races in the  
thread and smp code.  Could one of the summer students write such an  
example?

Bill

On Jul 27, 2009, at 1:25 PM, chong tan wrote:

>
> Nicholas and Pavan, and D,
>
> Some tiny bit of info from my work.  When we run our parallelized
> application, we like to see time spent in communication, MPICH2 that is,
> being 30% or less.  That includes setting up the data exchange, and
> processes having to wait for other processes.  Our application will make
> a good number of communications, in the 100 billion to a few trillion,
> over a period of hours to months.
>
> The data sent between the processes is as little as 8 bytes, up to a few
> MBytes.  Most of the time, the data are small, < 128 bytes.
>
> MPICH2 does a good job of buffering sends, but that really did not help
> us much, as there is one recv for each send.
>
> Our monitoring shows most recvs take around 0.6 us on most of our
> boxes.  We are talking about $12K boxes here, so that is not bad.
>
> Currently, we observe 5-10% in recv time that we hope to eliminate by
> parallelizing the recv operations (when the main thread is doing
> meaningful work).  That fits our application very well, as recv can be
> isolated easily.
>
> We have tried these on tests that run for hours:
>  -  non-threaded, blocking recv      : this is still the fastest solution
>  -  non-threaded, Irecv              : bad
>  -  non-threaded, pre-launched Irecv : bad, but not too bad
>  -  thread multiple                  : very bad
>  -  thread multiple with Irecv       : not as bad as very bad
>  -  thread funneled                  : super bad
>
> We analyzed a few tests where the 'recv' thread could have run in
> parallel with the main thread nicely, and yet negative gains were
> observed.  We don't have any theory why that could have happened.
>
> So, we are particularly curious about what is happening with thread
> multiple.  We have one thing in common between the threaded and
> non-threaded Irecv tests: Waitall.  Could this be the cause?
>
> thanks
> tan
>
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Sent: Saturday, July 25, 2009 2:40:45 PM
> Subject: Re: [mpich-discuss] thread MPI calls
>
> Nicholas,
>
> From what I understand about your application, there are two  
> approaches you can use:
>
> 1. Use no threads -- in this case, each worker posts Irecv's from all
> processes it's expecting messages from (using MPI_ANY_SOURCE if needed)
> and does some loop similar to:
>
> till_work_is_done {
>     compute();
>     Waitany() or Testany()
> }
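
The loop above might be fleshed out roughly as follows. This is a minimal sketch only: `compute_one_step()`, `handle_message()`, and `work_remaining()` are hypothetical application hooks, and the 128-byte buffer cap is arbitrary.

```c
/* Sketch of approach 1: post one Irecv per expected peer up front, then
 * overlap computation with Testany polling.  Requires an MPI library;
 * the three hook functions are hypothetical application code. */
#include <mpi.h>
#include <stdlib.h>

#define MSG_MAX 128  /* arbitrary message-size cap for this sketch */

void compute_one_step(void);                     /* application work   */
void handle_message(char *msg, MPI_Status *st);  /* consume a message  */
int  work_remaining(void);                       /* termination test   */

void worker_loop(int npeers, const int *peers, MPI_Comm comm)
{
    char (*buf)[MSG_MAX] = malloc(npeers * sizeof *buf);
    MPI_Request *reqs    = malloc(npeers * sizeof *reqs);

    for (int i = 0; i < npeers; i++)
        MPI_Irecv(buf[i], MSG_MAX, MPI_CHAR, peers[i], MPI_ANY_TAG,
                  comm, &reqs[i]);

    while (work_remaining()) {
        compute_one_step();
        int idx, flag;
        MPI_Status st;
        MPI_Testany(npeers, reqs, &idx, &flag, &st);
        if (flag && idx != MPI_UNDEFINED) {
            handle_message(buf[idx], &st);
            /* repost so this peer's receive slot stays armed */
            MPI_Irecv(buf[idx], MSG_MAX, MPI_CHAR, peers[idx],
                      MPI_ANY_TAG, comm, &reqs[idx]);
        }
    }
    free(buf);
    free(reqs);
}
```

Testany keeps the compute loop from blocking; substituting Waitany trades polling overhead for blocking when there is no useful work to overlap.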
>
> This is the approach most master-worker applications, such as  
> mpiBLAST, tend to use.
>
> 2. Otherwise, you can use threads as you suggested below. It is true  
> that there is some overhead, but unless you are using very small  
> messages (e.g., < 64 bytes), and performing a lot of communication  
> (e.g., 90% of your application is communication time), you'll not  
> notice any overhead.
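>

If the threaded route is taken, the initialization matters: the program must request MPI_THREAD_MULTIPLE explicitly and check what was actually provided. A minimal sketch, assuming an MPI library is available (`recv_loop()` is a hypothetical application receive thread, not part of any MPI API):

```c
/* Sketch of approach 2: a dedicated receive thread alongside the main
 * compute thread, which requires MPI_THREAD_MULTIPLE support. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

void *recv_loop(void *arg);  /* hypothetical application receive thread */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    pthread_t recv_tid;
    pthread_create(&recv_tid, NULL, recv_loop, NULL);

    /* ... main thread computes and sends ... */

    pthread_join(recv_tid, NULL);
    MPI_Finalize();
    return 0;
}
```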
>
> In any case, we are working on some enhanced designs that will  
> minimize threading overheads even in such rare cases (some of them  
> are experimentally included in the mpich2-1.1.x series).
>
> -- Pavan
>
> On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
> > Sorry to shamelessly invade Tan's, but since we're in the middle of a
> > thread about threads, I thought I'd rephrase an old question I once
> > tried to state here -- with a lot less understanding of the problem
> > back then. Context:  My app is a rather typical
> > single-centralized-master, N-1-worker, pool-of-tasks setup, except
> > that workers don't just consume, but may also split tasks that seem
> > too hard, which ends up pushing large series of new child tasks back
> > to where the parent had been obtained, and so on, recursively.
> >
> > Now, most task-related messages and structures don't handle the
> > real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
> > objects (~1 KB) that represent them -- which works fine in many
> > situations where some subset of metadata suffices. However, after a
> > worker gets a task assignment, at some point it does need the complete
> > file. Conversely, when one is divided, the splitting worker needs to
> > store the new files somewhere. The current approach is as follows:
> >
> >    - all new task files produced by a worker W are stored to a local
> > HD on its host
> >
> >    - master is immediately notified about each creation (children are
> > pushed back into queue,
> >      but by reference only, including a "currently located at ..." field),
> >
> >    - when master assigns a task to some other worker Z, the message
> > includes said location, and
> >
> >    - worker Z then sends a msg to worker W requesting the task, which
> > W sends as one fat MPI message.
> >
> > This achieves decent load balancing, but although a worker can't
> > really do anything productive while waiting for a totally new
> > datafile, it may certainly not block if it is to stay deadlock-free --
> > it [or *some* component, anyway] needs to be ready to serve any
> > potential incoming "full datafile requests" from other workers within
> > some constant amount of delay that may not depend on its own pending
> > request.
> >
> > So, I first tried the nonblocking approach; no good, worker + file
> > server chores combined yield a giant statechart, way too complex to
> > debug and maintain. Then, trying to stay away from hybrid thread+MPI
> > code, I tried separating the worker and server as two different
> > processes, and ran twice as many processes as available processors.
> > Although Linux did a pretty good job scheduling them (not surprising
> > since they're almost purely cpu and i/o bound, respectively), there is
> > some performance penalty, plus it's messy to be keeping track of how
> > mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
> > were fine suddenly become unusable and must be avoided, etc.
> >
> > Eventually I gave up on my "no-hybrid" and "don't want to depend on
> > thread_multiple support" wishes, got myself a pthreads tutorial and
> > ended up with a worker version that uses a 2nd thread (+ possibly
> > several subthreads thereof) to keep serving files regardless of main
> > thread (actual worker code) status -- and with cleaner, better
> > separated code. (Note that there is little or no interaction between
> > Wi and Si once launched -- Si just needs to keep serving files that Wi
> > produced earlier on but doesn't even remember, let alone care about,
> > anymore. Both need MPI messaging all the time, though.)
> >
> > Bottom line: as scary as it looked to me to go hybrid, and despite
> > several warnings from experts against it, in this particular case it
> > turned out to be simpler than many of the clumsy attempts at avoiding
> > it, and probably the "least ugly" approach I've found so far.
> >
> > Questions: Do you think this solution is OK? Any suggestions for
> > alternatives or improvements?  Am I wrong in thinking that this kind
> > of need must be a common one in this field?  How do people normally
> > address it?  Distributed filesystems, perhaps?  Or something like a
> > lightweight http server on each host? Shared central storage is not an
> > option -- its scalability hits the ceiling way too quickly. Is the
> > performance penalty that Pavan just mentioned (for M_TH_MULTIPLE) of
> > considerable magnitude?  Do you think it could be avoided, while still
> > keeping the worker code reasonably isolated and unaware of the serving
> > part?
> >
> > Thanks in advance for any comments, and in retrospective for all the
> > useful info here.
> >
> > Sincerely
> > N.
>
> -- Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>

William Gropp
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign





