[mpich-discuss] thread MPI calls
Pavan Balaji
balaji at mcs.anl.gov
Mon Jul 27 15:35:48 CDT 2009
Tan,
Would it be possible to send us your application, or better still, a
simple benchmark that can reproduce this problem?
It's hard to debug such performance issues over email without actually
looking through the code.
-- Pavan
On 07/27/2009 01:25 PM, chong tan wrote:
>
> Nicolas and Pavan, and D,
>
> A tiny bit of info from my work. When we run our parallelized
> application, we like to see the time spent in communication (MPICH2,
> that is) at 30% or less. That includes setting up the data exchange and
> processes having to wait for other processes. Our application makes a
> large number of communication calls, in the 100 billion to few trillion
> range, over a period of hours to months.
>
> The data sent between processes range from as little as 8 bytes to a
> few MB. Most of the time, the data are small, < 128 bytes.
>
> MPICH2 does a good job of buffering sends, but that did not help us
> much, since there is one recv for each send.
>
> Our monitoring shows most recvs take around 0.6 us on most of our boxes.
> We are talking about $12K boxes here, so that is not bad.
>
> Currently, we observe 5-10% of time spent in recv, which we hope to
> eliminate by parallelizing the recv operations (while the main thread is
> doing meaningful work). That fits our application very well, as the
> recvs can be isolated easily.
>
> We have tried these on tests that run for hours:
> - non-threaded, blocking recv: this is still the fastest solution
> - non-threaded, Irecv: bad
> - non-threaded, pre-launched Irecv: bad, but not too bad
> - thread multiple: very bad
> - thread multiple with Irecv: not as bad as "very bad"
> - thread funneled: super bad
>
> We analyzed a few tests where the 'recv' thread could have run nicely in
> parallel with the main thread, and yet negative gains were observed. We
> don't have any theory why that could have happened.
>
> So, we are particularly curious about what is happening with thread
> multiple. We have one thing in common between the threaded and the
> non-threaded Irecv tests: Waitall. Could this be the cause?
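>
> In case it helps, the recv thread in our thread-multiple test is, in
> spirit, something like the sketch below (names and counts here are
> illustrative, not our real code; the main thread does the sends and the
> meaningful work):
>
> #include <mpi.h>
> #include <pthread.h>
>
> #define NPEER 4                      /* illustrative: ranks we recv from  */
> static char bufs[NPEER][128];        /* most of our messages are < 128 B  */
>
> static void *recv_thread(void *arg)
> {
>     MPI_Request req[NPEER];
>     for (int i = 0; i < NPEER; i++)  /* post one recv per expected msg    */
>         MPI_Irecv(bufs[i], 128, MPI_BYTE, MPI_ANY_SOURCE, 0,
>                   MPI_COMM_WORLD, &req[i]);
>     MPI_Waitall(NPEER, req, MPI_STATUSES_IGNORE);  /* the suspect Waitall */
>     return NULL;
> }
>
> int main(int argc, char **argv)
> {
>     int provided;
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>     /* real code checks 'provided' here */
>
>     pthread_t tid;
>     pthread_create(&tid, NULL, recv_thread, NULL);
>     /* ... main thread computes and does its own sends here ... */
>     pthread_join(tid, NULL);
>
>     MPI_Finalize();
>     return 0;
> }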
>
> thanks
> tan
>
> ------------------------------------------------------------------------
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Sent: Saturday, July 25, 2009 2:40:45 PM
> Subject: Re: [mpich-discuss] thread MPI calls
>
> Nicholas,
>
> From what I understand about your application, there are two approaches
> you can use:
>
> 1. Use no threads -- in this case, each worker posts Irecv's from all
> processes it's expecting messages from (using MPI_ANY_SOURCE if needed)
> and does a loop similar to:
>
> till_work_is_done {
> compute();
> Waitany() or Testany()
> }
>
> This is the approach most master-worker applications, such as mpiBLAST,
> tend to use.
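>
> As a concrete (if simplified) illustration, that loop might look like
> the sketch below; NPEERS, MAXMSG, work_is_done(), compute() and handle()
> are placeholders for the application's own pieces:
>
> #include <mpi.h>
>
> #define NPEERS 4           /* placeholder: number of peers                */
> #define MAXMSG 1024        /* placeholder: largest expected message       */
>
> int  work_is_done(void);   /* placeholders supplied by the application    */
> void compute(void);
> void handle(char *msg);
>
> void worker_loop(const int *peer, int tag)
> {
>     MPI_Request req[NPEERS];
>     char        buf[NPEERS][MAXMSG];
>
>     for (int i = 0; i < NPEERS; i++)     /* pre-post one Irecv per peer   */
>         MPI_Irecv(buf[i], MAXMSG, MPI_BYTE, peer[i], tag,
>                   MPI_COMM_WORLD, &req[i]);
>
>     while (!work_is_done()) {
>         int idx, flag;
>         compute();                       /* do useful work between polls  */
>         MPI_Testany(NPEERS, req, &idx, &flag, MPI_STATUS_IGNORE);
>         if (flag && idx != MPI_UNDEFINED) {
>             handle(buf[idx]);            /* consume the completed message */
>             MPI_Irecv(buf[idx], MAXMSG, MPI_BYTE, peer[idx], tag,
>                       MPI_COMM_WORLD, &req[idx]);    /* re-post           */
>         }
>     }
> }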
>
> 2. Otherwise, you can use threads as you suggested below. It is true
> that there is some overhead, but unless you are using very small
> messages (e.g., < 64 bytes) and performing a lot of communication
> (e.g., 90% of your application's time is communication), you won't
> notice any overhead.
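>
> If you do go with threads, remember to request the right level of thread
> support at initialization and check what the library actually provides,
> along these lines:
>
> int provided;
> MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> if (provided < MPI_THREAD_MULTIPLE) {
>     /* the library cannot safely handle MPI calls from multiple threads */
>     fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
>     MPI_Abort(MPI_COMM_WORLD, 1);
> }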
>
> In any case, we are working on some enhanced designs that will minimize
> threading overheads even in such rare cases (some of them are
> experimentally included in the mpich2-1.1.x series).
>
> -- Pavan
>
> On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
> > Sorry to shamelessly invade Tan's thread, but since we're in the middle of a
> > thread about threads, I thought I'd rephrase an old question I once
> > tried to state here -- with a lot less understanding of the problem
> > back then. Context: My app is a rather typical
> > single-centralized-master, N-1-worker, pool-of-tasks setup, except
> > that workers don't just consume, but may also split tasks that seem
> > too hard, which ends up pushing large series of new child tasks back
> > to where the parent had been obtained, and so on, recursively.
> >
> > Now, most task-related messages and structures don't handle the
> > real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
> > objects (~1 KB) that represent them -- which works fine in many
> > situations where some subset of metadata suffices. However, after a
> > worker gets a task assignment, at some point it does need the complete
> > file. Conversely, when one is divided, the splitting worker needs to
> > store the new files somewhere. The current approach is as follows:
> >
> > - all new task files produced by a worker W are stored to a local
> >   HD on its host,
> >
> > - master is immediately notified about each creation (children are
> >   pushed back into the queue, but by reference only, including a
> >   "currently located at ..." field),
> >
> > - when master assigns a task to some other worker Z, the message
> >   includes said location, and
> >
> > - worker Z then sends a msg to worker W requesting the task, which
> >   W sends as one fat MPI message (a rough sketch follows).
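> >
> > On W's side, that serving step boils down to something like this (the
> > tags, the lookup helper and the buffer handling are illustrative
> > placeholders, not my actual code):
> >
> > #include <mpi.h>
> > #include <stddef.h>
> >
> > enum { REQ_TAG = 100, DATA_TAG = 101 };   /* illustrative tags          */
> > char *lookup_task(long id, size_t *len);  /* reads the file from the HD */
> >
> > /* file-server loop on worker W (it runs in its own thread, see below) */
> > void serve_files(void)
> > {
> >     for (;;) {
> >         long task_id;
> >         MPI_Status st;
> >         /* wait for any worker Z to request one of W's task files */
> >         MPI_Recv(&task_id, 1, MPI_LONG, MPI_ANY_SOURCE, REQ_TAG,
> >                  MPI_COMM_WORLD, &st);
> >         size_t len;
> >         char *data = lookup_task(task_id, &len);
> >         /* ship the whole file back as one fat message */
> >         MPI_Send(data, (int)len, MPI_BYTE, st.MPI_SOURCE, DATA_TAG,
> >                  MPI_COMM_WORLD);
> >     }
> > }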
> >
> > This achieves decent load balancing, but although a worker can't
> > really do anything productive while waiting for a totally new
> > datafile, it may certainly not block if it is to stay deadlock-free --
> > it [or *some* component, anyway] needs to be ready to serve any
> > potential incoming "full datafile requests" from other workers within
> > some constant amount of delay that may not depend on its own pending
> > request.
> >
> > So, I first tried the nonblocking approach; no good, worker + file
> > server chores combined yield a giant statechart, way too complex to
> > debug and maintain. Then, trying to stay away from hybrid thread+MPI
> > code, I tried separating the worker and server as two different
> > processes, and ran twice as many processes as available processors.
> > Although Linux did a pretty good job scheduling them (not surprising
> > since they're almost purely cpu and i/o bound, respectively), there is
> > some performance penalty, plus it's messy to keep track of how
> > mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
> > were fine suddenly become unusable and must be avoided, etc.
> >
> > Eventually I gave up on my "no-hybrid" and "don't want to depend on
> > thread_multiple support" wishes, got myself a pthreads tutorial and
> > ended up with a worker version that uses a 2nd thread (+ possibly
> > several subthreads thereof) to keep serving files regardless of main
> > thread (actual worker code) status -- and with cleaner, better
> > separated code. (Note that there is little or no interaction between
> > Wi and Si once launched -- Si just needs to keep serving files that Wi
> > produced earlier on but doesn't even remember, let alone care about,
> > anymore. Both need MPI messaging all the time, though.)
> >
> > Bottom line: as scary as it looked to me to go hybrid, and despite
> > several warnings from experts against it, in this particular case it
> > turned out to be simpler than many of the clumsy attempts at avoiding
> > it, and probably the "least ugly" approach I've found so far.
> >
> > Questions: Do you think this solution is OK? Any suggestions for
> > alternatives or improvements? Am I wrong in thinking that this kind
> > of need must be a common one in this field? How do people normally
> > address it? Distributed filesystems, perhaps? Or something like a
> > lightweight http server on each host? Shared central storage is not an
> > option -- its scalability hits the ceiling way too quickly. Is the
> > performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of
> > considerable magnitude? Do you think it could be avoided, while still
> > keeping the worker code reasonably isolated and unaware of the serving
> > part?
> >
> > Thanks in advance for any comments, and in retrospective for all the
> > useful info here.
> >
> > Sincerely
> > N.
>
> -- Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji