[mpich-discuss] thread MPI calls

Mon Jul 27 13:25:35 CDT 2009

Nicolas, Pavan, and D,

A bit of information from my work.  When we run our parallelized application,
we like to see the time spent in communication (that is, in MPICH2) at 30% or
less.  That includes setting up the data exchange and processes having to wait
for other processes.  Our application makes a large number of communication
calls, from 100 billion to a few trillion, over a period of hours to months.

The data sent between processes ranges from as little as 8 bytes to a few MBytes.
Most of the time the messages are small, < 128 bytes.

MPICH2 does a good job of buffering sends, but that did not really help us much,
since there is one recv for each send.

Our monitoring shows that most recvs take around 0.6 us on most of our boxes.
We are talking about $12K boxes here, so that is not bad.
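
For a sense of scale, here is a small C sketch of the arithmetic behind these
figures. It assumes one recv per message and a flat 0.6 us per recv; the
function names are made up for illustration and are not part of our application.

```c
/* Total time spent in receives, in seconds.
 * n_recv:        number of receive calls over the whole run
 * usec_per_recv: average cost of one receive, in microseconds
 * Assumption: every receive costs the same, which is only roughly true. */
double recv_seconds(double n_recv, double usec_per_recv)
{
    return n_recv * usec_per_recv * 1e-6;
}

/* Convenience helper: the same figure expressed in hours. */
double recv_hours(double n_recv, double usec_per_recv)
{
    return recv_seconds(n_recv, usec_per_recv) / 3600.0;
}
```

With 100 billion messages, recv_seconds(1e11, 0.6) comes to about 60,000
seconds, or roughly 16.7 hours; at one trillion messages it is ten times that.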

Currently, we observe 5-10% of the run time spent in recv, which we hope to
eliminate by parallelizing the recv operations (while the main thread is doing
meaningful work).  That fits our application very well, as recv can be isolated
easily.

We have tried these on tests that run for hours:
 -  non-threaded, blocking recv       : this is still the fastest solution
 -  non-threaded, Irecv               : bad
 -  non-threaded, pre-launched Irecv  : bad, but not too bad
 -  thread multiple                   : very bad
 -  thread multiple with Irecv        : not as bad as very bad
 -  thread funneled                   : super bad

We analyzed a few tests where the 'recv' thread could have run nicely in parallel
with the main thread, and yet negative gains were observed.  We don't have any
theory as to why that could have happened.

So, we are particularly curious about what is happening with thread multiple.
The threaded and the non-threaded Irecv tests have one thing in common: Waitall.
Could this be the cause?

thanks
tan

________________________________
From: Pavan Balaji <balaji at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Saturday, July 25, 2009 2:40:45 PM
Subject: Re: [mpich-discuss] thread MPI calls

Nicolas,

From what I understand about your application, there are two approaches you can use:

1. Use no threads -- in this case, each worker posts Irecvs for all processes it's expecting messages from (using MPI_ANY_SOURCE if needed) and runs a loop similar to:

till_work_is_done {
    compute();
    Waitany() or Testany()
}

This is the approach most master-worker applications, such as mpiBLAST, tend to use.

2. Otherwise, you can use threads as you suggested below. It is true that there is some overhead, but unless you are using very small messages (e.g., < 64 bytes), and performing a lot of communication (e.g., 90% of your application is communication time), you'll not notice any overhead.

In any case, we are working on some enhanced designs that will minimize threading overheads even in such rare cases (some of them are experimentally included in the mpich2-1.1.x series).

-- Pavan

On 07/25/2009 01:38 PM, Nicolas Rosner wrote:
> Sorry to shamelessly invade Tan's, but since we're in the middle of a
> thread about threads, I thought I'd rephrase an old question I once
> tried to state here -- with a lot less understanding of the problem
> back then. Context:  My app is a rather typical
> single-centralized-master, N-1-worker, pool-of-tasks setup, except
> that workers don't just consume, but may also split tasks that seem
> too hard, which ends up pushing large series of new child tasks back
> to where the parent had been obtained, and so on, recursively.
> 
> Now, most task-related messages and structures don't handle the
> real/full tasks (~1 MB to 10 MB of data each), but lightweight proxy
> objects (~1 KB) that represent them -- which works fine in many
> situations where some subset of metadata suffices. However, after a
> worker gets a task assignment, at some point it does need the complete
> file. Conversely, when one is divided, the splitting worker needs to
> store the new files somewhere. The current approach is as follows:
> 
>    - all new task files produced by a worker W are stored to a local
> HD on its host
> 
>    - master is immediately notified about each creation (children are
> pushed back into queue,
>      but by reference only, including a "currently located at ..." field),
> 
>    - when master assigns a task to some other worker Z, the message
> includes said location, and
> 
>    - worker Z then sends a msg to worker W requesting the task, which
> W sends as one fat MPI message.
> 
> This achieves decent load balancing, but although a worker can't
> really do anything productive while waiting for a totally new
> datafile, it may certainly not block if it is to stay deadlock-free --
> it [or *some* component, anyway] needs to be ready to serve any
> potential incoming "full datafile requests" from other workers within
> some constant amount of delay that may not depend on its own pending
> request.
> 
> So, I first tried the nonblocking approach; no good, worker + file
> server chores combined yield a giant statechart, way too complex to
> debug and maintain. Then, trying to stay away from hybrid thread+MPI
> code, I tried separating the worker and server as two different
> processes, and ran twice as many processes as available processors.
> Although Linux did a pretty good job scheduling them (not surprising
> since they're almost purely cpu and i/o bound, respectively), there is
> some performance penalty, plus it's messy to be keeping track of how
> mpiexec deals roles to hosts, e.g. lots of rank -> host mappings that
> were fine suddenly become unusable and must be avoided, etc.
> 
> Eventually I gave up on my "no-hybrid" and "don't want to depend on
> thread_multiple support" wishes, got myself a pthreads tutorial and
> ended up with a worker version that uses a 2nd thread (+ possibly
> several subthreads thereof) to keep serving files regardless of main
> thread (actual worker code) status -- and with cleaner, better
> separated code. (Note that there is little or no interaction between
> Wi and Si once launched -- Si just needs to keep serving files that Wi
> produced earlier on but doesn't even remember, let alone care about,
> anymore. Both need MPI messaging all the time, though.)
> 
> Bottom line: as scary as it looked to me to go hybrid, and despite
> several warnings from experts against it, in this particular case it
> turned out to be simpler than many of the clumsy attempts at avoiding
> it, and probably the "least ugly" approach I've found so far.
> 
> Questions: Do you think this solution is OK? Any suggestions for
> alternatives or improvements?  Am I wrong in thinking that this kind
> of need must be a common one in this field?  How do people normally
> address it?  Distributed filesystems, perhaps?  Or something like a
> lightweight http server on each host? Shared central storage is not an
> option -- its scalability hits the ceiling way too quickly. Is the
> performance penalty that Pavan just mentioned (for MPI_THREAD_MULTIPLE) of
> considerable magnitude?  Do you think it could be avoided, while still
> keeping the worker code reasonably isolated and unaware of the serving
> part?
> 
> Thanks in advance for any comments, and in retrospect for all the
> useful info here.
> 
> Sincerely
> N.

-- Pavan Balaji
http://www.mcs.anl.gov/~balaji
