[mpich-discuss] MPI_Probe

Nicolas Rosner nrosner at gmail.com
Mon Jan 9 15:54:24 CST 2012


Hi Jie,

> In my application each process needs data from every other process
> and performs calculations.

Well, if such exchanges were more synchronous in nature (I mean some
kind of synchronization over the system seen as a whole, e.g. as in
the case of "waves" of computation, where the amount by which any
worker may be "ahead of the rest" at any given time can be bounded a
priori by some constant), then I'd also consider collectives like
all-to-all scatter/gather, etc.
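For instance, just to make the "wave" idea concrete (this is a hedged
sketch, not your actual exchange -- the one-int-per-peer payload is
invented, and you'd compile with mpicc and run under mpiexec):

```c
/* Sketch: one synchronized "wave" of exchange via MPI_Alltoall.
 * Every rank contributes one int to every other rank per round. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * 100 + i;   /* dummy payload meant for rank i */

    /* The collective is the synchronization: no rank proceeds past this
     * call with stale data, which is exactly what bounds the "wave". */
    MPI_Alltoall(sendbuf, 1, MPI_INT,
                 recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d from rank 0\n", rank, recvbuf[0]);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

(With uneven per-peer amounts you'd use MPI_Alltoallv instead, with
explicit counts and displacements.)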


> Even though the data are evenly distributed among processes,
> the work loads for each data are different.

Even with large variance, have you measured some estimate of the
avg/min/max task length?  How does it compare in magnitude with your
avg/min/max latency and other "external" overheads?  That ratio, even
a coarse approximation of it, could be useful for deciding what is
[not] worth optimizing at this point.


> So other processes don't know when p0 requests data.

OK, this suggests no "waves" but rather an asynchronous, less
restricted protocol where anything may and should happen "as soon as
needed", possibly limited by "as soon as it's ready", etc.

Then a tailor-made async protocol can indeed be written to be more
efficient than something rigid that syncs a bit too much, e.g. by
minimizing the gaps in resource usage caused by the variance in task
length.

Of course that's no free lunch -- it goes hand in hand with a set of
protocols and situations that is harder to get right and harder to
debug.

(E.g. if you didn't care about the cost of such gaps, maybe you could
easily implement this on top of a few existing MPI collectives, with
near-zero risks and only a fraction of the time and effort.)



> Thus, a natural design is to let every process have 2 threads,
> while one thread is doing the normal calculations, the other
> is responsible for probing requests and sending/receiving data.

I'd still probably use subprocesses here unless I had good reasons not to.

Hmm, wait -- oh, so all your workers need to be constantly ready to
*serve out* data or problem pieces, asynchronously as well, to others
requesting them?

OK, yes, I probably would consider using two threads for that. You
only need two (2) of them, one of them is pretty straightforward, and
the added complications aren't really that significant.

Still, I'd try to make sure I understand any new-to-me MPI primitives
in a simpler, thread-free context before incorporating them into the
actual main project.
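The rough shape I'd have in mind for the two-thread design is below
(a sketch under my own assumptions -- TAG_REQ and the "serve the
piece back" part are illustrative stand-ins, and pthreads is just one
way to get the second thread). The one genuinely MPI-specific detail:
request MPI_THREAD_MULTIPLE at init and check what you actually got,
since calling MPI from both threads is only legal at that level.

```c
/* Sketch: compute thread + service thread sharing one MPI process. */
#include <mpi.h>
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define TAG_REQ 1                 /* illustrative request tag */

static atomic_int done = 0;       /* set by main thread when finished */

static void *service_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&done)) {
        MPI_Status st;
        int flag = 0;
        /* Iprobe rather than Probe, so the loop can notice 'done'
         * and exit instead of blocking forever on a last message. */
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQ, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            int req;
            MPI_Recv(&req, 1, MPI_INT, st.MPI_SOURCE, TAG_REQ,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... look up the requested piece, MPI_Send it back ... */
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    pthread_t svc;
    pthread_create(&svc, NULL, service_thread, NULL);

    /* ... main thread: normal calculations, requests to peers ... */

    atomic_store(&done, 1);
    pthread_join(svc, NULL);
    MPI_Finalize();
    return 0;
}
```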



> What concerns me, is the efficacy of this design.

Investing time now trying to predict the impact of your design
decisions sounds very good, in the case of any broad, big-O,
potential-bottleneck, limiting-factor issues (as opposed to premature
optimization of details that may or may not matter).



> Since probing costs CPU cycles, and since after one process sends the data request,
> it has to wait for the other process to receive the request and to send the data,
> there might be a lot of overhead introduced in addition to the normal calculations.

Assuming a worker's computation tasks typically take a good few orders
of magnitude longer than the typical message latency, it sounds like
said overhead would tend to be dominated by the "is the data ready to
be served" factor, which you don't seem to mention here explicitly, do
you?

In other words, I think whether you can guarantee that anything anyone
might be asked to serve will actually be fully calculated and ready by
that moment is the main unbounded variable; the others (message
transfer time, possible delay until the peer sees the request, polling
intervals, etc.) should, under the above assumption, pale in
comparison. Even if the former can somehow be easily guaranteed, I'd
do the math to see to what extent the latter could actually become an
issue, in the context of other (inevitable) costs.



> The MPI standard only defines what probing does, but it does not define how
> probing is implemented. So I guess it would be hard to estimate in advance
> how large the overhead is.

Correct -- if you use Probe, you have to trust that it is implemented
reasonably, with no guarantees about the finer details. But the same
goes for Recv(), doesn't it? Why should that matter more for Probe?

A specific under-the-hood implementation detail, such as whether Probe
polls 71 or 504 times per unit of time while blocked, is unlikely to
end up being your main bottleneck or source of overhead.

We're expected to assume it could be any value reasonable for the
function's general purpose on that platform (i.e. neither an absurdly
tight busy-wait nor a way-too-long sleep). If Probe's spec
functionally suits your needs, you should be able to trust the
implementation that much... otherwise you'd more probably be either
optimizing prematurely or using the wrong tool for the job, I think.

If you really care, though, you could always do the polling yourself
(using Iprobe instead of Probe), couldn't you? But I'd be surprised if
such a decision paid off merely because you could fine-tune the
polling frequency.
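The do-it-yourself version is only a few lines -- roughly this (the
100-microsecond backoff is an arbitrary placeholder; it is precisely
the knob you'd be tuning, which is my point about how little there is
to win here):

```c
/* Sketch: hand-rolled equivalent of a blocking Probe with a tunable
 * polling interval.  POLL_US is an arbitrary illustrative choice. */
#include <mpi.h>
#include <unistd.h>

#define POLL_US 100   /* sleep between polls; the knob in question */

static void probe_with_backoff(int tag, MPI_Status *st)
{
    int flag = 0;
    while (!flag) {
        MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, st);
        if (!flag)
            usleep(POLL_US);   /* yield the CPU instead of busy-waiting */
    }
    /* On return, *st describes a matchable message, as with Probe. */
}
```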



> On the other hand, knowing the overhead is beneficial,

I'm not sure what you mean here by that.


Hope this helps,

Nicolás.

