[mpich-discuss] Large MPI messages?

Rob Ross rross at mcs.anl.gov
Tue Sep 30 21:29:53 CDT 2008


Hi Nicolas,

You should be able to send large messages without any problem. You do
need to be careful to allow progress to be made, though: for example,
when a pair of processes exchange large messages, each one is best off
posting a non-blocking receive before doing its send. Others here can
give more detail on that topic if necessary.
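
A minimal sketch of that pattern, assuming exactly two ranks and a
made-up 1 MB payload (the buffer size and tag are arbitrary):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (1 << 20)  /* 1 MB, well past typical eager-send limits */

int main(int argc, char **argv)
{
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = (rank == 0) ? 1 : 0;          /* assumes exactly 2 ranks */

    char *sendbuf = malloc(BUF_SIZE);
    char *recvbuf = malloc(BUF_SIZE);
    memset(sendbuf, rank, BUF_SIZE);

    /* Post the receive first, so the peer's large send always has a
     * matching receive even if MPI_Send blocks until it is matched. */
    MPI_Irecv(recvbuf, BUF_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(sendbuf, BUF_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

MPI_Sendrecv would work just as well here; the point is simply that
neither rank sits in a blocking send while no matching receive has been
posted on the other side.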

That's a pretty awful setup for your storage. You're probably right  
that moving data around through MPI might do better, particularly if  
you could execute the "I/O process" on the NFS server and avoid NFS  
altogether. Or you could try running a file system that supports IB  
(e.g. PVFS, Lustre), or perhaps using your NFS over IPoIB.
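
To make the "I/O process" idea concrete (purely a sketch, with an
invented tag and file-naming scheme), one rank placed on the file
server could accept task payloads over MPI and write them to its local
disk, so the compute ranks never touch NFS:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_TASK 42  /* invented tag for task payloads */

/* Service loop for the rank running on the file-server node. */
void io_server(MPI_Comm comm)
{
    long serial = 0;
    for (;;) {                   /* sketch only: no shutdown handling */
        MPI_Status st;
        int nbytes;

        /* Find out how large the next task is before receiving it. */
        MPI_Probe(MPI_ANY_SOURCE, TAG_TASK, comm, &st);
        MPI_Get_count(&st, MPI_BYTE, &nbytes);

        char *buf = malloc(nbytes);
        MPI_Recv(buf, nbytes, MPI_BYTE, st.MPI_SOURCE, TAG_TASK,
                 comm, MPI_STATUS_IGNORE);

        char fname[64];
        snprintf(fname, sizeof fname, "task-%06ld.bin", serial++);
        FILE *f = fopen(fname, "wb");  /* local disk, not an NFS mount */
        fwrite(buf, 1, (size_t)nbytes, f);
        fclose(f);
        free(buf);
    }
}

A compute rank would then do a plain MPI_Send of the serialized task to
that rank instead of writing a file into the shared home directory.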

Rob

On Sep 30, 2008, at 8:57 PM, Nicolas Rosner wrote:

> Hello mpich-discuss,
>
> What do you do if you need to send really big messages (like, say,
> 500KB, or even a few megabytes) over MPI? Is that possible at all? Any
> experience with fragmentation schemes if there's a hard limit to this?
>
> My app uses a "pool of tasks" approach. Agents both consume tasks from
> the pool and push new (hopefully smaller) tasks back into it. A
> dedicated "pool" process keeps the requests synchronized, and the fact
> that the pool is a centralized entity isn't that much of a concern
> because the average task takes rather long to execute (on the order
> of minutes), and the pool only deals in task IDs (strings, basically,
> that identify files on secondary storage).
>
> The problem is, the "secondary storage" is currently just a shared
> NFS directory (on this cluster, by default, your home dir, which
> lives on /home on the head node's hard drive, is mounted via NFS on
> every node). So you end up with a few hundred agents fighting each
> other while trying to create, write and read thousands of
> half-megabyte files in one shared directory (on one and the same
> physical drive) concurrently, and that is seriously not scaling.
>
> Plus, the cluster features an InfiniBand switch, but apparently that's
> only used by MPI via MVAPICH, while NFS traffic goes through plain old
> Ethernet. While IB's ultra-low latency isn't that much of an advantage
> here (since message frequency is laughable compared to average task
> completion time), its high transfer rate could be very useful.
>
> So, on the one hand, I really should distribute the storage load (at
> the very least, the agents could store the new tasks they generate
> locally, and the pool could serve [taskID, hostWhoHasIt] pairs instead
> of just task IDs). But, on top of that, it would be very nice if I
> could make the agents exchange tasks through MPI directly; this would
> not only benefit from IB's speed, but also avoid the trip through the
> hard disk altogether.
>
> Any input, ideas, suggestions, warnings, etc. will be greatly
> appreciated.
>
> TIA,
> Nicolás
>
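
For the direct agent-to-agent exchange described above (the pool hands
out [taskID, hostWhoHasIt] pairs and the task itself travels over MPI),
the fetching side might look roughly like this; the tags, the request
format and fetch_task itself are all invented for illustration:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define TAG_REQ  1  /* invented: "please send me this task" */
#define TAG_DATA 2  /* invented: the serialized task itself  */

/* Ask the rank that holds task_id for the task and return its bytes;
 * the payload size is reported back through *nbytes. */
char *fetch_task(const char *task_id, int owner_rank, int *nbytes,
                 MPI_Comm comm)
{
    MPI_Status st;

    /* The request is just the task ID string. */
    MPI_Send((void *)task_id, (int)strlen(task_id) + 1, MPI_CHAR,
             owner_rank, TAG_REQ, comm);

    /* The reply is variable-sized, so probe for its length first. */
    MPI_Probe(owner_rank, TAG_DATA, comm, &st);
    MPI_Get_count(&st, MPI_BYTE, nbytes);

    char *buf = malloc(*nbytes);
    MPI_Recv(buf, *nbytes, MPI_BYTE, owner_rank, TAG_DATA, comm,
             MPI_STATUS_IGNORE);
    return buf;
}

The owning agent would need a matching service loop (a thread, or
periodic MPI_Iprobe calls between tasks) answering TAG_REQ messages,
which is exactly the "allow progress to be made" caveat above.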



