[mpich-discuss] mpich in Firmware

Jeff Hammond jhammond at alcf.anl.gov
Wed Nov 7 11:03:38 CST 2012


Hi Bob,

MPICH2 is used as the basis for very efficient, low-level
implementations of MPI, but not via the mechanisms you suggest.

Putting MPI in the operating system is not efficient because of the
overhead required to context switch into kernel space.  All of the
high-performance interconnects that I am familiar with try to avoid
the OS as much as possible.  A common feature to provide is user-space
RDMA, wherein the user process can write to remote memory without OS
involvement on either side.  In some cases (e.g. InfiniBand),
registration of memory for RDMA is expensive, but the cost can be
amortized with registration caching.
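
To illustrate the idea (this is not MPICH2's actual code), here is a
minimal registration-cache sketch over libibverbs; the one-slot cache
and the get_mr() helper are hypothetical, but ibv_reg_mr/ibv_dereg_mr
are the real verbs calls whose cost one wants to amortize:

#include <infiniband/verbs.h>
#include <stddef.h>

/* Hypothetical one-slot registration cache, for illustration only. */
struct reg_cache_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;     /* memory region handle from ibv_reg_mr() */
};

static struct reg_cache_entry cache;

/* Return a registered MR covering [addr, addr+len), registering only
 * on a miss.  ibv_reg_mr() pins pages and programs the NIC, which is
 * the expensive step that caching amortizes across many transfers. */
struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    if (cache.mr && cache.addr == addr && cache.len >= len)
        return cache.mr;                  /* hit: no re-registration */
    if (cache.mr)
        ibv_dereg_mr(cache.mr);           /* evict the stale entry */
    cache.mr = ibv_reg_mr(pd, addr, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    cache.addr = addr;
    cache.len  = len;
    return cache.mr;
}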

On Cray Seastar (the XT platform), the NIC could do tag and communicator
matching in hardware without CPU involvement, which reduced the overhead of
send-recv.  On Blue Gene/P, remote interrupts can fire user-defined
callbacks without polling; furthermore, data is moved by RDMA to avoid
copy overhead.
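
For concreteness, "matching" here is the (communicator, source, tag)
lookup MPI performs to pair a send with a posted receive; a toy
example of the operation that Seastar-class NICs resolve in hardware:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, /* tag */ 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The library (or the NIC, when offloaded) scans the posted
         * receives for one whose communicator, source, and tag match. */
        MPI_Recv(&buf, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 matched tag 7, got %d\n", buf);
    }
    MPI_Finalize();
    return 0;
}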

QPI is not the appropriate protocol to use for MPI.  QPI provides
cache-coherent load-store access between the processor sockets and
memory.  MPICH2 has a shared-memory communication subsystem (nemesis)
that implements send-recv between processes on the same node using
only load-store, so this already leverages QPI, but without using
anything specific to it.
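
As a rough sketch of that intra-node load-store style (plain POSIX
shared memory here for simplicity; nemesis's real lock-free queues are
more elaborate, and the segment name below is made up), note that the
loads and stores cross QPI transparently via cache coherence when the
two processes sit on different sockets:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create and map a 4 KiB shared segment (name is illustrative). */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, 4096);
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    buf[0] = 0;                           /* flag byte: 0 = not ready */

    if (fork() == 0) {                    /* child acts as the sender */
        strcpy(buf + 1, "hello via load-store");
        __sync_synchronize();             /* publish payload before flag */
        buf[0] = 1;
        _exit(0);
    }

    volatile char *flag = buf;            /* parent polls with plain loads */
    while (*flag == 0)
        ;                                 /* no syscall on the fast path */
    printf("received: %s\n", buf + 1);

    munmap(buf, 4096);
    shm_unlink("/demo_shm");
    return 0;
}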

To summarize, many vendors (e.g. Cray and IBM) work very hard to
implement hardware and OS features that enable MPI to be efficient.
This is a much better use of human resources than having MPICH2 (or
any other MPI) write to bare-metal APIs directly.  If you examine
the performance of MPI on Cray Seastar and IBM Blue Gene, you'll find
that the solutions used there are quite effective despite not
requiring huge changes to the MPICH2 code base.  Cray has a netmod for
their NICs.  IBM writes their own device (e.g. dcmfd or pamid in
MPICH2), but this is partly because their middleware (e.g. DCMF or
PAMI) does a lot of the heavy lifting that would otherwise live inside
ch3 (partly because IBM Blue Gene does not run Linux but rather a
lightweight kernel, and for other reasons not germane to this
discussion).

Best,

Jeff

On Mon, Oct 22, 2012 at 2:18 PM, Bob Ilgner <bobilgner at gmail.com> wrote:
> Hi Marcelo. That is a nice objective and does not need to change. The
> interface definitions should not change as they are magnificent.
>
> Most of the clusters I work on are Intel-based blades, and I was
> wondering whether one could gain some advantage by embedding MPI
> somehow, in the same sense as QPI?  Intel already has a threading lib;
> no idea whether it can do the same as MPI though.
>
> Kind regards, bob
>
> On Mon, Oct 22, 2012 at 8:05 PM, Marcelo Soares Souza
> <marcelo at cebacad.net> wrote:
>>   I believe that one of the main objectives of mpich is to be portable,
>> multiplatform, and decoupled from any specific platform.
>>
>> 2012/10/22 Bob Ilgner <bobilgner at gmail.com>:
>>> In order to make the MPI process more efficient, has anyone considered
>>> implementing mpich2 as part of an operating system, or maybe even as
>>> firmware?  Has anything been attempted in this direction?
>>
>> --
>> Regards
>> Marcelo Soares Souza
>> http://marcelo.cebacad.net



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

