[mpich-discuss] Re: [MPICH] build mpich2 with Myrinet GM

Wei-keng Liao wkliao at ece.northwestern.edu
Fri Feb 29 09:31:45 CST 2008


Darius,

The ROMIO alltoallv has no problem, but one of my test programs still
fails. Attached is that program; it is based on the alltoallv test, with
the message size increased to 1 MB per process. The program just hangs.

Wei-keng


On Thu, 28 Feb 2008, Darius Buntinas wrote:
> 
> Sorry, that last patch was against a different version.  Can you try this
> patch?  You might get some warnings about changes having already been
> applied, since the previous patch already made some of the changes.  You
> can ignore those.
> 
> -d
> 
> 
> On 02/28/2008 11:38 AM, Wei-keng Liao wrote:
> > I got an error when I applied the patch:
> > mercury::mpich2-1.0.6p1(11:34am) #448% patch -p0 < gm.patch
> > patching file src/mpid/ch3/include/mpidpre.h
> > patching file src/mpid/ch3/channels/nemesis/nemesis/net_mod/gm_module/gm_module_impl.h
> > Hunk #1 succeeded at 51 (offset -1 lines).
> > patching file src/mpid/ch3/channels/nemesis/nemesis/net_mod/gm_module/gm_module_poll.c
> > patching file src/mpid/ch3/channels/nemesis/nemesis/net_mod/gm_module/gm_module_send.c
> > Hunk #2 FAILED at 233.
> > Hunk #3 succeeded at 265 (offset -80 lines).
> > Hunk #4 succeeded at 343 (offset -81 lines).
> > 1 out of 4 hunks FAILED -- saving rejects to file src/mpid/ch3/channels/nemesis/nemesis/net_mod/gm_module/gm_module_send.c.rej
> > 
> > mercury::mpich2-1.0.6p1(11:36am) #450% cat src/mpid/ch3/channels/nemesis/nemesis/net_mod/gm_module/gm_module_send.c.rej
> > ***************
> > *** 237,243 ****
> >   {
> >       int mpi_errno = MPI_SUCCESS;
> >       char *dataptr;
> > -     int datalen;
> >       int complete;  
> >   
> >       while (active_send || !SEND_Q_EMPTY())
> > --- 233,239 ----
> >   {
> >       int mpi_errno = MPI_SUCCESS;
> >       char *dataptr;
> > +     MPIDI_msg_sz_t datalen;
> >       int complete;  
> >   
> >       while (active_send || !SEND_Q_EMPTY())
> > 
> > Wei-keng
> > 
> > 
> > On Thu, 28 Feb 2008, Darius Buntinas wrote:
> > 
> > > Thanks for reporting this.  Here's a patch that should fix it.  Let me
> > > know if you have any more trouble.
> > >
> > > Thanks,
> > > -d
> > >
> > > On 02/27/2008 10:18 PM, Wei-keng Liao wrote:
> > > > OK. The patch fixed the problem and I was able to build mpich. But when
> > > > I ran the alltoallv test in test/mpi/coll using 4 processes, it failed
> > > > with this error message:
> > > >   rank 2 in job 1 tg-c527_40397 caused collective abort of all ranks
> > > >   exit status of rank 2: killed by signal 9 
> > > >
> > > > The gdb on the coredump shows
> > > > (gdb) where
> > > > #0  0x20000000001c9120 in ?? ()
> > > > #1  0x40000000000a71d0 in send_pkt ()
> > > > #2  0x40000000000a6530 in MPID_nem_gm_iSendContig ()
> > > > #3  0x40000000000ab000 in MPIDI_CH3_iSendv ()
> > > > #4  0x400000000003e890 in MPIDI_CH3_EagerContigIsend ()
> > > > #5  0x4000000000048160 in MPID_Isend ()
> > > > #6  0x400000000000ec00 in MPIC_Isend ()
> > > > #7  0x400000000000a8e0 in MPIR_Alltoallv ()
> > > > #8  0x400000000000b2d0 in PMPI_Alltoallv ()
> > > > #9  0x4000000000003670 in main ()
> > > > #10 0x40000000000a71d0 in send_pkt ()
> > > >
> > > > Wei-keng
> > > >
> > > >
> > > > On Wed, 27 Feb 2008, Darius Buntinas wrote:
> > > >
> > > > > Sorry about that.  I guess I didn't test this on an Itanium after
> > > > > making some changes there.
> > > > >
> > > > > I've attached a patch file that should fix this.  I'm still not sure
> > > > > why it's not working with your Intel compiler.
> > > > >
> > > > > Apply the patch like this (from the mpich2 source directory)
> > > > >   patch -p0 < ia64_atomics.patch
> > > > >
> > > > > Then do a make clean and make.
> > > > >
> > > > > -d
> > > > >
> > > > > On 02/27/2008 11:33 AM, Wei-keng Liao wrote:
> > > > > > I got a different error when I built mpich with gcc 3.2.2, while
> > > > > > compiling the file nemesis/src/mpid_nem_alloc.c. (I set the FC
> > > > > > environment variable to ifort.)
> > > > > >
> > > > > > In file included from ../include/mpid_nem_impl.h:13,
> > > > > >                  from mpid_nem_alloc.c:7:
> > > > > > ../include/mpid_nem_atomics.h: In function `MPID_NEM_SWAP':
> > > > > > ../include/mpid_nem_atomics.h:27: warning: dereferencing `void *' pointer
> > > > > > ../include/mpid_nem_atomics.h: In function `MPID_NEM_CAS':
> > > > > > ../include/mpid_nem_atomics.h:54: warning: dereferencing `void *' pointer
> > > > > > ../include/mpid_nem_atomics.h: In function `MPID_NEM_FETCH_AND_INC':
> > > > > > ../include/mpid_nem_atomics.h:164: parse error before string constant
> > > > > >
> > > > > > Also, I tried Intel icc 8.1.037 and it failed with the same message
> > > > > > as icc 9.0.032 and 9.1.046.
> > > > > >
> > > > > > Wei-keng
> > > > > >
> > > > > >
> > > > > > On Tue, 26 Feb 2008, Darius Buntinas wrote:
> > > > > > > It looks like the icc compiler you're using doesn't like the
> > > > > > > gcc-style inline assembly code.
> > > > > > >
> > > > > > > What version of icc do you have?
> > > > > > > Can you try compiling with gcc instead of icc?
> > > > > > >
> > > > > > > -d
> > > > > > >
> > > > > > > On 02/26/2008 12:32 PM, Wei-keng Liao wrote:
> > > > > > > > Attached are 3 files:
> > > > > > > >
> > > > > > > > out.configure  -  stdout from configure
> > > > > > > > out.make       -  stdout from make
> > > > > > > > config.log
> > > > > > > >
> > > > > > > > Wei-keng
> > > > > > > >
> > > > > > > > On Tue, 26 Feb 2008, Darius Buntinas wrote:
> > > > > > > >
> > > > > > > > > Can you send us the output of configure as well as config.log?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > -d
> > > > > > > > >
> > > > > > > > > On 02/26/2008 11:35 AM, Wei-keng Liao wrote:
> > > > > > > > > > I got an error during make:
> > > > > > > > > >
> > > > > > > > > > ../include/mpid_nem_atomics.h(31): catastrophic error: #error directive: No swap function defined for this architecture
> > > > > > > > > >   #error No swap function defined for this architecture
> > > > > > > > > >    ^
> > > > > > > > > > compilation aborted for mpid_nem_alloc.c (code 4)
> > > > > > > > > >
> > > > > > > > > > I am using configure options:
> > > > > > > > > >           --with-device=ch3:nemesis:gm  \
> > > > > > > > > >           --with-gm=/opt/gm \
> > > > > > > > > >           --enable-f77 --enable-f90 --enable-cxx \
> > > > > > > > > >           --enable-fast \
> > > > > > > > > >           --enable-romio \
> > > > > > > > > >           --without-mpe \
> > > > > > > > > >           --with-file-system=ufs
> > > > > > > > > >
> > > > > > > > > > and the command "uname -a" on the machine is
> > > > > > > > > > Linux tg-login4 2.4.21-309.tg1 #1 SMP Thu Jun 1 17:07:28 CDT 2006 ia64 unknown
> > > > > > > > > >
> > > > > > > > > > I am using Intel compiler v 9.1.043
> > > > > > > > > >
> > > > > > > > > > Wei-keng
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, 26 Feb 2008, Darius Buntinas wrote:
> > > > > > > > > > > On 02/26/2008 10:08 AM, Wei-keng Liao wrote:
> > > > > > > > > > > > I have a few questions on building mpich2-1.0.6p1 with
> > > > > > > > > > > > the Myrinet GM library.
> > > > > > > > > > > >
> > > > > > > > > > > > On my target machine, the GM library (include, lib, bin,
> > > > > > > > > > > > etc.) is in /opt/gm. According to the MPICH README, I
> > > > > > > > > > > > used the 2 options below when configuring:
> > > > > > > > > > > >     --with-device=ch3:nemesis:gm  and --with-gm=/opt/gm
> > > > > > > > > > > >
> > > > > > > > > > > > I can see both libgm.a and libgm.so are in /opt/gm/lib.
> > > > > > > > > > > >
> > > > > > > > > > > > Q1: Do I need other configure options or environment
> > > > > > > > > > > >     variables (in addition to CC, FC, CXX, F90)? Should
> > > > > > > > > > > >     I set LDFLAGS to "-L/opt/gm/lib -lgm"?
> > > > > > > > > > > Nope, the --with-gm=/opt/gm should take care of all of
> > > > > > > > > > > that for you.
> > > > > > > > > > >
> > > > > > > > > > > > Q2: Since nemesis does not support MPI dynamic process
> > > > > > > > > > > >     routines yet and I need those routines, can I use
> > > > > > > > > > > >     --with-device=ch3:sock:gm instead?
> > > > > > > > > > > No, only nemesis supports gm.
> > > > > > > > > > >
> > > > > > > > > > > > Q3: Do I need anything else (source code, libraries)
> > > > > > > > > > > >     from Myrinet to build mpich? Or is /opt/gm good
> > > > > > > > > > > >     enough?
> > > > > > > > > > > All you need is libgm.a and gm.h.
> > > > > > > > > > >
> > > > > > > > > > > > Q4: Once mpich is built, is there a way to verify that
> > > > > > > > > > > >     GM is actually used?
> > > > > > > > > > > Well, you should see a performance improvement over using
> > > > > > > > > > > sockets.  Run a ping-pong test; you should see latencies
> > > > > > > > > > > around 10us or less.
> > > > > > > > > > >
> > > > > > > > > > > -d
> > > > > > > > > > >
> > >
> > 
> 
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define MSG_LEN 1048576

/*----< main() >------------------------------------------------------------*/
int main(int argc, char **argv) {
    int   i, rank, np;
    int  *send_count, *recv_count, *sdispls, *rdispls;
    char *s_buf, *r_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    send_count = (int*)  malloc(np * sizeof(int));
    recv_count = (int*)  malloc(np * sizeof(int));
    sdispls    = (int*)  malloc(np * sizeof(int));
    rdispls    = (int*)  malloc(np * sizeof(int));
    s_buf      = (char*) malloc(np * MSG_LEN);
    r_buf      = (char*) malloc(np * MSG_LEN);

    for (i=0; i<np; i++) {
        send_count[i] = recv_count[i] =     MSG_LEN;
        sdispls[i]    = rdispls[i]    = i * MSG_LEN;
    }

    MPI_Alltoallv(s_buf, send_count, sdispls, MPI_CHAR,
                  r_buf, recv_count, rdispls, MPI_CHAR, MPI_COMM_WORLD);

    free(s_buf);      free(r_buf);
    free(sdispls);    free(rdispls);
    free(send_count); free(recv_count);

    MPI_Finalize();
    return 0;
}
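
A possible hardening of the attached test, offered only as a sketch. The
program sends uninitialized buffers and never inspects r_buf, so it can
show a hang but not data corruption, and its int displacements
i * MSG_LEN would overflow once np exceeds 2048. The helpers below are
hypothetical: fill_layout, pattern_byte, fill_send_buf and
count_mismatches are names made up for illustration and are not part of
MPICH or of the attached file. They assume the same layout as the test,
with one equal-sized block per rank.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: fill counts[] and displs[] for np equal blocks of
 * msg_len bytes. Returns 0 on success, or -1 if a count or displacement
 * would not fit in the int arrays that MPI_Alltoallv takes. */
int fill_layout(int np, size_t msg_len, int *counts, int *displs)
{
    int i;

    assert(counts != NULL && displs != NULL);
    if (np <= 0 || msg_len > (size_t)INT_MAX)
        return -1;
    /* The largest displacement is (np - 1) * msg_len; test it by division
     * so the check itself cannot overflow. */
    if (msg_len != 0 && (size_t)(np - 1) > (size_t)INT_MAX / msg_len)
        return -1;

    for (i = 0; i < np; i++) {
        counts[i] = (int)msg_len;
        displs[i] = (int)((size_t)i * msg_len);
    }
    return 0;
}

/* Byte that rank src writes at offset off of the block sent to rank dst.
 * The formula is arbitrary; it only has to vary with src, dst and off. */
unsigned char pattern_byte(int src, int dst, size_t off)
{
    return (unsigned char)((src * 31 + dst * 17 + (int)(off % 251)) & 0xff);
}

/* Fill the send buffer of `rank`: block dst holds the bytes meant for dst. */
void fill_send_buf(char *s_buf, int rank, int np, size_t msg_len)
{
    int dst;
    size_t off;

    for (dst = 0; dst < np; dst++)
        for (off = 0; off < msg_len; off++)
            s_buf[(size_t)dst * msg_len + off] =
                (char)pattern_byte(rank, dst, off);
}

/* After the exchange, block src of r_buf should hold what src sent to
 * `rank`. Returns the number of bytes that differ from that pattern. */
size_t count_mismatches(const char *r_buf, int rank, int np, size_t msg_len)
{
    int src;
    size_t off, bad = 0;

    for (src = 0; src < np; src++)
        for (off = 0; off < msg_len; off++)
            if ((unsigned char)r_buf[(size_t)src * msg_len + off] !=
                pattern_byte(src, rank, off))
                bad++;
    return bad;
}
```

In the test, fill_layout would replace the loop that sets the counts and
displacements, fill_send_buf would run before MPI_Alltoallv, and
count_mismatches(r_buf, rank, np, MSG_LEN) would run after it. A nonzero
result would point at corrupted data rather than a hang.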


