[MPICH] Problem setting up MPICH between a 32 bit INTEL and a 32 bit AMD machine

Krishna Chaitanya kris.c1986 at gmail.com
Thu Feb 28 13:42:12 CST 2008


Sorry for the typo:
> To make this happen, MPID_Progress_Wait must never get called
> and so the MPI_Wait must constantly. Is that correct?

To make this happen, MPID_Progress_Wait must never get called,
and so MPI_Wait must constantly poll. Is that correct?

Krishna Chaitanya K
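
To make the question concrete, here is a toy Python model of the two
receive queues as I understand them. It is only a sketch, not MPICH code:
the names ToyRecvQueues, on_message_arrival and on_irecv are made up for
illustration, and I am assuming that FDU_or_AEP reads as "find and dequeue
unexpected, or allocate and enqueue posted". In this model a message is
unexpected simply because it arrives before the matching receive is posted.

```python
from dataclasses import dataclass

ANY = -1  # stand-in for a wildcard such as MPI_ANY_SOURCE or MPI_ANY_TAG


@dataclass
class Envelope:
    source: int
    tag: int
    payload: object = None


def matches(recv, msg):
    """A receive matches a message if source and tag agree or are wildcards."""
    return recv.source in (ANY, msg.source) and recv.tag in (ANY, msg.tag)


class ToyRecvQueues:
    """Hypothetical model of a posted queue and an unexpected queue."""

    def __init__(self):
        self.posted = []      # receives waiting for a message
        self.unexpected = []  # messages waiting for a receive

    def on_message_arrival(self, msg):
        """Arrival side: match a posted receive, else queue as unexpected."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                return ("matched_posted", msg.payload)
        self.unexpected.append(msg)
        return ("queued_unexpected", None)

    def on_irecv(self, source, tag):
        """Receive side: match an unexpected message, else queue as posted."""
        recv = Envelope(source, tag)
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return ("matched_unexpected", msg.payload)
        self.posted.append(recv)
        return ("queued_posted", None)
```

With this model, calling on_message_arrival(Envelope(1, 0, "data")) before
on_irecv(1, 0) puts the message on the unexpected queue, and the later
on_irecv call finds it there. Reversing the order puts the receive on the
posted queue instead.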

On 2/28/08, Krishna Chaitanya <kris.c1986 at gmail.com> wrote:
>         Thanks for that, Dave. It was quite helpful.
>         I just started looking into the non-blocking calls today. I am
> not sure what is happening in the call MPIR_Grequest_progress_poke.
> What exactly is the difference between a generalized request and a
> native request?
>         I am also keen on knowing how an unexpected message is handled
> by MPICH2, and I have understood what happens in the function
> MPIDI_CH3U_Recvq_FDU_or_AEP(). So, I had two ranks (ranks 0 and 1)
> executing MPI_Isend() and one rank (rank 2) executing the two matching
> MPI_Irecv() calls, with the hope that the message arriving from
> rank 1 would be unexpected from rank 2's standpoint. But, as it
> turns out, MPI_Wait() actually calls MPID_Progress_Wait(). This is a
> blocking call, and the MPI_Isend() that is being executed by rank 1 is
> blocked until rank 2 executes the corresponding MPI_Irecv().
>         My question is, how can I get an incoming message at rank 2 to
> get into the unexpected queue and wait there until the receiver scans
> through the queue, finds a match, and proceeds to transfer the
> data?
>         To make this happen, MPID_Progress_Wait must never get called
> and so the MPI_Wait must constantly. Is that correct?
>
> Thanks,
> Krishna Chaitanya K,
> Final Year B-TECH,
> Information Technology,
> National Institute of Technology, Karnataka,
> India
>
>
> On 2/20/08, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > Unfortunately, this state machine and the design goals behind it are
> > not very well documented.  The only real documentation that I know of
> > is a diagram of the FSM that I pieced together from reading the code
> > during debugging:
> > http://wiki.mcs.anl.gov/mpich2/index.php/Sock_conn_protocol
> >
> > It's not 100% complete, and it doesn't explain very much about the
> > meaning behind any of the states or the code itself.  However, if you
> > look at the code alongside this diagram, it should help you in trying
> > to make sense of it.
> >
> > Generally, states like LSEND and LRECV are the listen side of the
> > connection, while CSEND and CRECV are the connect (initiating) side
> > of the connection.
> >
> > -Dave
> >
> > On Feb 19, 2008, at 11:00 PM, Krishna Chaitanya wrote:
> >
> > > Hi,
> > > >        Just out of curiosity, though I am not trying to do anything
> > > > with the control signals that are exchanged in the progress
> > > > engine, I wish to know what exactly LSEND, LRECV and the
> > > > like are.
> > >
> > > Thanks,
> > > Krishna Chaitanya K
> > >
> > > On Feb 19, 2008 12:32 PM, Krishna Chaitanya <kris.c1986 at gmail.com>
> > > wrote:
> > > Hi Dave,
> > > >      Thanks for that. I was pretty much lost over the last couple of
> > > > days. Will give it a fresh try.
> > > >      About the AMD machine: I should be able to have access to it in
> > > > about 7-8 hours.
> > >
> > > Thanks,
> > > Krishna Chaitanya K
> > >
> > > On 2/19/08, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > > > responses inline
> > > >
> > > > On Feb 18, 2008, at 10:35 PM, Krishna Chaitanya wrote:
> > > > > Sorry for the delay.
> > > > > >Can you ping from one to the other
> > > > >           Yes, I was able to ssh into the other machine and try
> > > > > mpdcheck and the rest. Will try to figure out what the problem is.
> > > >
> > > > > Be sure that you actually perform a ping between the two hosts in
> > > > > question.  If you ssh'd in from a third host to both of them, then
> > > > > you don't have proof of proper routing between the two compute
> > > > > nodes.
> > > >
> > > > > > In the meantime, I have been trying to understand the progress
> > > > > > engine by tracing a standard blocking-mode send/recv program on
> > > > > > one machine (by using mpdboot -n 1). What exactly are the .i
> > > > > > files in the directory /mpid/common/sock/poll for?
> > > > > > I noticed that a function like "MPIDU_Sock_post_readv" is at:
> > > > > > 1) src/mpid/common/sock/iocp/sock.c, which includes functions like
> > > > > > "WSARecv", which is a function to receive data from a socket on
> > > > > > Windows. (I am working on a Linux platform.)
> > > > > > 2) /mpich-src/src/mpid/common/sock/poll/sock_post.i.
> > > > > >              Interestingly, I am not able to navigate through the
> > > > > > macros and functions in this file by using tags (why?). So, I can
> > > > > > only see that we are playing around with pointers to update the
> > > > > > pollinfo structure. Where is this structure defined? The .i file
> > > > > > does not include any .h file. I tried "grep" on the main dir to
> > > > > > locate the definition, but it didn't return anything useful.
> > > > > >              Can someone point me to a wiki article or any
> > > > > > documentation that gives some info on the .i files?
> > > >
> > > > > There are two implementations of the sock code: "iocp" is the Windows
> > > > > implementation and "poll" is the unix-style implementation.  Only one
> > > > > of the two directories will be used in any particular build.  In your
> > > > > case, the "poll" directory will be chosen.
> > > > >
> > > > > As for the *.i files, they confused me the first time that I saw
> > > > > them.  If you look at src/mpid/common/sock/poll/sock.c:215-222 you'll
> > > > > see that they are included via the C preprocessor.  I don't know the
> > > > > rationale for this approach as the code was written before I joined
> > > > > the project.  It is likely that your ctags program is not indexing
> > > > > these *.i files because they don't end in *.h or *.c.  You can
> > > > > probably convince it to index the *.i files as well with a
> > > > > configuration file or some command-line switches (which will vary
> > > > > among various ctags implementations).
> > > >
> > > > "struct pollinfo" is also defined in that same sock.c file.
> > > >
> > > > Hope that helps,
> > > > -Dave
> > > >
> > > > > Thanks,
> > > > > Krishna Chaitanya K
> > > > >
> > > > > On Feb 15, 2008 3:22 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > > > > > What evidence do you have that the two machines are able to see each
> > > > > > other on the network?  Can you ping from one to the other (and vice
> > > > > > versa)?  What is the output of the 'route' command on each of the
> > > > > > hosts?
> > > > >
> > > > > -Dave
> > > > >
> > > > > On Feb 14, 2008, at 10:30 PM, Krishna Chaitanya wrote:
> > > > >
> > > > > > Hi,
> > > > > > >           Turns out that the settings in the /etc/hosts file on the
> > > > > > > AMD machine were incorrect. So, mpdcheck -v -f mpd.hosts gives this:
> > > > > > >
> > > > > > > AMD machine (outwit):
> > > > > > > kc@outwit:~$ mpdcheck -v -f mpd.hosts
> > > > > > > obtaining hostname via gethostname and getfqdn
> > > > > > > gethostname gives  outwit
> > > > > > > getfqdn gives  outwit.nitk.ac.in
> > > > > > > checking out unqualified hostname; make sure is not "localhost", etc.
> > > > > > > checking out qualified hostname; make sure is not "localhost", etc.
> > > > > > > obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
> > > > > > > gethostbyname_ex:  ('outwit.nitk.ac.in', ['outwit'], ['172.16.54.54'])
> > > > > > > gethostbyname_ex:  ('outwit.nitk.ac.in', ['outwit'], ['172.16.54.54'])
> > > > > > > checking that IP addrs resolve to same host
> > > > > > > now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> > > > > > > checking gethostbyXXX for unqualified zeus
> > > > > > > gethostbyname_ex:  ('zeus', [], ['172.16.54.71'])
> > > > > > > checking gethostbyXXX for qualified zeus
> > > > > > > gethostbyname_ex:  ('zeus', [], ['172.16.54.71'])
> > > > > > >
> > > > > > >
> > > > > > > INTEL machine (zeus):
> > > > > > > [kris.c1986@zeus ~]$ mpdcheck -v -f mpd.hosts
> > > > > > > obtaining hostname via gethostname and getfqdn
> > > > > > > gethostname gives  zeus
> > > > > > > getfqdn gives  zeus.nitk.ac.in
> > > > > > > checking out unqualified hostname; make sure is not "localhost", etc.
> > > > > > > checking out qualified hostname; make sure is not "localhost", etc.
> > > > > > > obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
> > > > > > > gethostbyname_ex:  ('zeus.nitk.ac.in', ['zeus'], ['172.16.54.71'])
> > > > > > > gethostbyname_ex:  ('zeus.nitk.ac.in', ['zeus'], ['172.16.54.71'])
> > > > > > > checking that IP addrs resolve to same host
> > > > > > > now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> > > > > > > checking gethostbyXXX for unqualified outwit
> > > > > > > gethostbyname_ex:  ('outwit', [], ['172.16.54.54'])
> > > > > > > checking gethostbyXXX for qualified outwit
> > > > > > > gethostbyname_ex:  ('outwit', [], ['172.16.54.54'])
> > > > > >
> > > > > > >                Seems to be ok. But I still get this error when I
> > > > > > > try mpdcheck -c on the AMD comp:
> > > > > > > kc@outwit:~$ mpdcheck -c zeus 33737
> > > > > > > Traceback (most recent call last):
> > > > > > >   File "/home/kc/mpich-install/bin/mpdcheck", line 103, in <module>
> > > > > > >     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
> > > > > > >   File "<string>", line 1, in connect
> > > > > > > socket.error: (113, 'No route to host')
> > > > > >
> > > > > >
> > > > > > >            The two machines are able to see each other on the
> > > > > > > network. I can't explain why it complains that there is "No route to
> > > > > > > host".
> > > > > >
> > > > > > Krishna Chaitanya K
> > > > > >
> > > > > >
> > > > > > > On Thu, Feb 14, 2008 at 2:50 PM, Rajeev Thakur <thakur at mcs.anl.gov>
> > > > > > > wrote:
> > > > > > > The fact that the second test times out perhaps indicates that there
> > > > > > > might be a firewall on the AMD machine. See section A.3 of the
> > > > > > > installation guide.
> > > > > >
> > > > > > Rajeev
> > > > > >
> > > > > > From: Krishna Chaitanya [mailto:kris.c1986 at gmail.com]
> > > > > > Sent: Thursday, February 14, 2008 11:41 AM
> > > > > > To: Rajeev Thakur
> > > > > > Cc: mpich-discuss at mcs.anl.gov
> > > > > > Subject: Re: [MPICH] Problem setting up MPICH between a 32 bit
> > > > > > INTEL and a 32 bit AMD machine
> > > > > >
> > > > > > > So, what is the error trying to convey? Googling for it gave this.
> > > > > > > I have flushed the iptables rules on both machines and the firewalls
> > > > > > > are deactivated. Could you please elaborate on what kind of
> > > > > > > settings I need to look into?
> > > > > >
> > > > > > Thanks,
> > > > > > Krishna Chaitanya K
> > > > > >
> > > > > > On Thu, Feb 14, 2008 at 10:58 PM, Rajeev Thakur
> > > > > > <thakur at mcs.anl.gov> wrote:
> > > > > > > It should be possible. mpdcheck is a tool to diagnose whether the
> > > > > > > network configuration settings on the machines are ok or not, and
> > > > > > > whether a process on one machine can talk to a process on the
> > > > > > > other. It looks like the settings need to be fixed in some way.
> > > > > >
> > > > > > Rajeev
> > > > > >
> > > > > > > From: owner-mpich-discuss at mcs.anl.gov
> > > > > > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Krishna Chaitanya
> > > > > > Sent: Thursday, February 14, 2008 10:26 AM
> > > > > > To: mpich-discuss at mcs.anl.gov
> > > > > > Subject: [MPICH] Problem setting up MPICH between a 32 bit INTEL
> > > > > > and a 32 bit AMD machine
> > > > > >
> > > > > > Hi,
> > > > > > >         In one of the previous posts, you had replied saying that
> > > > > > > MPICH cannot be used between a 32 bit INTEL machine and a 64
> > > > > > > bit AMD machine. Is it possible to do so between an INTEL and an
> > > > > > > AMD machine, both of them being 32 bit processors?
> > > > > > >         Anyway, on trying mpdcheck -f mpd.hosts on the 32 bit AMD,
> > > > > > > I am getting the following error:
> > > > > > >    ipaddr via uqn (208.67.216.130) does not match via fqn (208.69.32.130)
> > > > > > >         And if I try mpdcheck -s on the AMD node and mpdcheck -c
> > > > > > > on the INTEL node, the client times out. The test message gets
> > > > > > > delivered with the client and server swapped.
> > > > > >
> > > > > > Thanks,
> > > > > > Krishna Chaitanya K
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > In the middle of difficulty, lies opportunity
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > In the middle of difficulty, lies opportunity
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > In the middle of difficulty, lies opportunity
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > In the middle of difficulty, lies opportunity
> > > >
> > > >
> > >
> > >
> > > --
> > > In the middle of difficulty, lies opportunity
> > >
> > >
> > >
> > > --
> > > In the middle of difficulty, lies opportunity
> >
> >
>
>
> --
> In the middle of difficulty, lies opportunity
>


-- 
In the middle of difficulty, lies opportunity



