[MPICH] MPICH2 does not work over Windows XP network

Ben Held ben.held at staarinc.com
Wed Nov 2 08:12:03 CST 2005


I am having trouble getting MPICH2 to work on my Windows network.

 

We have a very simple network arrangement.  We have several computers, all
running Windows XP Pro SP2.  All have the same username and passwords on
them.  No domains or anything like that.  All machines are part of a
workgroup called STAR.  The 2 machines I am working on are STAR8 (IP is
192.168.0.9) and MOBILE3 (IP is 192.168.0.15).  The Windows firewall is
turned off on both machines.  No other firewall software is running.  I can
successfully run network jobs with other MPI implementations such as MPICH
1.2.5, WMPI, and WMPI-II.  I can run parallel jobs on my local machine with
MPICH2 fine.

 

When I try to run an MPI job with MPICH2 from STAR8 by typing:

C:\Projects\V8MPICH2\ReleaseExes>mpiexec -n 2 -machinefile machfile.txt dfemr.exe

 

I get an error message:

 

abort: unable to connect to mobile3

 

At this same command line, I can successfully run "dfemr.exe", so I know
this is not a DLL problem.  I can also run "ping STAR8" and "ping MOBILE3"
successfully, so this is not a name-lookup problem: both machines can find
each other by name.  My machfile.txt file reads:

 

star8:1

mobile3:1

 

With regard to the test code (below), I built it, and when I run it from
STAR8 I see:

 

C:\Projects\V8MPICH2\MPICH2TestA\Debug>MPICH2TestA.exe mobile3

mobile3 = 192.168.0.15

 

and

 

C:\Projects\V8MPICH2\MPICH2TestA\Debug>MPICH2TestA.exe star8

star8 = 192.168.0.8

 

When I run it from mobile3, I see the exact same output.

 

I'm not sure what to do next.  From my point of view, it appears that MPICH2
has not been validated over a network on Windows.  I doubt this is true, but
it is about the only conclusion I can draw.
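
One further check that may help narrow this down: ping and the name lookup only show that the machines can find each other, not that a TCP connection between them can be opened, which is what "unable to connect" is about. The sketch below is a hypothetical helper (tcp_probe is my own name, not anything from MPICH2) that attempts a plain TCP connect to a host and port. It is written to compile on Windows and on POSIX systems, and it assumes IPv4 only.

```c
/* Hypothetical helper, not part of MPICH2: attempt a plain TCP connect. */
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L   /* expose getaddrinfo under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>             /* getaddrinfo; link with ws2_32.lib */
typedef SOCKET sock_t;
#define BAD_SOCK INVALID_SOCKET
#define close_sock closesocket
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
typedef int sock_t;
#define BAD_SOCK (-1)
#define close_sock close
#endif

/* Returns 0 if a TCP connection to host:port succeeds, -1 otherwise. */
static int tcp_probe(const char *host, const char *port)
{
    struct addrinfo hints, *res = NULL, *ai;
    int ok = -1;

    assert(host != NULL && port != NULL);
#ifdef _WIN32
    {
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 0), &wsa) != 0)
            return -1;
    }
#endif
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only (assumption) */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    if (getaddrinfo(host, port, &hints, &res) == 0)
    {
        /* Try each resolved address until one accepts the connection. */
        for (ai = res; ai != NULL && ok != 0; ai = ai->ai_next)
        {
            sock_t s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (s == BAD_SOCK)
                continue;
            if (connect(s, ai->ai_addr, (socklen_t)ai->ai_addrlen) == 0)
                ok = 0;
            close_sock(s);
        }
        freeaddrinfo(res);
    }
#ifdef _WIN32
    WSACleanup();
#endif
    return ok;
}
```

Called from a small main on STAR8 as tcp_probe("mobile3", port), a result of -1 would point at the network path or at nothing listening on that port, while 0 would point back at MPICH2 itself. Which port to probe is an assumption on my part: I believe the MPICH2 process manager (smpd) listens on 8676 by default, but check what your installation actually uses.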

 

 

 

Ben Held

Simulation Technology & Applied Research, Inc.
11520 N. Port Washington Rd., Suite 101B Mequon, WI 53092
P: +1 (262) 240-0291 x101
F: +1 (262) 240-0294
W: http://www.staarinc.com

 

 

Ben,

 

If you want to discuss your problem with others you can go here:

http://www-unix.mcs.anl.gov/mpi/mpich2/maillist.htm

 

I've attached code that looks up host names just like the MPICH2 code does.
Can you run the following code like this: "ghbn.exe mobile3" and "ghbn.exe
star8"?

 

If it returns an IP address then there is something wrong with MPICH2.  If
it returns an error then there is something wrong with your network setup.

 

-David Ashton

 

#include <winsock2.h>
#include <windows.h>
#include <stdio.h>
#include <string.h>

/* Link with ws2_32.lib. */

#define MSG_SIZE 1024

/* Convert a Windows error code into a one-line message, optionally
   prefixed with 'prepend'.  'msg' must hold at least MSG_SIZE bytes. */
static void translate_error(int error, char *msg, char *prepend)
{
    HLOCAL str = NULL;
    int num_bytes;

    num_bytes = FormatMessageA(
      FORMAT_MESSAGE_FROM_SYSTEM |
      FORMAT_MESSAGE_ALLOCATE_BUFFER,
      0,
      error,
      MAKELANGID( LANG_NEUTRAL, SUBLANG_DEFAULT ),
      (LPSTR) &str,
      0,0);
    if (num_bytes == 0)
    {
      /* No system text for this code: fall back to the prefix alone. */
      _snprintf(msg, MSG_SIZE, "%s", prepend != NULL ? prepend : "");
      msg[MSG_SIZE - 1] = '\0';
    }
    else
    {
      _snprintf(msg, MSG_SIZE, "%s%s", prepend != NULL ? prepend : "",
                (const char*)str);
      /* _snprintf does not terminate the string when it truncates. */
      msg[MSG_SIZE - 1] = '\0';
      LocalFree(str);
      /* Cut the message at the first line break. */
      strtok(msg, "\r\n");
    }
}

int main(int argc, char *argv[])
{
    struct hostent *lphost;
    struct sockaddr_in sockAddr;
    char host[100];
    char err_msg[MSG_SIZE];
    WSADATA wsaData;
    int error;
    int result = 0;

    if (argc < 2)
    {
      printf("usage: %s <hostname>\n", argv[0]);
      return -1;
    }
    /* Bounded copy: strcpy would overflow 'host' on a long argument. */
    strncpy(host, argv[1], sizeof(host) - 1);
    host[sizeof(host) - 1] = '\0';

    if ((error = WSAStartup(MAKEWORD(2, 0), &wsaData)) != 0)
    {
      printf("unable to initialize the sockets library\n");
      return -1;
    }

    memset(&sockAddr, 0, sizeof(sockAddr));
    sockAddr.sin_family = AF_INET;

    lphost = gethostbyname(host);
    if (lphost != NULL)
    {
      /* h_addr is the first entry of h_addr_list. */
      sockAddr.sin_addr.s_addr = ((struct in_addr *)lphost->h_addr)->s_addr;
      printf("%s = %s\n", host, inet_ntoa(sockAddr.sin_addr));
    }
    else
    {
      error = WSAGetLastError();
      translate_error(error, err_msg, NULL);
      printf("gethostbyname failed with error %d: %s\n", error, err_msg);
      result = -1;
    }

    WSACleanup();
    return result;
}
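
The program above prints only h_addr, the first address the lookup returns. In the report above, STAR8 is described as 192.168.0.9 while the lookup printed 192.168.0.8, so it may be worth listing every address a name resolves to. The following is a minimal sketch of that, not what MPICH2 does internally: resolve_all and ADDR_TEXT_LEN are hypothetical names, it uses getaddrinfo rather than gethostbyname, and it assumes IPv4 only.

```c
/* Hypothetical helper, not part of MPICH2: list all IPv4 addresses of a host. */
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L   /* expose getaddrinfo under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>             /* getaddrinfo; link with ws2_32.lib */
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#endif

#define ADDR_TEXT_LEN 46          /* room for a numeric address string */

/* Writes up to 'max' addresses of 'host' into 'out' as dotted-decimal text.
   Returns the number written, or -1 if the lookup itself fails. */
static int resolve_all(const char *host, char out[][ADDR_TEXT_LEN], int max)
{
    struct addrinfo hints, *res = NULL, *ai;
    int n = 0;

    assert(host != NULL && out != NULL);
#ifdef _WIN32
    {
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 0), &wsa) != 0)
            return -1;
    }
#endif
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only (assumption) */
    hints.ai_socktype = SOCK_STREAM;  /* one entry per address, not per protocol */

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        n = -1;
    else
    {
        for (ai = res; ai != NULL && n < max; ai = ai->ai_next)
        {
            /* NI_NUMERICHOST formats the address without a reverse lookup. */
            if (getnameinfo(ai->ai_addr, (socklen_t)ai->ai_addrlen,
                            out[n], ADDR_TEXT_LEN, NULL, 0,
                            NI_NUMERICHOST) == 0)
                n++;
        }
        freeaddrinfo(res);
    }
#ifdef _WIN32
    WSACleanup();
#endif
    return n;
}
```

Calling resolve_all("star8", out, 8) on each machine and printing the entries of out would show whether the name maps to more than one address, or to a different address than the one configured on the adapter.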

 

 
