[mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers

Jayesh Krishna jayesh at mcs.anl.gov
Tue Jan 13 13:30:51 CST 2009


Hi,
# Do you get any error message related to mapping network drives when you
run your job?
 Please send us the command and output of your MPI job (copy and paste
your complete mpiexec command and its output into your email).
 
# Can you run a command like the following (note that I have removed the
"-noprompt" option)?
 
        mpiexec -map x:\\computer1\MPI -wdir x:\ -n 8 -machinefile testallnamesmf.txt hostname
 
  with the following contents in the machinefile (testallnamesmf.txt lists
all the computer/host names; the "hostname:1" syntax specifies that only 1
MPI process is launched on each host):
 
computer1:1 -ifhn 192.168.1.1
computer2:1 -ifhn 192.168.1.2
...
computer8:1 -ifhn 192.168.1.8
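
If it is easier to generate the machinefile than to type it, a minimal Python sketch along these lines could produce the "hostname:1 -ifhn address" lines. The helper name machinefile_lines is hypothetical, and the host list below contains only the three entries spelled out in this thread; the remaining hosts would have to be filled in by hand.

```python
def machinefile_lines(hosts, procs_per_host=1):
    """Return machinefile lines of the form 'host:N -ifhn ip'.

    hosts is a list of (hostname, interface_ip) pairs.
    """
    return [f"{name}:{procs_per_host} -ifhn {ip}" for name, ip in hosts]


# Only the entries written out in this thread; add the other hosts yourself.
hosts = [
    ("computer1", "192.168.1.1"),
    ("computer2", "192.168.1.2"),
    ("computer8", "192.168.1.8"),
]

lines = machinefile_lines(hosts)
print("\n".join(lines))
```

Writing the joined lines to testallnamesmf.txt gives a file in the format shown above.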
 
# Does your program fail consistently for certain computers? Try running
a simple job (mpiexec -map x:\\computer1\MPI -wdir x:\ -n 1 -machinefile
testmf.txt hostname) with only 1 computer/host specified in the machinefile
at a time.
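
The one-host-at-a-time check could be scripted roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, testmf.txt is assumed to use the same "hostname:1 -ifhn address" format as the full machinefile, and it is assumed (not verified) that mpiexec returns a nonzero exit code when the launch fails, so the printed output should still be read by eye.

```python
import subprocess


def single_host_command(machinefile="testmf.txt"):
    """Build the argument list for the one-host test command quoted above."""
    return [
        "mpiexec",
        "-map", r"x:\\computer1\MPI",  # raw string: passed exactly as typed
        "-wdir", "x:\\",               # a raw string cannot end in a backslash
        "-n", "1",
        "-machinefile", machinefile,
        "hostname",
    ]


def check_hosts(hosts, machinefile="testmf.txt", run=subprocess.run):
    """Run the test once per host and return {hostname: exit code}.

    hosts is a list of (hostname, interface_ip) pairs. The machinefile is
    rewritten before each run so that it names exactly one host.
    """
    results = {}
    for name, ip in hosts:
        with open(machinefile, "w") as f:
            f.write(f"{name}:1 -ifhn {ip}\n")
        proc = run(single_host_command(machinefile))
        results[name] = proc.returncode
    return results


# Example call (runs mpiexec, so it is left commented out here):
# print(check_hosts([("computer1", "192.168.1.1"), ("computer2", "192.168.1.2")]))
```

Hosts that repeatedly show a nonzero exit code, or the CreateProcess error in their output, are the ones to look at first.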
 
# Try removing "-noprompt" from the mpiexec command and see if mpiexec
prompts you for anything (password, other input, etc.).
 
Regards,
Jayesh

  _____  

From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Tina Tina
Sent: Tuesday, January 13, 2009 12:01 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers


Dear Community!

I am using the latest version of MPICH2 for Windows (the problem also
occurs on 1.0.8). I have 8 computers connected over a gigabit switch. I have
written a program that uses MPI for parallelization. When I run the program
on one or two computers, everything works OK (let's say most of the time).
When I run it on 4 computers, sometimes it works and sometimes it does
not. The error that I get is:
launch failed: CreateProcess(X:\mpi_program.exe) on 'computerX' failed,
error 3 - The system cannot find the path specified.

Most of the time I get this error for one computer in the machine list, but
it can also happen for 2 or more computers.

If I increase the number of computers above 4, I get this error almost
every time. With 6 or more it happens every time. It looks like the higher
the number, the worse it gets. I would really like to make this work. Has
anybody had a similar experience, and if so, what was the solution?

It looks like the computer tries to start the program before the mapped
drive is operational. Is there any way to increase this delay? Or are
there any other settings that need to be set?

There are some other errors that I occasionally get, but this is the most
important one (for now).

Systems:
Windows XP SP3 (on all computers)
Latest MPICH2 installed
Connection: gigabit NICs (local network) over a switch

Example of run command:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI -wdir X:\ -n 4 -machinefile "C:\Program Files\MPICH2\bin\machines.txt" -noprompt X:\mpi_program.exe

\\computer1\MPI is a shared folder on computer1 from which the command is
run

machines.txt consists of the following lines:
computer1 -ifhn 192.168.1.1
computer2 -ifhn 192.168.1.2
...
computer8 -ifhn 192.168.1.8

These are the NICs I would like MPI to use for communication. The order of
computers in machines.txt is irrelevant (the error occurs with every
ordering).

Regards
