[mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers
Jayesh Krishna
jayesh at mcs.anl.gov
Tue Jan 13 15:30:39 CST 2009
Hi,
From the error codes in the hostname tests, it looks like Computer1 (where
the shared network folder resides) is unable to handle the number of
connections to it.
############ Error code desc from MS ############
ERROR_REQ_NOT_ACCEP (71 0x47) : No more connections can be made to this
remote computer at this time because there are already as many connections
as the computer can accept.
############ Error code desc from MS ############
We should retry (but we do not) in this case.
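A minimal sketch of what such a retry could look like, for illustration only.
It is not MPICH2 code: map_drive is a hypothetical callable that returns a
Windows error code (0 on success), and the retry count and delays are
assumed values.

```python
import time

ERROR_REQ_NOT_ACCEP = 71  # Windows error code quoted above


def map_with_retry(map_drive, retries=5, delay=0.5, sleep=time.sleep):
    """Call map_drive() until it stops returning error 71 or retries run out.

    map_drive is a hypothetical callable returning a Windows error code
    (0 on success). Only error 71 is retried; any other code is returned
    immediately.
    """
    code = map_drive()
    attempt = 0
    while code == ERROR_REQ_NOT_ACCEP and attempt < retries:
        sleep(delay * (2 ** attempt))  # exponential backoff between attempts
        attempt += 1
        code = map_drive()
    return code
```

Backing off between attempts gives the file server time to release
connections before the next mapping attempt.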
Can you verify that the existing network mapped drive connections are
cleaned up on all the machines (type "net use" in a command prompt on each
machine to view the existing mapped network connections)?
Regards,
Jayesh
_____
From: Tina Tina [mailto:gucigu at gmail.com]
Sent: Tuesday, January 13, 2009 3:21 PM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers
Dear Community!
I started testing with the example cpi.exe program (so the problem is not
in my program). I ran the following command for each computer X=(1..8) and
everything worked OK:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir
X:\CPI\ -hosts 1 ComputerX -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" X:\CPI\cpi.exe
Then I ran the following command:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir
X:\CPI\ -n X -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
X:\CPI\cpi.exe
Note: I also changed the machines.txt file as you suggested (adding :1).
The result was the following: for X up to 5 it worked OK (I did only one
test run), but when I tested with X=6 (i.e. on 6 computers), I got the
following error:
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error
3 - The system cannot find the path specified.
On next run with X=6:
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error
3 - The system cannot find the path specified.
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer6' failed, error
3 - The system cannot find the path specified.
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer3' failed, error
3 - The system cannot find the path specified.
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer5' failed, error
3 - The system cannot find the path specified.
launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer4' failed, error
3 - The system cannot find the path specified.
On next run with X=6:
I got the same error as on the first run.
These errors kept repeating. Most of the time the error involved only one
computer, and in most cases it was the second computer in the machinefile
list, but not necessarily. When there was more than one launch failed error
(as in the second case), the order could also be different. In 20 tries,
not one was successful.
Then, just for kicks, I tried with X=8. I got the same errors, with a random
number of launch failed errors and a more or less random ComputerX
reporting them.
But every now and then I got one of the following errors (after the list of
launch failed errors):
1)
unable to post a write for the next command,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_writev(1768): An established connection was aborted by the
software in your host machine. (errno 10053)
unable to post a write of the close command to tear down the job tree as
part of the abort process.
unable to post an abort command.
2)
unable to post a read for the next command header,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_readv(1656): An existing connection was forcibly closed by
the remote host. (errno 10054)
unable to post a read for the next command on left context.
3)
unable to read the cmd header on the left context, socket connection
closed.
Hope this info helps
Regards
P.S.: I tried a couple of runs with X=5 and got mixed results: on some
runs it worked OK, on some it did not. Basically the same as with my
program. So I would still say that as the number of computers increases,
the problem gets worse.
P.P.S.: Almost forgot to test the hostname. Here are the results of two
runs.
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
computer4
computer1
computer8
computer2
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
computer7
computer5
computer3
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
computer6
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
*********** Warning ************
Unable to map \\computer1\MPI$. (error 71)
*********** Warning ************
computer3
computer7
computer5
computer1
computer4
computer8
computer2
computer6
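To compare runs like the two above, the output can be tallied with a small
script. This is only a sketch: the function name summarize is made up for
illustration, and the pattern assumes the warning text appears exactly as
printed above.

```python
import re

# Matches lines such as: Unable to map \\computer1\MPI$. (error 71)
MAP_WARNING = re.compile(r"Unable to map (\S+)\. \(error (\d+)\)")


def summarize(output):
    """Return (hostnames, warnings) from the hostname-test output."""
    hosts, warnings = [], []
    for line in output.splitlines():
        line = line.strip()
        match = MAP_WARNING.search(line)
        if match:
            # Record the share name and the numeric error code.
            warnings.append((match.group(1), int(match.group(2))))
        elif line and not line.startswith("*"):
            # Anything that is not a warning or a banner line is a hostname.
            hosts.append(line)
    return hosts, warnings
```

Applied to the two runs above, this would report 4 mapping warnings in the
first run and 2 in the second, with all 8 hostnames printed both times.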
2009/1/13 Jayesh Krishna <jayesh at mcs.anl.gov>
Hi,
# Do you get any error message related to mapping network drives when you
ran your job ?
Please provide us with the command+output of your MPI job (Copy-paste
your complete mpiexec command and its output in your email).
# Can you run a command like (Note that I have removed "-noprompt"
option),
mpiexec -map x:\\computer1\MPI -wdir x:\ -n 8 -machinefile
testallnamesmf.txt hostname
with the following contents in the machinefile (testallnamesmf.txt -
contains all the computer/host names - Note that I specify that only 1 MPI
process be launched on each host using "hostname:1" syntax),
computer1:1 -ifhn 192.168.1.1
computer2:1 -ifhn 192.168.1.2
...
computer8:1 -ifhn 192.168.1.8
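A machinefile in this form can also be generated rather than typed by hand.
The sketch below is illustrative only: machinefile_lines is a hypothetical
helper, and the host list contains just the two entries spelled out above;
the remaining hosts would have to be supplied.

```python
def machinefile_lines(hosts, procs_per_host=1):
    """Build machinefile lines in the 'hostname:N -ifhn address' form."""
    return [f"{name}:{procs_per_host} -ifhn {addr}" for name, addr in hosts]


# Only the entries shown explicitly above; extend the list as needed.
hosts = [("computer1", "192.168.1.1"), ("computer2", "192.168.1.2")]
print("\n".join(machinefile_lines(hosts)))
```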
# Does your program fail consistently for certain computers ? Try running
a simple job (mpiexec -map x:\\computer1\MPI -wdir x:\ -n 1 -machinefile
testmf.txt hostname), specifying only 1 computer/host at a time.
# Try removing "-noprompt" from the mpiexec command and see if mpiexec
prompts you for anything (password, inputs etc).
Regards,
Jayesh
_____
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Tina Tina
Sent: Tuesday, January 13, 2009 12:01 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers
Dear Community!
I am using the latest version of MPICH2 for Windows (the problem also
occurs on 1.0.8). I have 8 computers connected over a gigabit switch. I have
written a program that uses MPI for parallelization. When I run the program
on one or two computers, everything works OK (let's say most of the time).
When I run it on 4 computers, sometimes it works and sometimes it does
not. The error that I get is:
launch failed: CreateProcess(X:\mpi_program.exe) on 'computerX' failed,
error 3 - The system cannot find the path specified.
Most times I get this error for one computer in the machine list, but it can
also happen for 2 or more computers.
If I increase the number of computers above 4, I get this error almost every
time. With 6 or more it happens every time. It looks like the higher the
number, the worse it gets. I would really like to make this work. Has
anybody had a similar experience, and what was the solution?
It looks like the computer tries to start the program before the mapped
drive is operational. Is there any way to increase this delay?
Or are there any other settings that need to be set?
There are some other errors that I occasionally get, but this is the most
important one (for now).
Systems:
Windows XP SP3 (on all computers)
Installed latest MPICH2
Connection: gigabit NICs (local network) over a switch
Example of run command: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map
X:\\computer1\MPI -wdir X:\ -n 4 -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" -noprompt X:\mpi_program.exe
\\computer1\MPI is a shared folder on computer1 from which the command is
run
machines.txt consists of following lines:
computer1 -ifhn 192.168.1.1
computer2 -ifhn 192.168.1.2
...
computer8 -ifhn 192.168.1.8
These are the NICs I would like MPI to use for communication. The
order of computers in machines.txt is irrelevant (the problem occurs with
every combination).
Regards