[MPICH] MPICH1.2.4 _Condor Problem
Natarajan, Senthil
senthil at pitt.edu
Wed Mar 7 14:13:25 CST 2007
Hi,
I am submitting MPI jobs built with MPICH 1.2.4 through Condor.
I have set up a separate user (something like "condor-user") to run
Condor jobs on all the dedicated nodes. I created the certificates and
copied them to all the nodes.
So the user ("condor-user") can ssh without a password between all the
nodes, and from each node to itself.
But the job fails, complaining that the connection to the same machine
was refused. That is, the job runs on Machine A but cannot connect to
Machine A.
Here is the error from one of the nodes:
connect to address xxx.xx.xxx.xx: Connection refused
connect to address xxx.xx.xxx.xx: Connection refused
trying normal rsh (/usr/bin/rsh)
MachineA: Connection refused
p0_20339: p4_error: Timeout in making connection to remote process on
MachineA: 0
p0_20339: (301.989178) net_send: could not write to fd=4, errno = 32
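
The log shows a fallback to /usr/bin/rsh followed by "Connection refused", so one thing worth checking is whether anything accepts connections on the remote-shell ports of the node itself. Below is a minimal Python sketch of such a check. The function names are made up for illustration, and it assumes the daemons listen on their standard ports (22 for sshd, 514 for rsh), which may not match the setup on these nodes.

```python
import socket


def port_accepts(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_remote_shell_ports(host):
    """Probe the standard ssh and rsh ports on one host.

    Assumption: sshd is on 22 and the rsh ("shell") service is on 514.
    """
    ports = (("ssh", 22), ("rsh", 514))
    return {name: port_accepts(host, port) for name, port in ports}


# Example (hypothetical host name), run as the Condor job user on Machine A:
#   print(check_remote_shell_ports("MachineA"))
```

If the rsh entry comes back False while ssh is True, that would be consistent with the job trying rsh rather than ssh, although a firewall rule that rejects the connection would look the same from this test.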
Which port do MPI jobs (MPICH 1.2.4) use by default, so that I can set
up firewall rules? I even tried setting the port range like this:
MPICH_PORT_RANGE=50001:59999
and allowed those ports in the firewall rules, but I still get the
connection refused problem.
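
To test the firewall rule independently of MPICH, one option is to open a listener on a port inside that range on one node and try to connect to it from another. The sketch below parses a "min:max" value like the one above and binds to the first free port in the range. The helper names are hypothetical, and it says nothing about whether MPICH 1.2.4 itself honors MPICH_PORT_RANGE.

```python
import socket


def parse_port_range(value):
    """Parse a 'min:max' string such as '50001:59999' into (min, max)."""
    lo_text, sep, hi_text = value.partition(":")
    if not sep:
        raise ValueError("expected 'min:max', got %r" % value)
    lo, hi = int(lo_text), int(hi_text)
    if not (0 < lo <= hi <= 65535):
        raise ValueError("invalid port range %r" % value)
    return lo, hi


def listen_in_range(lo, hi, host=""):
    """Bind a listening TCP socket to the first free port in [lo, hi].

    Returns (socket, port). The caller is responsible for closing the socket.
    """
    for port in range(lo, hi + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port already in use or not permitted; try the next one.
            sock.close()
            continue
        sock.listen(1)
        return sock, port
    raise OSError("no free port between %d and %d" % (lo, hi))


# Example: on Machine A
#   lo, hi = parse_port_range("50001:59999")
#   sock, port = listen_in_range(lo, hi)
# then connect to MachineA:<port> from another node to see if the rule lets it through.
```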
Could you please let me know what might be the problem?
Thanks,
Senthil