[MPICH] MPICH1.2.4 _Condor Problem

Rajeev Thakur thakur at mcs.anl.gov
Wed Mar 7 15:37:21 CST 2007


MPICH doesn't use a fixed port. It uses ports assigned by the machine's IP
stack, which are in the ephemeral port range. See
http://www.ncftpd.com/ncftpd/doc/misc/ephemeral_ports.html
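
To see what "assigned by the machine's IP stack" means in practice, here is a minimal Python sketch. It is only an illustration and not MPICH code. The helper name os_assigned_port is made up for this example. Binding a socket to port 0 asks the operating system to pick a free port, and the number it returns normally falls in the ephemeral range configured on that host.

```python
import socket


def os_assigned_port() -> int:
    """Bind to port 0 so the OS picks a free port, then return that port.

    Hypothetical helper for illustration only; MPICH does this internally
    in its own code, not through this function.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 means "let the IP stack choose a port for me".
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # The value differs from run to run and from host to host.
    print("OS-assigned port:", os_assigned_port())
```

Running it a few times on a node gives a rough idea of which port range the firewall would have to allow between nodes.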
 
Rajeev


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Natarajan, Senthil
Sent: Wednesday, March 07, 2007 2:13 PM
To: Condor-Users Mail List; mpich-discuss at mcs.anl.gov
Cc: Greg Thain
Subject: [MPICH] MPICH1.2.4 _Condor Problem



Hi,

I am submitting MPI jobs using MPICH1.2.4 through Condor.

I have set up a separate user (something like "condor-user") to run Condor
jobs on all the dedicated nodes. I created the certificates and copied them
to all the nodes.

So the user ("condor-user") can ssh without a password between all the
nodes, and to its own node.

 

But the job fails, complaining that the connection to the same machine was
refused. That is, the job runs on Machine A but cannot connect to Machine A.

 

Here is the error from one of the nodes.

 

connect to address xxx.xx.xxx.xx: Connection refused
connect to address xxx.xx.xxx.xx: Connection refused
trying normal rsh (/usr/bin/rsh)
MachineA: Connection refused

p0_20339:  p4_error: Timeout in making connection to remote process on MachineA: 0
p0_20339: (301.989178) net_send: could not write to fd=4, errno = 32

 

Which port do MPI jobs (MPICH1.2.4) use by default, so that I can set up
firewall rules? I even tried to set the port range like this:

MPICH_PORT_RANGE=50001:59999

And I allowed the above ports in the firewall rule. But I still get the
connection refused problem.
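
One way to narrow down a "Connection refused" like this is to test from the node itself whether a given TCP port accepts connections. The Python sketch below is only a diagnostic aid under stated assumptions. The helper names can_connect and parse_port_range are made up for this example. The ports probed at the end (22 for ssh, 514 for rsh) are the standard service ports and are a guess at what is relevant here, based on the "trying normal rsh" line in the log. The parser only handles the min:max syntax shown above and does not assume that MPICH1.2.4 honors that variable.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper. A refused connection, a timeout, or a name
    lookup failure all count as "cannot connect".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def parse_port_range(spec: str) -> range:
    """Parse a "min:max" string such as "50001:59999" into an inclusive range."""
    low, high = (int(part) for part in spec.split(":"))
    return range(low, high + 1)


if __name__ == "__main__":
    ports = parse_port_range("50001:59999")
    print("range covers", len(ports), "ports, from", ports[0], "to", ports[-1])
    # Assumption: ssh (22) and rsh (514) are the services worth checking.
    for port in (22, 514):
        print("localhost port", port, "reachable:", can_connect("localhost", port))
```

If the rsh port is unreachable while the ssh port is reachable, that would suggest the job is trying rsh rather than ssh to start the remote processes, which is separate from the port range used by the MPI processes themselves.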

 

Could you please let me know what might be the problem?

 

Thanks,

Senthil

 
