[MPICH] mpiexec crash using MPICH_PORT_RANGE

Michele Trenti trenti at stsci.edu
Fri Jul 14 16:26:17 CDT 2006


Dear all,

I am setting up MPICH2 for the first time on a small Linux cluster of
Sun Opteron 64-bit (dual-CPU) nodes. The admin enforces a strict security
policy, so I need to limit the port range used as much as possible. From
previous posts on this list by Martin Schwinzerl (2006/07/06 and
2006/07/05) and Ralph Butler (2006/07/06), I see that this can be done
using MPICH_PORT_RANGE.

I managed to do this using mpich2-1.0.4-rc1, as suggested by Ralph
Butler. I set up a port range with 10 ports and tried rings of up to 5
nodes. Everything runs smoothly when the number of processes is
smaller than or equal to the number of hosts. For a higher number of
processes, mpiexec crashes (see below).

My questions are:

(1) Is the crash caused by a port range that is too small?

(2) If so, what is the minimum number of ports that must be defined in
MPICH_PORT_RANGE for a cluster of N nodes (each node being a dual-CPU
unit) in order to have a properly working MPI environment?

Thanks a lot for your help,

Michele Trenti

------------------------------------------------
example:

udf2> mpdboot -n 3 -f mpd.host

udf2> mpdtrace
udf2
udf4
udf3

udf3> echo $PORT_RANGE
47530:47540

udf3> echo $MPICH_PORT_RANGE
47530:47540

udf3> echo $MPIEXEC_PORT_RANGE
47530:47540

udf3> mpiexec -n 3 ./cpi
Process 0 of 3 is on udf3.stsci.edu
Process 1 of 3 is on udf2.stsci.edu
Process 2 of 3 is on udf4.stsci.edu
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.002473

udf3> mpiexec -n 4 ./cpi
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(225)........: Initialization failed
MPID_Init(81)................: channel initialization failed
MPIDI_CH3_Init(35)...........:
MPIDI_CH3I_Progress_init(305):
MPIDU_Sock_listen(399).......: unable to bind socket to port (port=5856760,errno=98:(strerror() not found))
rank 0 in job 2  udf3.stsci.edu_47530   caused collective abort of all ranks
   exit status of rank 0: return code 13
udf3>
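
On Linux, errno 98 is EADDRINUSE ("Address already in use"), so the bind
failure above suggests that a port in the range was already taken on that
host. Below is a minimal diagnostic sketch, not part of MPICH2, for
checking which ports of a range can still be bound on the local node. It
assumes Python 3 is available. The helper names parse_port_range and
probe_ports are hypothetical, and treating the upper bound of the range
as inclusive is an assumption that may not match how MPICH2 interprets
MPICH_PORT_RANGE.

```python
import errno
import os
import socket


def parse_port_range(text):
    """Parse a 'low:high' string into a (low, high) tuple of ints."""
    low_str, high_str = text.split(":")
    low, high = int(low_str), int(high_str)
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range: %r" % text)
    return low, high


def probe_ports(low, high, host=""):
    """Return (free, busy) lists of TCP ports in [low, high] on this host.

    Assumption: the upper bound is inclusive. A port counts as busy when
    bind() fails with EADDRINUSE (errno 98 on Linux).
    """
    free, busy = [], []
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            free.append(port)
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                busy.append(port)
            else:
                raise  # e.g. permission problems are reported as-is
        finally:
            sock.close()
    return free, busy


if __name__ == "__main__":
    # Fall back to the range used in this post if the variable is unset.
    text = os.environ.get("MPICH_PORT_RANGE", "47530:47540")
    low, high = parse_port_range(text)
    free, busy = probe_ports(low, high)
    print("free:", free)
    print("busy:", busy)
```

Running this on each node while the mpd ring is up, and again right after
a failed mpiexec, would show how many ports of the range are already
occupied by the daemons before any MPI process starts.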

----------------------------------------
A similar test for a larger ring, same PORT_RANGE:

udf6> mpdboot -n 5 -f mpd.host

udf6> mpdtrace
udf6
udf4
udf3
udf5
udf2

udf6> mpiexec -l -n 5 hostname
0: udf6.stsci.edu
1: udf4.stsci.edu
2: udf3.stsci.edu
4: udf5.stsci.edu
3: udf2.stsci.edu
udf6> mpiexec -l -n 6 hostname
mpiexec_udf6.stsci.edu (mpiexec 443): mpiexec: from man, invalid msg=:{}:

-----------------------------------------

System information :
-----------------------------

* mpich2-1.0.4-rc1, compiled with gcc version 3.4.5; OS: Red Hat
Enterprise Linux WS R4, kernel 2.6.9-34.0.2.ELsmp; python2.4

* network of up to 5 dual-CPU Sun Opteron 64-bit nodes

* firewall set to allow ssh/sshd from all IPs, plus all communications
among cluster members on the given PORT_RANGE

---------------------------------------------------------------------

Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive                       Phone: +1 410 338 4987
Baltimore MD 21218 U.S.                       Fax: +1 410 338 4767

"For every complex natural phenomenon there is a simple, elegant,
compelling, wrong explanation."
                                             Thomas Gold
