[MPICH] mpiexec crash using MPICH_PORT_RANGE
Michele Trenti
trenti at stsci.edu
Fri Jul 14 16:26:17 CDT 2006
Dear all,
I am setting up MPICH2 for the first time on a small Linux cluster of
dual-CPU, 64-bit Sun Opteron nodes, where the admin has a strict security
policy, so I need to limit the port range used as much as possible. From previous
posts in this list by Martin Schwinzerl (2006/07/06 and 2006/07/05) and
Ralph Butler (2006/07/06) I see that the job can be done using
MPICH_PORT_RANGE.
I managed to do this using mpich2-1.0.4-rc1 as suggested by Ralph
Butler. I set up a port range with 10 ports and tried rings of up to 5
nodes. Everything runs smoothly when the number of processes is
smaller than or equal to the number of hosts. With more processes than
hosts, mpiexec crashes (see below).
My questions are:
(1) Is the crash caused by a port range that is too small?
(2) If so, what is the minimum number of ports that must be defined in
MPICH_PORT_RANGE for a cluster of N nodes (each node being a dual-CPU
unit) in order to have a properly working MPI environment?
Thanks a lot for your help,
Michele Trenti
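
One way to approach question (1) is to check, on each host, how many ports in the configured range can still be bound. The following is a minimal sketch, not part of MPICH2: the helper names `parse_port_range` and `free_ports` are hypothetical, and treating the "low:high" value as an inclusive range is an assumption. A probe like this only reports the state at the moment it runs, so it is best run on a node while a job is active.

```python
import os
import socket


def parse_port_range(value):
    """Parse a 'low:high' string such as '47530:47540' into two ints."""
    low_text, high_text = value.split(":")
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def free_ports(low, high, host=""):
    """Return the ports in [low, high] that can currently be bound on this host."""
    available = []
    for port in range(low, high + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use, or binding is not permitted
            available.append(port)
    return available


if __name__ == "__main__":
    # Fall back to the range from the example below if the variable is unset.
    low, high = parse_port_range(os.environ.get("MPICH_PORT_RANGE", "47530:47540"))
    ports = free_ports(low, high)
    print(f"{len(ports)} of {high - low + 1} ports free: {ports}")
```

If the list of free ports is empty on a node at the time a process fails to start, the range is likely too small for that node; if ports are still free, the cause is probably elsewhere.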
------------------------------------------------
example:
udf2> mpdboot -n 3 -f mpd.host
udf2> mpdtrace
udf2
udf4
udf3
udf3> echo $PORT_RANGE
47530:47540
udf3> echo $MPICH_PORT_RANGE
47530:47540
udf3> echo $MPIEXEC_PORT_RANGE
47530:47540
udf3> mpiexec -n 3 ./cpi
Process 0 of 3 is on udf3.stsci.edu
Process 1 of 3 is on udf2.stsci.edu
Process 2 of 3 is on udf4.stsci.edu
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.002473
udf3> mpiexec -n 4 ./cpi
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(225)........: Initialization failed
MPID_Init(81)................: channel initialization failed
MPIDI_CH3_Init(35)...........:
MPIDI_CH3I_Progress_init(305):
MPIDU_Sock_listen(399).......: unable to bind socket to port (port=5856760,errno=98:(strerror() not found))
rank 0 in job 2 udf3.stsci.edu_47530 caused collective abort of all ranks
exit status of rank 0: return code 13
udf3>
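
The error stack above reports errno=98 without a message. On Linux, errno 98 is EADDRINUSE ("Address already in use"); note also that the reported port value, 5856760, is larger than 65535, the highest valid TCP port, so it may not reflect the port that was actually tried. The sketch below is an illustration in Python rather than MPICH2 code: it decodes the errno value and reproduces the same class of failure by binding two sockets to one port. The errno numbering is platform specific, so the comparison uses the symbolic constant.

```python
import errno
import os
import socket

# On Linux this looks up errno 98; other platforms number EADDRINUSE differently.
print(errno.errorcode.get(98), "-", os.strerror(98))

# Hold one listening socket on a port chosen by the kernel.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# A second bind to the same address and port is refused.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
bind_errno = None
try:
    second.bind(("127.0.0.1", port))
except OSError as exc:
    bind_errno = exc.errno
    print(f"bind to port {port} failed: errno={exc.errno} ({exc.strerror})")
finally:
    second.close()
    first.close()
```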
----------------------------------------
A similar test for a larger ring, same PORT_RANGE:
udf6> mpdboot -n 5 -f mpd.host
udf6> mpdtrace
udf6
udf4
udf3
udf5
udf2
udf6> mpiexec -l -n 5 hostname
0: udf6.stsci.edu
1: udf4.stsci.edu
2: udf3.stsci.edu
4: udf5.stsci.edu
3: udf2.stsci.edu
udf6> mpiexec -l -n 6 hostname
mpiexec_udf6.stsci.edu (mpiexec 443): mpiexec: from man, invalid msg=:{}:
-----------------------------------------
System information :
-----------------------------
* mpich2-1.0.4-rc1, compiled with gcc version 3.4.5; OS: Red Hat
Enterprise Linux WS R4, kernel 2.6.9-34.0.2.ELsmp; python2.4
* network of up to 5 dual-CPU, 64-bit Sun Opteron nodes
* firewall set to allow ssh/sshd from all IPs, plus all communication
among cluster members on the given PORT_RANGE
---------------------------------------------------------------------
Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive Phone: +1 410 338 4987
Baltimore MD 21218 U.S. Fax: +1 410 338 4767
"For every complex natural phenomenon there is a simple, elegant,
compelling, wrong explanation."
Thomas Gold