[MPICH] mpiexec crash using MPICH_PORT_RANGE
Ralph Butler
rbutler at mtsu.edu
Fri Jul 14 21:37:32 CDT 2006
Hi Michele:
The port range is a relatively new concept and we have not spent
much time looking into these kinds of minimums. However, for starters,
I think you can probably estimate a minimum of 6 ports per process
running on any given host just for the mpd, manager, etc. running
there. This does not include the set of ports tied up by the MPI
code invoked by application processes.
So, if you are running a job that causes 2 processes to land on a
single host, you probably want the range to be at least 12 wide
even for non-MPI programs, and perhaps more for MPI programs. Some
experimentation can help to pin it down a bit more for specific apps.
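
As a rough illustration of the arithmetic above, here is a minimal Python
sketch. The function names are made up for this example, and two things are
assumptions rather than documented mpd behavior: that processes are spread
evenly over the hosts (so the busiest host gets the ceiling of
processes/hosts), and that a range written LOW:HIGH includes both endpoints.
The figure of 6 ports per process is only the estimate given above and does
not cover ports used by the MPI code itself.

```python
import math

# Rough per-process estimate from the reply above (mpd, manager, etc.).
# It does NOT include ports used by the MPI library in the application.
PORTS_PER_PROCESS = 6


def min_port_range_width(processes, hosts, ports_per_process=PORTS_PER_PROCESS):
    """Estimate the port range width needed on the busiest host.

    Assumes processes are spread evenly across hosts, so the busiest
    host runs ceil(processes / hosts) of them.
    """
    if processes < 0 or hosts < 1:
        raise ValueError("need processes >= 0 and hosts >= 1")
    per_host = math.ceil(processes / hosts)
    return per_host * ports_per_process


def range_width(spec):
    """Width of a LOW:HIGH range string, assuming both ends are included."""
    low, high = (int(part) for part in spec.split(":"))
    if high < low:
        raise ValueError("range is empty: " + spec)
    return high - low + 1


if __name__ == "__main__":
    available = range_width("47530:47540")
    for nprocs in (3, 4):
        needed = min_port_range_width(nprocs, hosts=3)
        verdict = "ok" if available >= needed else "probably too small"
        print(f"{nprocs} processes on 3 hosts: need >= {needed}, "
              f"have {available}: {verdict}")
```

Under these assumptions, the 3-host ring in the quoted message needs an
estimated 6 ports for 3 processes but 12 for 4, which is more than the
47530:47540 range provides. Treat the result as a lower bound for
experimentation, not a guarantee.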
On Fri Jul 14, at 4:26 PM, Michele Trenti wrote:
> Dear all,
>
> I am setting up for the first time MPICH2 on a small Linux Sun Opteron
> 64bits (dual cpu) cluster where the admin has a strict security
> policy, so I need to limit as much as possible the port range used.
> From previous posts in this list by Martin Schwinzerl (2006/07/06
> and 2006/07/05) and
> Ralph Butler (2006/07/06) I see that the job can be done using
> MPICH_PORT_RANGE.
>
> I managed to do this using mpich2-1.0.4-rc1 as suggested by Ralph
> Butler. I set up a port range with 10 ports and tried rings of up to 5
> nodes. Everything runs smoothly when the number of processes is
> smaller than or equal to the number of hosts. For higher number of
> processes mpiexec crashes (see below).
>
> My questions are:
>
> (1) Is the crash related to a port range too small?
>
> (2) If this is the case, what is the minimum number of ports to be
> defined in MPICH_PORT_RANGE for a cluster of N nodes (each node
> being a dual CPU unit) in order to have a properly working MPI
> environment?
>
> Thanks a lot for your help,
>
> Michele Trenti
>
> ------------------------------------------------
> example:
>
> udf2> mpdboot -n 3 -f mpd.host
>
> udf2> mpdtrace
> udf2
> udf4
> udf3
>
> udf3> echo $PORT_RANGE
> 47530:47540
>
> udf3> echo $MPICH_PORT_RANGE
> 47530:47540
>
> udf3> echo $MPIEXEC_PORT_RANGE
> 47530:47540
>
> udf3> mpiexec -n 3 ./cpi
> Process 0 of 3 is on udf3.stsci.edu
> Process 1 of 3 is on udf2.stsci.edu
> Process 2 of 3 is on udf4.stsci.edu
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.002473
>
> udf3> mpiexec -n 4 ./cpi
> [cli_0]: aborting job:
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(225)........: Initialization failed
> MPID_Init(81)................: channel initialization failed
> MPIDI_CH3_Init(35)...........:
> MPIDI_CH3I_Progress_init(305):
> MPIDU_Sock_listen(399).......: unable to bind socket to port
> (port=5856760,errno=98:(strerror() not found))
> rank 0 in job 2 udf3.stsci.edu_47530 caused collective abort of
> all ranks
> exit status of rank 0: return code 13
> udf3>
>
> ----------------------------------------
> A similar test for a larger ring, same PORT_RANGE:
>
> udf6> mpdboot -n 5 -f mpd.host
>
> udf6> mpdtrace
> udf6
> udf4
> udf3
> udf5
> udf2
>
> udf6> mpiexec -l -n 5 hostname
> 0: udf6.stsci.edu
> 1: udf4.stsci.edu
> 2: udf3.stsci.edu
> 4: udf5.stsci.edu
> 3: udf2.stsci.edu
> udf6> mpiexec -l -n 6 hostname
> mpiexec_udf6.stsci.edu (mpiexec 443): mpiexec: from man, invalid
> msg=:{}:
>
> -----------------------------------------
>
> System information :
> -----------------------------
>
> * mpich2-1.0.4-rc1, compiled with gcc version 3.4.5, o/s Red Hat
> Enterprise Linux WS R4 Kernel 2.6.9-34.0.2.ELsmp, python2.4,
>
> * network of up to 5 dual cpu Sun opteron 64bits,
>
> * firewall set to allow ssh/sshd from all IPs, plus all communications
> among cluster members on given PORT_RANGE
>
> ---------------------------------------------------------------------
>
> Michele Trenti
> Space Telescope Science Institute
> 3700 San Martin Drive Phone: +1 410 338 4987
> Baltimore MD 21218 U.S. Fax: +1 410 338 4767
>
> "For every complex natural phenomenon there is a simple, elegant,
> compelling, wrong explanation."
> Thomas Gold
>
>
>