[MPICH] mpiexec crash using MPICH_PORT_RANGE

Michele Trenti trenti at stsci.edu
Fri Jul 14 23:13:34 CDT 2006


Hi Ralph,

thank you very much for your prompt answer!

I will experiment starting from the numbers you suggested and, if the
list is interested, I will summarize my findings in a later post.

Best,

Michele

Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive                       Phone: +1 410 338 4987
Baltimore MD 21218 U.S.                       Fax: +1 410 338 4767

"For every complex natural phenomenon there is a simple, elegant,
compelling, wrong explanation."
                                             Thomas Gold



On Fri, 14 Jul 2006, Ralph Butler wrote:

> Hi Michele:
>
> The port range is a relatively new concept, and we have not spent
> much time looking into these kinds of minimums.  However, for starters,
> I think you can probably estimate a minimum of 6 ports per process
> running on any given host just for the mpd, manager, etc. running
> there.  This does not include the set of ports tied up by the MPI
> code invoked by application processes.
>
> So, if you are running a job that causes 2 processes to land on a
> single host, you probably want the range to be at least 12 wide
> even for non-MPI programs, and perhaps more for MPI programs.  Some
> experimentation can help to pin it down a bit more for specific apps.
>
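A minimal shell sketch of that sizing rule, assuming 2 processes per host and taking the 47530 base port from the transcript further down (the 6-ports-per-process figure is only the rough estimate quoted above):

PROCS_PER_HOST=2                      # dual-CPU nodes, 2 MPI processes each
PORTS_PER_PROC=6                      # rough per-process estimate (mpd, manager, etc.)
BASE_PORT=47530                       # first port the firewall leaves open
WIDTH=$((PROCS_PER_HOST * PORTS_PER_PROC))
export MPICH_PORT_RANGE=${BASE_PORT}:$((BASE_PORT + WIDTH - 1))
export MPIEXEC_PORT_RANGE=$MPICH_PORT_RANGE
echo $MPICH_PORT_RANGE                # 47530:47541, i.e. 12 ports

The firewall would of course have to leave the whole computed range open as well.
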
> On Fri, Jul 14, 2006, at 4:26 PM, Michele Trenti wrote:
>
>> Dear all,
>> 
>> I am setting up MPICH2 for the first time on a small Linux cluster of
>> dual-CPU, 64-bit Sun Opteron nodes where the admin has a strict security
>> policy, so I need to limit the port range used as much as possible.  From
>> previous posts on this list by Martin Schwinzerl (2006/07/06 and
>> 2006/07/05) and Ralph Butler (2006/07/06) I see that this can be done
>> using MPICH_PORT_RANGE.
>> 
>> I managed to do this with mpich2-1.0.4-rc1, as suggested by Ralph
>> Butler.  I set up a port range of 11 ports (47530:47540) and tried rings
>> of up to 5 nodes.  Everything runs smoothly when the number of processes
>> is smaller than or equal to the number of hosts.  For a larger number of
>> processes, mpiexec crashes (see below).
>> 
>> My questions are:
>> 
>> (1) Is the crash related to a port range too small?
>> 
>> (2) If so, what is the minimum number of ports that MPICH_PORT_RANGE must
>> span for a cluster of N nodes (each node being a dual-CPU unit) in order
>> to have a properly working MPI environment?
>> 
>> Thanks a lot for your help,
>> 
>> Michele Trenti
>> 
>> ------------------------------------------------
>> example:
>> 
>> udf2> mpdboot -n 3 -f mpd.host
>> 
>> udf2> mpdtrace
>> udf2
>> udf4
>> udf3
>> 
>> udf3> echo $PORT_RANGE
>> 47530:47540
>> 
>> udf3> echo $MPICH_PORT_RANGE
>> 47530:47540
>> 
>> udf3> echo $MPIEXEC_PORT_RANGE
>> 47530:47540
>> 
>> udf3> mpiexec -n 3 ./cpi
>> Process 0 of 3 is on udf3.stsci.edu
>> Process 1 of 3 is on udf2.stsci.edu
>> Process 2 of 3 is on udf4.stsci.edu
>> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
>> wall clock time = 0.002473
>> 
>> udf3> mpiexec -n 4 ./cpi
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(225)........: Initialization failed
>> MPID_Init(81)................: channel initialization failed
>> MPIDI_CH3_Init(35)...........:
>> MPIDI_CH3I_Progress_init(305):
>> MPIDU_Sock_listen(399).......: unable to bind socket to port 
>> (port=5856760,errno=98:(strerror() not found))
>> rank 0 in job 2  udf3.stsci.edu_47530   caused collective abort of all 
>> ranks
>>  exit status of rank 0: return code 13
>> udf3>
>> 
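If the port range is indeed the limiting factor, one way to test it, sketched here only as a guess, would be to widen the range (24 ports below, i.e. double the 12-port estimate), restart the ring, and retry the failing case; this assumes the mpds pick the range up from the environment, as the udf3.stsci.edu_47530 name above suggests, and that the firewall is opened for the wider range as well:

udf3> export MPICH_PORT_RANGE=47530:47553    # 24 ports: 2 procs/host x 6 ports, doubled for slack
udf3> export MPIEXEC_PORT_RANGE=$MPICH_PORT_RANGE
udf3> mpdallexit                             # tear down the old ring
udf3> mpdboot -n 3 -f mpd.host               # restart with the new range in the environment
udf3> mpiexec -n 4 ./cpi                     # the case that aborted above
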
>> ----------------------------------------
>> A similar test for a larger ring, same PORT_RANGE:
>> 
>> udf6> mpdboot -n 5 -f mpd.host
>> 
>> udf6> mpdtrace
>> udf6
>> udf4
>> udf3
>> udf5
>> udf2
>> 
>> udf6> mpiexec -l -n 5 hostname
>> 0: udf6.stsci.edu
>> 1: udf4.stsci.edu
>> 2: udf3.stsci.edu
>> 4: udf5.stsci.edu
>> 3: udf2.stsci.edu
>> udf6> mpiexec -l -n 6 hostname
>> mpiexec_udf6.stsci.edu (mpiexec 443): mpiexec: from man, invalid msg=:{}:
>> 
>> -----------------------------------------
>> 
>> System information :
>> -----------------------------
>> 
>> * mpich2-1.0.4-rc1, compiled with gcc 3.4.5; OS Red Hat Enterprise
>> Linux WS R4, kernel 2.6.9-34.0.2.ELsmp; Python 2.4
>> 
>> * network of up to 5 dual-CPU, 64-bit Sun Opteron nodes,
>> 
>> * firewall set to allow ssh/sshd from all IPs, plus all communications
>> among cluster members on the given PORT_RANGE (see the sketch below)
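A minimal iptables sketch of such a policy, purely illustrative (the 192.168.1.0/24 cluster subnet is an assumption, and only TCP is shown):

# accept ssh from anywhere
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# accept the MPICH port range, but only from other cluster members (subnet assumed)
iptables -A INPUT -p tcp -s 192.168.1.0/24 --dport 47530:47540 -j ACCEPT
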
>> 
>> ---------------------------------------------------------------------
>> 
>> Michele Trenti
>> Space Telescope Science Institute
>> 3700 San Martin Drive                       Phone: +1 410 338 4987
>> Baltimore MD 21218 U.S.                       Fax: +1 410 338 4767
>> 
>> "For every complex natural phenomenon there is a simple, elegant,
>> compelling, wrong explanation."
>>                                            Thomas Gold
>> 
>> 
>> 
>



