[MPICH] mpiexec crash using MPICH_PORT_RANGE
Michele Trenti
trenti at stsci.edu
Fri Jul 14 23:13:34 CDT 2006
Hi Ralph,
thank you very much for your prompt answer!
I will experiment starting from the numbers you suggested, and if the
list is interested, I will summarize my findings in a later post.
Best,
Michele
Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive Phone: +1 410 338 4987
Baltimore MD 21218 U.S. Fax: +1 410 338 4767
"For every complex natural phenomenon there is a simple, elegant,
compelling, wrong explanation."
Thomas Gold
On Fri, 14 Jul 2006, Ralph Butler wrote:
> Hi Michele:
>
> The port range is a relatively new concept and we have not spent
> much time looking into these kinds of minimums. However, for starters,
> I think you can probably estimate a minimum of 6 ports per process
> running on any given host, just for the mpd, manager, etc. running
> there. This does not include the set of ports tied up by the MPI
> code invoked by the application processes.
>
> So, if you are running a job that causes 2 processes to land on a
> single host, you probably want the range to be at least 12 wide
> even for non-MPI programs, and perhaps wider for MPI programs. Some
> experimentation can help pin it down a bit more for specific apps.
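Ralph's rule of thumb can be sketched as a small shell helper (a hedged sketch: the variable names are hypothetical, and the factor of 6 is the estimate from this thread, not a documented MPICH constant):

```shell
# Estimate a minimal MPICH_PORT_RANGE from the rule of thumb above:
# roughly 6 ports per process that may land on one host.
PROCS_PER_HOST=2        # worst case for a dual-CPU node
BASE_PORT=47530         # first port the firewall leaves open
WIDTH=$(( PROCS_PER_HOST * 6 ))
export MPICH_PORT_RANGE="${BASE_PORT}:$(( BASE_PORT + WIDTH - 1 ))"
export MPIEXEC_PORT_RANGE="$MPICH_PORT_RANGE"
echo "$MPICH_PORT_RANGE"
```

For a dual-CPU node this yields a 12-port range, 47530:47541, one port wider than the 47530:47540 range used in the failing runs below, which is consistent with the crash appearing once two processes share a host.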
>
> On Fri, Jul 14, 2006, at 4:26 PM, Michele Trenti wrote:
>
>> Dear all,
>>
>> I am setting up MPICH2 for the first time on a small Linux cluster of
>> 64-bit dual-CPU Sun Opterons, where the admin has a strict security
>> policy, so I need to limit the port range used as much as possible. From previous
>> posts in this list by Martin Schwinzerl (2006/07/06 and 2006/07/05) and
>> Ralph Butler (2006/07/06) I see that the job can be done using
>> MPICH_PORT_RANGE.
>>
>> I managed to do this using mpich2-1.0.4-rc1 as suggested by Ralph
>> Butler. I set up a port range with 10 ports and tried rings of up to 5
>> nodes. Everything runs smoothly when the number of processes is
>> less than or equal to the number of hosts. For a larger number of
>> processes, mpiexec crashes (see below).
>>
>> My questions are:
>>
>> (1) Is the crash related to a port range too small?
>>
>> (2) If this is the case, what is the minimum number of ports to be defined
>> in MPICH_PORT_RANGE for a cluster of N nodes (each node being a dual CPU
>> unit) in order to have a properly working MPI environment?
>>
>> Thanks a lot for your help,
>>
>> Michele Trenti
>>
>> ------------------------------------------------
>> example:
>>
>> udf2> mpdboot -n 3 -f mpd.host
>>
>> udf2> mpdtrace
>> udf2
>> udf4
>> udf3
>>
>> udf3> echo $PORT_RANGE
>> 47530:47540
>>
>> udf3> echo $MPICH_PORT_RANGE
>> 47530:47540
>>
>> udf3> echo $MPIEXEC_PORT_RANGE
>> 47530:47540
>>
>> udf3> mpiexec -n 3 ./cpi
>> Process 0 of 3 is on udf3.stsci.edu
>> Process 1 of 3 is on udf2.stsci.edu
>> Process 2 of 3 is on udf4.stsci.edu
>> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
>> wall clock time = 0.002473
>>
>> udf3> mpiexec -n 4 ./cpi
>> [cli_0]: aborting job:
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(225)........: Initialization failed
>> MPID_Init(81)................: channel initialization failed
>> MPIDI_CH3_Init(35)...........:
>> MPIDI_CH3I_Progress_init(305):
>> MPIDU_Sock_listen(399).......: unable to bind socket to port
>> (port=5856760,errno=98:(strerror() not found))
>> rank 0 in job 2 udf3.stsci.edu_47530 caused collective abort of all
>> ranks
>> exit status of rank 0: return code 13
>> udf3>
>>
>> ----------------------------------------
>> A similar test for a larger ring, same PORT_RANGE:
>>
>> udf6> mpdboot -n 5 -f mpd.host
>>
>> udf6> mpdtrace
>> udf6
>> udf4
>> udf3
>> udf5
>> udf2
>>
>> udf6> mpiexec -l -n 5 hostname
>> 0: udf6.stsci.edu
>> 1: udf4.stsci.edu
>> 2: udf3.stsci.edu
>> 4: udf5.stsci.edu
>> 3: udf2.stsci.edu
>> udf6> mpiexec -l -n 6 hostname
>> mpiexec_udf6.stsci.edu (mpiexec 443): mpiexec: from man, invalid msg=:{}:
>>
>> -----------------------------------------
>>
>> System information :
>> -----------------------------
>>
>> * mpich2-1.0.4-rc1, compiled with gcc version 3.4.5, o/s Red Hat
>> Enterprise Linux WS R4 Kernel 2.6.9-34.0.2.ELsmp, python2.4,
>>
>> * network of up to 5 dual cpu Sun opteron 64bits,
>>
>> * firewall set to allow ssh/sshd from all IPs, plus all communications
>> among cluster members on given PORT_RANGE
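The firewall policy described above might be expressed with iptables rules along these lines (a sketch only: the 10.0.0.0/24 cluster subnet is an assumption, and the exact chains depend on the site's existing ruleset):

```shell
# Allow TCP on the agreed MPICH port range between cluster members only
# (10.0.0.0/24 is a placeholder for the cluster's private subnet).
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 47530:47540 -j ACCEPT
# Allow ssh/sshd from all IPs, as described in the post.
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
```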
>>
>> ---------------------------------------------------------------------
>>