[mpich-discuss] Configuration considerations for 1500+ nodes

Hiatt, Dave M dave.m.hiatt at citi.com
Tue Oct 20 13:13:35 CDT 2009

It is indeed a limit.  Most versions of Linux default to 1024 unless you actively change it or configure a different default up front.  The number of open files, allocations, and sockets a process can hold is capped by the ulimit (which defaults to 1024).  So if you have a larger grid, say more than 1000 processors, and need them all to be able to communicate individually with, say, Node 0, increase the ulimit.
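
For anyone who wants to check this from inside a process rather than from the shell, here is a minimal sketch using Python's standard resource module.  It assumes the limit in question is the per-process open-file limit (RLIMIT_NOFILE, what "ulimit -n" reports).  The function name ensure_fd_limit and the headroom of 64 descriptors are my own illustrative choices, not anything MPICH2 provides or requires.

```python
import resource


def ensure_fd_limit(needed):
    """Raise the soft open-file limit to at least `needed`, capped at the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < needed:
        # An unprivileged process may raise its soft limit only up to the hard limit.
        target = needed if hard == resource.RLIM_INFINITY else min(needed, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    ranks = 1500      # size of the run discussed in this thread
    headroom = 64     # assumed allowance for stdio, logs, libraries, etc.
    soft, hard = ensure_fd_limit(ranks + headroom)
    print("open-file limit: soft=%s hard=%s" % (soft, hard))
```

This only changes the limit for the calling process and its children, and it cannot go past the hard limit.  If the hard limit itself is 1024, it has to be raised by the administrator (on Red Hat style systems that is usually done in /etc/security/limits.conf) before the job is launched.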

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov]On Behalf Of Hiatt, Dave M
Sent: Monday, October 19, 2009 3:51 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] Configuration considerations for 1500+ nodes

I'm just getting started trying to scale up a working app.  It runs on 400 and 800 nodes, but scaling to 1500 and above I get "Connection refused" errors as the true computation starts.  I'm running RH 5.3 and I'm wondering if this might be some kind of OS resource limitation that MPICH2 is running into trying to open sockets for each of the nodes.

The profile of operation is that each node sends a very small (44 bytes) message on start up as the app "takes roll" as it were.  And when I get large enough numbers of nodes, I start seeing this kind of behavior.  Up to now I've been using the defaults in my build of MPICH2 (I'm running 1.07).  There are a number of possible configuration parameters I could change in MPI.  I'm using CH3 and was thinking of changing that to Nemesis, but I'm more suspicious of the OS on this particular error because the compute nodes are being refused connection, not having it fail, so it really sounds like an OS resource issue.

My guess is that I need to configure my master to cope with large numbers of connections, but I was wondering if anyone had suggestions on where to start.


If you lived here you'd be home by now
Dave Hiatt
Manager, Market Risk Systems Integration
CitiMortgage, Inc.
1000 Technology Dr.
Third Floor East, M.S. 55
O'Fallon, MO 63368-2240

Phone:  636-261-1408
Mobile: 314-452-9165
FAX:    636-261-1312
Email:     Dave.M.Hiatt at citigroup.com

mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
