[mpich-discuss] Configuration considerations for 1500+ nodes
balaji at mcs.anl.gov
Mon Oct 19 16:33:50 CDT 2009
This does seem like an OS resource limitation. But before we go deeper
into that, it might be worthwhile to upgrade to the latest MPICH2
release (1.2). 1.0.7 is very old.
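To check the OS side in the meantime, a small script along these lines could be run on the master node. It is only a sketch: it reads the per-process open-file limit and two Linux TCP backlog settings and compares them with the number of ranks expected to connect. The function names, the headroom of 64 descriptors, and the assumption that these particular limits are the ones involved are illustrative and not confirmed in this thread.

```python
import resource
from pathlib import Path


def read_sysctl(path):
    """Return the integer stored in a /proc/sys file, or None if unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def check_limits(expected_peers, headroom=64):
    """Compare common OS limits with the expected number of peer connections.

    headroom is an arbitrary allowance for stdio, log files and so on.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # One socket per peer is assumed here; the real count depends on the channel.
    fd_ok = soft == resource.RLIM_INFINITY or soft >= expected_peers + headroom
    return {
        "nofile_soft": soft,
        "nofile_hard": hard,
        "nofile_sufficient": fd_ok,
        # Upper bound the kernel applies to the backlog passed to listen().
        "somaxconn": read_sysctl("/proc/sys/net/core/somaxconn"),
        # Queue length for connections that have not completed the handshake.
        "tcp_max_syn_backlog": read_sysctl(
            "/proc/sys/net/ipv4/tcp_max_syn_backlog"
        ),
    }


if __name__ == "__main__":
    for name, value in check_limits(1500).items():
        print(f"{name}: {value}")
```

If nofile_soft is below the number of ranks, or the backlog values are small compared with the number of ranks connecting at once, those would be the first settings to look at. A value of None means the file could not be read on that system.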
On 10/19/2009 03:51 PM, Hiatt, Dave M wrote:
> I'm just getting started trying to scale up a working app. It runs on 400 and 800 nodes, but scaling to 1500 and above I get "Connection refused" errors as the true computation starts. I'm running RH 5.3 and I'm wondering if this might be some kind of OS resource limitation that MPICH2 is running into trying to open sockets for each of the nodes.
> The profile of operation is that each node sends a very small (44-byte) message on startup as the app "takes roll", as it were. Once the number of nodes gets large enough, I start seeing this kind of behavior. Up to now I've been using the defaults in my build of MPICH2 (I'm running 1.0.7). There are a number of possible configuration parameters I could change in MPI. I'm using CH3 and was thinking of changing that to Nemesis, but I'm more suspicious of the OS on this particular error because the compute nodes are being refused a connection rather than having an established connection fail, so it really sounds like an OS resource issue.
> My guess is that I need to configure my master to cope with large numbers of connections, but I was wondering if anyone had suggestions on where to start.
> If you lived here you'd be home by now
> Dave Hiatt
> Manager, Market Risk Systems Integration
> CitiMortgage, Inc.
> 1000 Technology Dr.
> Third Floor East, M.S. 55
> O'Fallon, MO 63368-2240
> Phone: 636-261-1408
> Mobile: 314-452-9165
> FAX: 636-261-1312
> Email: Dave.M.Hiatt at citigroup.com