pNetCDF problem with WRF on large core counts

Rob Latham robl at mcs.anl.gov
Mon Aug 5 08:50:50 CDT 2013


On Mon, Aug 05, 2013 at 06:38:56AM -0600, Michalakes, John wrote:
> Hi,
> 
> We're running into problems running WRF with pNetCDF, and it may have something to do with which MPI implementation we use.  Both Intel MPI and OpenMPI fail (processes hang) at MPI task counts greater than 256.  MVAPICH2 works, however.  This is using the Lustre file system on a Sandy Bridge Linux cluster here.  Are you aware of any task limits associated with MPI-IO in these implementations that might be causing the problem?  Any ideas for reconfiguring?  There's a little more information in the email stream below.

"n0224:c7f1:167b2700: 82084370 us(82084370 us!!!):  CONN_RTU read: sk 77 ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0"


Something timed out.  Either the InfiniBand layer or the Lustre
layer is -- rightly -- astonished that a request took 82 seconds.

So, what's different about Intel MPI, OpenMPI, and MVAPICH2 with
respect to Lustre?  Can you give me version numbers for the three
packages?  I'm asking because over the years we have improved the
Lustre driver in MPICH's ROMIO, thanks to community contributions.
I *thought* those changes had made it into the various "downstream"
MPI implementations.
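
If it helps, here is a minimal sketch (not from WRF) for collecting
the library version string from inside a job.  It assumes all three
installed MPIs are new enough to provide the MPI-3.0 call
MPI_Get_library_version; otherwise each implementation's own version
utility (e.g. ompi_info for OpenMPI) gives the same information.

    /* Sketch: print the MPI library version string from rank 0.
     * Assumes an MPI-3.0 library (MPI_Get_library_version). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_library_version(version, &len);
        if (rank == 0)
            printf("MPI library: %s\n", version);
        MPI_Finalize();
        return 0;
    }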

MVAPICH2 at one point (and maybe still) had an alternate Lustre
driver, which may explain why it performs well.  As it turns out, when
an MPI implementation pays attention to a file system, good things
happen.  Go figure!
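
A cheap experiment on your end, below: create a file through pnetcdf
with the ROMIO Lustre hints set explicitly in the MPI_Info, and see
whether the resulting striping differs across the three MPIs (lfs
getstripe on the output file will show it).  This is only a sketch,
not WRF code; the hint values (4 stripes, 1 MiB stripe size, data
sieving disabled for writes) are placeholders, and whether the hints
are honored at all depends on which ADIO driver each MPI was built
with.

    /* Sketch: create a file through PnetCDF while passing ROMIO/Lustre
     * hints via MPI_Info.  striping_factor, striping_unit, and
     * romio_ds_write are standard ROMIO hint names; the values here are
     * placeholders, not a recommendation. */
    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int ncid, err;
        MPI_Info info;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "4");        /* stripe count    */
        MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MiB stripes   */
        MPI_Info_set(info, "romio_ds_write",  "disable");  /* no data sieving */

        err = ncmpi_create(MPI_COMM_WORLD, "hint_test.nc",
                           NC_CLOBBER | NC_64BIT_OFFSET, info, &ncid);
        if (err != NC_NOERR)
            fprintf(stderr, "ncmpi_create: %s\n", ncmpi_strerror(err));
        else
            ncmpi_close(ncid);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

If MVAPICH2 picks the hints up and the other two ignore them, that
points at the MPI-IO layer rather than at pnetcdf itself.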

==rob

> 
> -----Original Message-----
> From: Regimbal, Kevin 
> Sent: Sunday, August 04, 2013 1:33 PM
> To: Michalakes, John
> 
> Hi John,
> 
> I've been playing with parallel-netcdf this weekend.  As far as I can tell, pnetcdf does not work at large core counts (e.g. 4096) with either Intel MPI or OpenMPI.  It does work with MVAPICH2 at 4096 cores.  I added a build for MVAPICH2 and a pnetcdf build that ties to Intel MPI & MVAPICH2.
> 
> It's probably going to take a while to track down why large core counts don't work.  Not sure whether the issue is pnetcdf or MPI-IO in the other MPIs.
> 
> Kevin
> ________________________________________
> From: Michalakes, John
> Sent: Thursday, August 01, 2013 3:42 PM
> Cc: Regimbal, Kevin
> 
> [...]
> 
> Regarding pNetCDF, this time the executable just hung reading the first input file.  I then did another run and made sure to put an lfs setstripe -c 4 . command in the run script.  It hung again, but this time at least one of the tasks output this strange message before hanging:
> 
> n0224:c7f1:167b2700: 82084370 us(82084370 us!!!):  CONN_RTU read: sk 77 ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0
> 
> [...]
> 
> I tried another run, this time copying the input data into the directory instead of accessing it via a symlink.  Then I saw this:
> 
> n0240:c9e5:50cde700: 64137185 us(64137185 us!!!):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
> n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
> n0240:c9e5:50cde700: 64173162 us(35935 us):  CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
> n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
> 

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

