pNetCDF problem with WRF on large core counts
Rob Latham
robl at mcs.anl.gov
Mon Aug 5 13:27:36 CDT 2013
> From: "Kevin Regimbal" <Kevin.Regimbal at nrel.gov>
> To: "Rob Latham" <robl at mcs.anl.gov>, "John Michalakes" <John.Michalakes at nrel.gov>
> Cc: parallel-netcdf at mcs.anl.gov, "Wesley Jones" <Wesley.Jones at nrel.gov>, "Ilene Carpenter" <Ilene.Carpenter at nrel.gov>
> Sent: Monday, August 5, 2013 11:17:44 AM
> Subject: Re: pNetCDF problem with WRF on large core counts
>
> Intel MPI version 4.1.0
Released September 2012. I don't know what version of MPICH it is based on, but it's recent enough that I'm pretty sure it has all the MPICH fixes.
> OpenMPI version 1.6.4
Released in February of this year; its Lustre ADIO driver is synced up with MPICH's.
> MVapich2 version 1.8.1
Released April 2012.
I'll have to see what magic, if any, MVAPICH2 adds to (or maybe subtracts from) MPICH's ROMIO.
>
> We're running lustre 1.8.9 clients and 2.1.3 on the MDS/OSSs
>
> I was running the perform-test-pnetcdf.c I found on this page:
> http://software.intel.com/en-us/forums/topic/373166
>
The floating point error that occurs when the number of procs or the number of stripes is too large was fixed in MPICH nearly 3 years ago. I think that's enough time for these downstream implementations to pick up that change.
The test looks pretty straightforward:
- make ten files
- each file has 50 variables
- each variable is of type float
- each variable has dimensions 4096 by 2048
- each process writes a patch of LAT/lat_procs by LON/lon_procs
So I guess when there are more than 4k processors, the size of one or both dimensions must get pretty small; this is, after all, a strong scaling test. It shouldn't be small enough to break anything, of course.
So, how are 4k processors decomposed? Each one will move 8 KiB of data per variable. Let's say 128 x 32, perhaps? That would result in each client writing a 32 x 64 patch of floats (see the sketch below). That doesn't seem problematic from a pnetcdf standpoint, but it will be a pretty aggressive test of the MPI-IO layer.
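For concreteness, here is a minimal pnetcdf sketch of that write pattern: a single 4096 x 2048 float variable decomposed over an assumed 128 x 32 process grid, with each rank writing its 32 x 64 patch collectively. This is just an illustration of the access pattern, not the perform-test-pnetcdf.c source; the file name, variable name, and decomposition are made up.

/* Minimal sketch (not the actual test source): one 4096 x 2048 float
 * variable, a hypothetical 128 x 32 process grid, each rank writing a
 * 32 x 64 patch -- 8 KiB -- through a collective pnetcdf call. */
#include <mpi.h>
#include <pnetcdf.h>

#define LAT 4096
#define LON 2048

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimids[2], varid;
    int lat_procs = 128, lon_procs = 32;   /* assumed decomposition */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != lat_procs * lon_procs)
        MPI_Abort(MPI_COMM_WORLD, 1);      /* sketch assumes 4096 ranks */

    MPI_Offset count[2] = { LAT / lat_procs, LON / lon_procs };  /* 32 x 64 */
    MPI_Offset start[2] = { (rank / lon_procs) * count[0],
                            (rank % lon_procs) * count[1] };

    float buf[32 * 64];                    /* 8 KiB of data per rank */
    for (int i = 0; i < 32 * 64; i++) buf[i] = (float)rank;

    ncmpi_create(MPI_COMM_WORLD, "patch.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lat", LAT, &dimids[0]);
    ncmpi_def_dim(ncid, "lon", LON, &dimids[1]);
    ncmpi_def_var(ncid, "var0", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* collective write of this rank's 32 x 64 patch */
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}

At 8 KiB per rank per variable, the interesting part is entirely in how the MPI-IO layer aggregates those 4096 small, interleaved patches.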
> I tested both scaling the number of processes and lustre stripe sizes
> with parallel-netcdf version 1.3.1
>
> The test runs successfully at lfs stripe sizes of 1, 64, and 100 in all
> three MPIs.
> The test runs successfully for 2048 cores in all three MPIs.
> The test only runs successfully on mvapich2 for 4096 cores.
Wei-keng has a lot more Lustre experience than I do, but he's on vacation in Taiwan right now. I'm sure he'll respond to you but it might not be with his customary low latency :>
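One thing that might be worth experimenting with while he's away: ROMIO accepts Lustre striping hints through the MPI_Info object that pnetcdf passes down to MPI-IO, so striping can be requested from the code itself rather than via lfs setstripe on the directory. A rough sketch, assuming the standard ROMIO hint names striping_factor and striping_unit; the file name and values here are just placeholders:

#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: ask ROMIO for a stripe count of 4 and a 1 MiB stripe size at
 * file-creation time via MPI_Info hints (values are placeholders). */
int create_with_striping(MPI_Comm comm, int *ncidp)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");      /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");  /* stripe size, bytes  */

    err = ncmpi_create(comm, "wrfout_test.nc",
                       NC_CLOBBER | NC_64BIT_OFFSET, info, ncidp);
    MPI_Info_free(&info);
    return err;
}

Striping only takes effect when the file is created, so the hints have to be in place before the ncmpi_create call.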
>
> Kevin
>
>
> On 8/5/13 7:50 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
> >On Mon, Aug 05, 2013 at 06:38:56AM -0600, Michalakes, John wrote:
> >> Hi,
> >>
> >> We're running into problems running WRF with pNetCDF and it may have
> >> something to do with which MPI implementation we use. Both Intel MPI
> >> and OpenMPI fail (hang processes) on MPI task counts greater than
> >> 256. Mvapich2 works, however. This is using the Lustre file system
> >> on a Sandybridge Linux cluster here. Are you aware of any task
> >> limits associated with MPI-IO in these implementations that might be
> >> causing the problem? Any ideas for reconfiguring? There's a little
> >> more information in the email stream below.
> >
> >"n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read:
> >sk 77
> >ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0
> >R-0"
> >
> >
> >Something timed out. Either the infiniband layer or the Lustre layer
> >is -- rightly -- astonished that a request took 82 seconds.
> >
> >So, what's different about Intel MPI, OpenMPI and Mvapich2 with
> >respect to Lustre? Can you give me version numbers for the three
> >packages? I'm asking because we have over the years improved the
> >Lustre driver in MPICH's ROMIO thanks to community contributions. I
> >*thought* those changes had made it into the various "downstream" MPI
> >implementations.
> >
> >MVAPICH2 at one point (and maybe still) had an alternate Lustre
> >driver, which may explain why it performs well. As it turns out, when
> >an MPI implementation pays attention to a file system, good things
> >happen. Go figure!
> >
> >==rob
> >
> >>
> >> -----Original Message-----
> >> From: Regimbal, Kevin
> >> Sent: Sunday, August 04, 2013 1:33 PM
> >> To: Michalakes, John
> >>
> >> Hi John,
> >>
> >> I've been playing with parallel-netcdf this weekend. As far as I can
> >> tell, pnetcdf does not work at large core counts (i.e. 4096) for
> >> either intel MPI or openMPI. It does work with mvapich2 at 4096
> >> cores. I added a build for mvapich2 and a pnetcdf that ties to intel
> >> & mvapich2.
> >>
> >> It's probably going to take a while to track down why large core
> >> counts don't work. Not sure if the issue is pnetcdf or MPIIO on the
> >> other MPIs.
> >>
> >> Kevin
> >> ________________________________________
> >> From: Michalakes, John
> >> Sent: Thursday, August 01, 2013 3:42 PM
> >> Cc: Regimbal, Kevin
> >>
> >> [...]
> >>
> >> Regarding pNetCDF, this time the executable just hung reading the
> >> first input file. I then did another run and made sure to put an
> >> lfs setstripe -c 4 . command in the runscript. It hung again but
> >> this time at least one of the tasks output this strange message
> >> before hanging:
> >>
> >> n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read: sk 77 ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0
> >>
> >> [...]
> >>
> >> I tried another run, this time copying the input data into the
> >> directory instead of accessing it via a symlink. Then I saw this:
> >>
> >> n0240:c9e5:50cde700: 64137185 us(64137185 us!!!): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
> >> n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
> >> n0240:c9e5:50cde700: 64173162 us(35935 us): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
> >> n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
> >>
> >
> >--
> >Rob Latham
> >Mathematics and Computer Science Division
> >Argonne National Lab, IL USA
>
>