pNetCDF problem with WRF on large core counts
Rob Latham
robl at mcs.anl.gov
Mon Aug 5 13:27:36 CDT 2013
> From: "Kevin Regimbal" <Kevin.Regimbal at nrel.gov>
> To: "Rob Latham" <robl at mcs.anl.gov>, "John Michalakes" <John.Michalakes at nrel.gov>
> Cc: parallel-netcdf at mcs.anl.gov, "Wesley Jones" <Wesley.Jones at nrel.gov>, "Ilene Carpenter" <Ilene.Carpenter at nrel.gov>
> Sent: Monday, August 5, 2013 11:17:44 AM
> Subject: Re: pNetCDF problem with WRF on large core counts
>
> Intel MPI version 4.1.0
Released September 2012. I don't know what version of MPICH it is based on, but it's recent enough that I'm pretty sure it has all the MPICH fixes.
> OpenMPI version 1.6.4
Released in February of this year; its Lustre ADIO driver is synced up with MPICH's.
> MVapich2 version 1.8.1
Released April 2012.
I'll have to see what magic, if any, MVAPICH2 adds to (or maybe subtracts from) MPICH's ROMIO.
>
> We're running lustre 1.8.9 clients and 2.1.3 on the MDS/OSSs
>
> I was running the perform-test-pnetcdf.c I found on this page:
> http://software.intel.com/en-us/forums/topic/373166
>
The floating point error that occurs when the number of procs or the number of stripes is too large was fixed in MPICH nearly 3 years ago. I think that's enough time for these downstream implementations to pick up that change.
The test looks pretty straightforward:
- make ten files
- each file has 50 variables
- each variable is of type float
- each variable has dimensions 4096 by 2048
- each process writes a patch of LAT/lat_procs by LON/lon_procs
So I guess when there are more than 4k processors, the size of one or both dimensions must get pretty small; this is, after all, a strong scaling test. It shouldn't be small enough to break anything, of course.
So, how are 4k processors decomposed? Each one will move 8 KiB of data per variable. Let's say 128 x 32, perhaps? That would result in each client writing a 32 x 64 patch of floats (see the sketch below). That doesn't seem problematic from a pnetcdf standpoint, but it will be a pretty aggressive test of the MPI-IO layer.
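For concreteness, here is a minimal pnetcdf sketch of that write pattern: a single 4096 x 2048 float variable decomposed over an assumed 128 x 32 process grid, with each rank writing its 32 x 64 patch collectively. This is just an illustration of the access pattern, not the perform-test-pnetcdf.c source; the file name, variable name, and decomposition are made up.

/* Minimal sketch (not the actual test source): one 4096 x 2048 float
 * variable, a hypothetical 128 x 32 process grid, each rank writing a
 * 32 x 64 patch -- 8 KiB -- through a collective pnetcdf call. */
#include <mpi.h>
#include <pnetcdf.h>

#define LAT 4096
#define LON 2048

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimids[2], varid;
    int lat_procs = 128, lon_procs = 32;   /* assumed decomposition */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != lat_procs * lon_procs)
        MPI_Abort(MPI_COMM_WORLD, 1);      /* sketch assumes 4096 ranks */

    MPI_Offset count[2] = { LAT / lat_procs, LON / lon_procs };  /* 32 x 64 */
    MPI_Offset start[2] = { (rank / lon_procs) * count[0],
                            (rank % lon_procs) * count[1] };

    float buf[32 * 64];                    /* 8 KiB of data per rank */
    for (int i = 0; i < 32 * 64; i++) buf[i] = (float)rank;

    ncmpi_create(MPI_COMM_WORLD, "patch.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lat", LAT, &dimids[0]);
    ncmpi_def_dim(ncid, "lon", LON, &dimids[1]);
    ncmpi_def_var(ncid, "var0", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* collective write of this rank's 32 x 64 patch */
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}

At 8 KiB per rank per variable, the interesting part is entirely in how the MPI-IO layer aggregates those 4096 small, interleaved patches.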
> I tested both scaling the number of processes and lustre stripe sizes
> with parallel-netcdf version 1.3.1
>
> The test runs successfully at lfs stripe sizes of 1, 64, and 100 in all
> three MPIs.
> The test runs successfully for 2048 cores in all three MPIs.
> The test only runs successfully on mvapich2 for 4096 cores.
Wei-keng has a lot more Lustre experience than I do, but he's on vacation in Taiwan right now. I'm sure he'll respond to you but it might not be with his customary low latency :>
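One thing that might be worth experimenting with while he's away: ROMIO accepts Lustre striping hints through the MPI_Info object that pnetcdf passes down to MPI-IO, so striping can be requested from the code itself rather than via lfs setstripe on the directory. A rough sketch, assuming the standard ROMIO hint names striping_factor and striping_unit; the file name and values here are just placeholders:

#include <mpi.h>
#include <pnetcdf.h>

/* Sketch: ask ROMIO for a stripe count of 4 and a 1 MiB stripe size at
 * file-creation time via MPI_Info hints (values are placeholders). */
int create_with_striping(MPI_Comm comm, int *ncidp)
{
    MPI_Info info;
    int err;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");      /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");  /* stripe size, bytes  */

    err = ncmpi_create(comm, "wrfout_test.nc",
                       NC_CLOBBER | NC_64BIT_OFFSET, info, ncidp);
    MPI_Info_free(&info);
    return err;
}

Striping only takes effect when the file is created, so the hints have to be in place before the ncmpi_create call.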
>
> Kevin
>
>
> On 8/5/13 7:50 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
> >On Mon, Aug 05, 2013 at 06:38:56AM -0600, Michalakes, John wrote:
> >> Hi,
> >>
> >> We're running into problems running WRF with pNetCDF and it may have
> >> something to do with which MPI implementation we use. Both Intel MPI
> >> and OpenMPI fail (hang processes) on MPI task counts greater than
> >> 256. Mvapich2 works, however. This is using the Lustre file system
> >> on a Sandybridge Linux cluster here. Are you aware of any task
> >> limits associated with MPI-IO in these implementations that might be
> >> causing the problem? Any ideas for reconfiguring? There's a little
> >> more information in the email stream below.
> >
> >"n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read:
> >sk 77
> >ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0
> >R-0"
> >
> >
> >Something timed out. Either the infiniband layer or the Lustre layer
> >is -- rightly -- astonished that a request took 82 seconds.
> >
> >So, what's different about Intel MPI, OpenMPI and Mvapich2 with
> >respect to Lustre? Can you give me version numbers for the three
> >packages? I'm asking because we have over the years improved the
> >Lustre driver in MPICH's ROMIO thanks to community contributions. I
> >*thought* those changes had made it into the various "downstream" MPI
> >implementations.
> >
> >MVAPICH2 at one point (and maybe still) had an alternate Lustre
> >driver, which may explain why it performs well. As it turns out, when
> >an MPI implementation pays attention to a file system, good things
> >happen. Go figure!
> >
> >==rob
> >
> >>
> >> -----Original Message-----
> >> From: Regimbal, Kevin
> >> Sent: Sunday, August 04, 2013 1:33 PM
> >> To: Michalakes, John
> >>
> >> Hi John,
> >>
> >> I've been playing with parallel-netcdf this weekend. As far as I can
> >> tell, pnetcdf does not work at large core counts (i.e. 4096) for
> >> either intel MPI or openMPI. It does work with mvapich2 at 4096
> >> cores. I added a build for mvapich2 and a pnetcdf that ties to intel
> >> & mvapich2.
> >>
> >> It's probably going to take a while to track down why large core
> >> counts don't work. Not sure if the issue is pnetcdf or MPIIO on the
> >> other MPIs.
> >>
> >> Kevin
> >> ________________________________________
> >> From: Michalakes, John
> >> Sent: Thursday, August 01, 2013 3:42 PM
> >> Cc: Regimbal, Kevin
> >>
> >> [...]
> >>
> >> Regarding pNetCDF, this time the executable just hung reading the
> >> first input file. I then did another run and made sure to put an
> >> lfs setstripe -c 4 . command in the runscript. It hung again but
> >> this time at least one of the tasks output this strange message
> >> before hanging:
> >>
> >> n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read: sk 77 ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0
> >>
> >> [...]
> >>
> >> I tried another run, this time copying the input data into the
> >> directory instead of accessing it via a symlink. Then I saw this:
> >>
> >> n0240:c9e5:50cde700: 64137185 us(64137185 us!!!): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
> >> n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
> >> n0240:c9e5:50cde700: 64173162 us(35935 us): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
> >> n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
> >>
> >
> >--
> >Rob Latham
> >Mathematics and Computer Science Division
> >Argonne National Lab, IL USA
>
>