pNetCDF problem with WRF on large core counts
Wei-keng Liao
wkliao at ece.northwestern.edu
Mon Aug 5 17:59:32 CDT 2013
Hi, Kevin
>> OpenMPI version 1.6.4
>
> released in February of this year; its Lustre ADIO driver is synced up with MPICH's.
Actually, OpenMPI 1.6.4 has not completely caught up with the latest MPICH.
I found an error when testing a PnetCDF program. The fix can be
found at https://trac.mpich.org/projects/mpich/changeset/71d6dfe74c91596ed6e7119b18518757dcd285cd/src/mpi/romio/adio/include/adioi.h
>> MVapich2 version 1.8.1
>
> April 2012.
> I'll have to see what, if any, magic MVAPICH2 adds to (or maybe subtracts from?) MPICH's ROMIO.
I have never tried MVapich2 or Intel MPI, but
I have run 40K MPI processes using Cray's MPI without problems.
(The application is GCRM, the machine is Hopper @ NERSC, a Lustre file system with up to 156 OSTs.)
(Karen Schuchardt has run 80K processes there successfully.)
>> We're running lustre 1.8.9 clients and 2.1.3 on the MDS/OSSs
>>
>> I was running the perform-test-pnetcdf.c I found on this page:
>> http://software.intel.com/en-us/forums/topic/373166
>>
>
> The floating point error when the number of procs or the number of stripes is too large has been fixed in MPICH for nearly 3 years. I think that's enough time for these downstream implementations to pick up that change.
>
> The test looks pretty straightforward:
> - make ten files
> - each file has 50 variables
> - each variable is of type float
> - each variable has dimensions 4096 by 2048
> - each process writes a patch of size LAT/lat_procs by LON/lon_procs
>
> So, I guess when there are more than 4k processors the size of one or both dimensions must get pretty small. This is, after all, a strong scaling test. Shouldn't be small enough to break something, of course.
>
> So, how are 4k processors decomposed? Each one will move 8 KiB of data. Let's say 128 x 32 perhaps? That would result in each client doing 32x64 floats. Doesn't seem problematic from a pnetcdf standpoint. It will, though, be a pretty aggressive test of the MPI-IO layer.
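For reference, the access pattern Rob describes boils down to something like the minimal sketch below: one 4096 x 2048 float variable, a balanced 2-D process grid, and each rank writing its own start/count patch with PnetCDF's collective API. The file name, variable name, and grid choice here are illustrative only; this is not code copied from the attached test program, and error checking is omitted (see the macro further down).

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define LAT 4096
    #define LON 2048

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimid[2], varid;
        int dims[2] = {0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Pick a balanced 2-D process grid. This sketch assumes the grid
         * divides LAT and LON evenly, which holds for power-of-two core
         * counts such as 2048 or 4096. */
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Offset count[2] = { LAT / dims[0], LON / dims[1] };
        MPI_Offset start[2] = { (rank / dims[1]) * count[0],
                                (rank % dims[1]) * count[1] };

        float *buf = (float*) malloc(count[0] * count[1] * sizeof(float));
        for (MPI_Offset i = 0; i < count[0] * count[1]; i++)
            buf[i] = (float) rank;

        /* One file with one float variable here; the real test writes
         * 10 files with 50 variables each.  Error checking omitted. */
        ncmpi_create(MPI_COMM_WORLD, "patch_test.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "lat", LAT, &dimid[0]);
        ncmpi_def_dim(ncid, "lon", LON, &dimid[1]);
        ncmpi_def_var(ncid, "var0", NC_FLOAT, 2, dimid, &varid);
        ncmpi_enddef(ncid);

        /* Collective write: every rank writes its own start/count patch. */
        ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

        ncmpi_close(ncid);
        free(buf);
        MPI_Finalize();
        return 0;
    }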
The test program does not look suspicious to me, but I suggest checking
the return error codes of all PnetCDF calls.
Attached is a revised version with error checking.
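By error checking I mean wrapping every ncmpi_ call and aborting on any status other than NC_NOERR. One way to do this is a macro like the hypothetical sketch below (the attached file may do it differently):

    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    /* Abort with a readable message whenever a PnetCDF call fails. */
    #define CHECK_ERR(fn_call) do {                                   \
        int _err = (fn_call);                                         \
        if (_err != NC_NOERR) {                                       \
            fprintf(stderr, "Error at %s:%d: %s\n",                   \
                    __FILE__, __LINE__, ncmpi_strerror(_err));        \
            MPI_Abort(MPI_COMM_WORLD, -1);                            \
        }                                                             \
    } while (0)

    /* usage:
     *   CHECK_ERR(ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER,
     *                          MPI_INFO_NULL, &ncid));
     *   CHECK_ERR(ncmpi_close(ncid));
     */

This way a silent failure inside MPI-IO shows up as a PnetCDF error string instead of a hang or wrong answer later on.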
>> I tested both scaling the number of processes and Lustre stripe counts
>> with parallel-netcdf version 1.3.1.
>>
>> The test runs successfully at lfs stripe counts of 1, 64, and 100 in
>> all three MPIs.
>> The test runs successfully for 2048 cores in all three MPIs.
>> The test only runs successfully on mvapich2 for 4096 cores.
>
> Wei-keng has a lot more Lustre experience than I do, but he's on vacation in Taiwan right now. I'm sure he'll respond to you but it might not be with his customary low latency :>
If you like, you can build ROMIO as a stand-alone library and link it
to your program. I have been doing this to avoid problems that might
come from MPI-IO and to develop my own Lustre ADIO driver.
Wei-keng
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perform-test-pnetcdf_wkl.c
Type: application/octet-stream
Size: 2842 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20130805/bac3cab8/attachment.obj>
-------------- next part --------------
>> Kevin
>>
>>
>> On 8/5/13 7:50 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>
>>> On Mon, Aug 05, 2013 at 06:38:56AM -0600, Michalakes, John wrote:
>>>> Hi,
>>>>
>>>> We're running into problems running WRF with pNetCDF and it may have
>>>> something to do with which MPI implementation we use. Both Intel MPI
>>>> and OpenMPI fail (hang processes) on MPI task counts greater than 256.
>>>> Mvapich2 works, however. This is using the Lustre file system on a
>>>> Sandybridge Linux cluster here. Are you aware of any task limits
>>>> associated with MPI-IO in these implementations that might be causing
>>>> the problem? Any ideas for reconfiguring? There's a little more
>>>> information in the email stream below.
>>>
>>> "n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read:
>>> sk 77
>>> ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0
>>> R-0"
>>>
>>>
>>> Something timed out. Either the InfiniBand layer or the Lustre layer
>>> is -- rightly -- astonished that a request took 82 seconds.
>>>
>>> So, what's different about Intel MPI, OpenMPI and Mvapich2 with
>>> respect to Lustre? Can you give me version numbers for the three
>>> packages? I'm asking because we have over the years improved the
>>> Lustre driver in MPICH's ROMIO thanks to community contributions. I
>>> *thought* those changes had made it into the various "downstream" MPI
>>> implementations.
>>>
>>> MVAPICH2 at one point (and maybe still) had an alternate Lustre
>>> driver, which may explain why it performs well. As it turns out, when
>>> an MPI implementation pays attention to a file system, good things
>>> happen. Go figure!
>>>
>>> ==rob
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Regimbal, Kevin
>>>> Sent: Sunday, August 04, 2013 1:33 PM
>>>> To: Michalakes, John
>>>>
>>>> Hi John,
>>>>
>>>> I've been playing with parallel-netcdf this weekend. As far as I can
>>>> tell, pnetcdf does not work at large core counts (e.g. 4096) for either
>>>> Intel MPI or OpenMPI. It does work with mvapich2 at 4096 cores. I added
>>>> a build for mvapich2 and a pnetcdf that ties to Intel & mvapich2.
>>>>
>>>> It's probably going to take a while to track down why large core counts
>>>> don't work. Not sure if the issue is pnetcdf or MPI-IO on the other MPIs.
>>>>
>>>> Kevin
>>>> ________________________________________
>>>> From: Michalakes, John
>>>> Sent: Thursday, August 01, 2013 3:42 PM
>>>> Cc: Regimbal, Kevin
>>>>
>>>> [...]
>>>>
>>>> Regarding pNetCDF, this time the executable just hung reading the first
>>>> input file. I then did another run and made sure to put an
>>>> lfs setstripe -c 4 . command in the runscript. It hung again, but this
>>>> time at least one of the tasks output this strange message before
>>>> hanging:
>>>>
>>>> n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read: sk 77
>>>> ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0
>>>>
>>>> [...]
>>>>
>>>> I tried another run, this time copying the input data into the
>>>> directory instead of accessing it via a symlink. Then I saw this:
>>>>
>>>> n0240:c9e5:50cde700: 64137185 us(64137185 us!!!): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
>>>> n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
>>>> n0240:c9e5:50cde700: 64173162 us(35935 us): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
>>>> n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
>>>>
>>>
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>
>>