pNetCDF problem with WRF on large core counts
Wei-keng Liao
wkliao at ece.northwestern.edu
Mon Aug 5 17:59:32 CDT 2013
Hi, Kevin
>> OpenMPI version 1.6.4
>
> released in February of this year; its Lustre ADIO driver is synced up with MPICH's.
Actually, OpenMPI 1.6.4 has not completely caught up with the latest MPICH.
I found an error when testing a PnetCDF program. The fix can be
found at https://trac.mpich.org/projects/mpich/changeset/71d6dfe74c91596ed6e7119b18518757dcd285cd/src/mpi/romio/adio/include/adioi.h
>> MVapich2 version 1.8.1
>
> April 2012.
> I'll have to see what, if any, magic MVAPICH2 adds to (or maybe subtracts from?) MPICH's ROMIO.
I have never tried MVapich2 or Intel MPI, but
I have run 40K MPI processes using Cray's MPI without problems.
(The application is GCRM, the machine is Hopper @ NERSC, a Lustre file system with up to 156 OSTs.)
(Karen Schuchardt has run 80K processes there successfully.)
>> We're running lustre 1.8.9 clients and 2.1.3 on the MDS/OSSs
>>
>> I was running the perform-test-pnetcdf.c I found on this page:
>> http://software.intel.com/en-us/forums/topic/373166
>>
>
> The floating point error when the number of procs or the number of stripes is too large has been fixed in MPICH for nearly 3 years. I think that's enough time for these downstream implementations to pick up that change.
>
> The test looks pretty straightforward:
> - make ten files
> - each file has 50 variables
> - each variable is of type float
> - each variable has dimensions 4096 by 2048
> - each process writes a patch of size LAT/lat_procs by LON/lon_procs
>
> So, I guess when there are more than 4k processors the size of one or both dimensions must get pretty small. This is, after all, a strong scaling test. Shouldn't be small enough to break something, of course.
>
> So, how are 4k processors decomposed? Each one will move 8 KiB of data. Let's say 128 x 32 perhaps? That would result in each client doing 32x64 floats. Doesn't seem problematic from a pnetcdf standpoint. It will, though, be a pretty aggressive test of the MPI-IO layer.
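For reference, the access pattern Rob describes boils down to something like the minimal sketch below: one 4096 x 2048 float variable, a balanced 2-D process grid, and each rank writing its own start/count patch with PnetCDF's collective API. The file name, variable name, and grid choice here are illustrative only; this is not code copied from the attached test program, and error checking is omitted (see the macro further down).

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define LAT 4096
    #define LON 2048

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimid[2], varid;
        int dims[2] = {0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Pick a balanced 2-D process grid. This sketch assumes the grid
         * divides LAT and LON evenly, which holds for power-of-two core
         * counts such as 2048 or 4096. */
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Offset count[2] = { LAT / dims[0], LON / dims[1] };
        MPI_Offset start[2] = { (rank / dims[1]) * count[0],
                                (rank % dims[1]) * count[1] };

        float *buf = (float*) malloc(count[0] * count[1] * sizeof(float));
        for (MPI_Offset i = 0; i < count[0] * count[1]; i++)
            buf[i] = (float) rank;

        /* One file with one float variable here; the real test writes
         * 10 files with 50 variables each.  Error checking omitted. */
        ncmpi_create(MPI_COMM_WORLD, "patch_test.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "lat", LAT, &dimid[0]);
        ncmpi_def_dim(ncid, "lon", LON, &dimid[1]);
        ncmpi_def_var(ncid, "var0", NC_FLOAT, 2, dimid, &varid);
        ncmpi_enddef(ncid);

        /* Collective write: every rank writes its own start/count patch. */
        ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

        ncmpi_close(ncid);
        free(buf);
        MPI_Finalize();
        return 0;
    }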
The test program does not look suspicious to me, but I suggest checking
the return error codes of all PnetCDF calls.
Attached is a revised version with error checking.
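By error checking I mean wrapping every ncmpi_ call and aborting on any status other than NC_NOERR. One way to do this is a macro like the hypothetical sketch below (the attached file may do it differently):

    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    /* Abort with a readable message whenever a PnetCDF call fails. */
    #define CHECK_ERR(fn_call) do {                                   \
        int _err = (fn_call);                                         \
        if (_err != NC_NOERR) {                                       \
            fprintf(stderr, "Error at %s:%d: %s\n",                   \
                    __FILE__, __LINE__, ncmpi_strerror(_err));        \
            MPI_Abort(MPI_COMM_WORLD, -1);                            \
        }                                                             \
    } while (0)

    /* usage:
     *   CHECK_ERR(ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER,
     *                          MPI_INFO_NULL, &ncid));
     *   CHECK_ERR(ncmpi_close(ncid));
     */

This way a silent failure inside MPI-IO shows up as a PnetCDF error string instead of a hang or wrong answer later on.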
>> I tested both scaling the number of processes and Lustre stripe counts
>> with parallel-netcdf version 1.3.1.
>>
>> The test runs successfully at lfs stripe counts of 1, 64, and 100 in
>> all three MPIs.
>> The test runs successfully for 2048 cores in all three MPIs.
>> The test only runs successfully on mvapich2 for 4096 cores.
>
> Wei-keng has a lot more Lustre experience than I do, but he's on vacation in Taiwan right now. I'm sure he'll respond to you but it might not be with his customary low latency :>
If you like, you can build ROMIO as a stand-alone library and link it
to your program. I have been doing this to avoid problems that might
come from MPI-IO and to develop my own Lustre ADIO driver.
Wei-keng
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perform-test-pnetcdf_wkl.c
Type: application/octet-stream
Size: 2842 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/parallel-netcdf/attachments/20130805/bac3cab8/attachment.obj>
-------------- next part --------------
>> Kevin
>>
>>
>> On 8/5/13 7:50 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>>
>>> On Mon, Aug 05, 2013 at 06:38:56AM -0600, Michalakes, John wrote:
>>>> Hi,
>>>>
>>>> We're running into problems running WRF with pNetCDF and it may have
>>>> something to do with which MPI implementation we use. Both Intel MPI
>>>> and OpenMPI fail (hang processes) on MPI task counts greater than 256.
>>>> Mvapich2 works, however. This is using the Lustre file system on a
>>>> Sandybridge Linux cluster here. Are you aware of any task limits
>>>> associated with MPI-IO in these implementations that might be causing
>>>> the problem? Any ideas for reconfiguring? There's a little more
>>>> information in the email stream below.
>>>
>>> "n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read:
>>> sk 77
>>> ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0
>>> R-0"
>>>
>>>
>>> Something timed out. Either the InfiniBand layer or the Lustre layer
>>> is -- rightly -- astonished that a request took 82 seconds.
>>>
>>> So, what's different about Intel MPI, OpenMPI and Mvapich2 with
>>> respect to Lustre? Can you give me version numbers for the three
>>> packages? I'm asking because we have over the years improved the
>>> Lustre driver in MPICH's ROMIO thanks to community contributions. I
>>> *thought* those changes had made it into the various "downstream" MPI
>>> implementations.
>>>
>>> MVAPICH2 at one point (and maybe still) had an alternate Lustre
>>> driver, which may explain why it performs well. As it turns out, when
>>> an MPI implementation pays attention to a file system, good things
>>> happen. Go figure!
>>>
>>> ==rob
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Regimbal, Kevin
>>>> Sent: Sunday, August 04, 2013 1:33 PM
>>>> To: Michalakes, John
>>>>
>>>> Hi John,
>>>>
>>>> I've been playing with parallel-netcdf this weekend. As far as I can
>>>> tell, pnetcdf does not work at large core counts (e.g. 4096) for either
>>>> Intel MPI or OpenMPI. It does work with mvapich2 at 4096 cores. I added
>>>> a build for mvapich2 and a pnetcdf that ties to Intel & mvapich2.
>>>>
>>>> It's probably going to take a while to track down why large core counts
>>>> don't work. Not sure if the issue is pnetcdf or MPI-IO on the other MPIs.
>>>>
>>>> Kevin
>>>> ________________________________________
>>>> From: Michalakes, John
>>>> Sent: Thursday, August 01, 2013 3:42 PM
>>>> Cc: Regimbal, Kevin
>>>>
>>>> [...]
>>>>
>>>> Regarding pNetCDF, this time the executable just hung reading the first
>>>> input file. I then did another run and made sure to put an
>>>> lfs setstripe -c 4 . command in the runscript. It hung again, but this
>>>> time at least one of the tasks output this strange message before
>>>> hanging:
>>>>
>>>> n0224:c7f1:167b2700: 82084370 us(82084370 us!!!): CONN_RTU read: sk 77
>>>> ERR 0x68, rcnt=-1, v=7 -> 172.20.5.34 PORT L-bac4 R-e1c6 PID L-0 R-0
>>>>
>>>> [...]
>>>>
>>>> I tried another run, this time copying the input data into the
>>>> directory instead of accessing it via a symlink. Then I saw this:
>>>>
>>>> n0240:c9e5:50cde700: 64137185 us(64137185 us!!!): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.55 53644 - ABORTING 5
>>>> n0240:c9e5:50cde700: 64137227 us(42 us): dapl_evd_conn_cb() unknown event 0x0
>>>> n0240:c9e5:50cde700: 64173162 us(35935 us): CONN_REQUEST: SOCKOPT ERR No route to host -> 172.20.3.56 53631 - ABORTING 5
>>>> n0240:c9e5:50cde700: 64173192 us(30 us): dapl_evd_conn_cb() unknown event 0x0
>>>>
>>>
>>> --
>>> Rob Latham
>>> Mathematics and Computer Science Division
>>> Argonne National Lab, IL USA
>>
>>