Hints on improving performance with WRF and Pnetcdf

Craig Tierney Craig.Tierney at noaa.gov
Wed Sep 8 08:36:56 CDT 2010


On 9/7/10 12:18 PM, Wei-keng Liao wrote:
> Hi, Craig,
>
> Do you mean stripe size 16 or stripe count?
> Can you describe your I/O patterns? Are you using collective APIs only?
>
> I was not using pnetcdf, but MPI-IO directly. My test program is a 3D array
> block-block-block partitioning code modified from ROMIO's coll_perf.c.
> You can give it a try: /share/home/00531/tg457823/coll.c
>
> When I ran 1024 processes, it completed in 31 seconds (see output below), whereas
> when I used Ranger's mvapich2 natively, the job did not finish and was killed after 10 minutes.
>
> MPI hint: striping_factor            = 32
> MPI hint: striping_unit              = 1048576
> Local array size 100 x 100 x 100 integers, size = 3.81 MB
> Global array size 1600 x 1600 x 400 integers, write size = 38.15 GB
>   procs    Global array size  exec(sec)  write(MB/s)
> -------  ------------------  ---------  -----------
>   1024    1600 x 1600 x  400    31.04     1258.47
>
>
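For reference, below is a minimal sketch of that kind of block-block-block collective write with striping hints. It is not the actual coll.c referenced above; the process grid is chosen by MPI_Dims_create, the output file name is a placeholder, and error checking is omitted.

    /* Sketch of a 3D block-block-block collective write with Lustre
     * striping hints, in the spirit of ROMIO's coll_perf.c.  Not the
     * actual coll.c; the output file name is a placeholder and error
     * checking is omitted. */
    #include <stdlib.h>
    #include <mpi.h>

    #define NDIMS 3

    int main(int argc, char **argv)
    {
        int rank, nprocs, i, nelems;
        int psizes[NDIMS] = {0, 0, 0};        /* process grid, filled by MPI_Dims_create */
        int local[NDIMS]  = {100, 100, 100};  /* local block, as in the run above */
        int global[NDIMS], start[NDIMS], coords[NDIMS];
        int periods[NDIMS] = {0, 0, 0};
        int *buf;
        MPI_Comm cart;
        MPI_Datatype filetype;
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Build a 3D process grid and compute this rank's offsets in the
         * global array. */
        MPI_Dims_create(nprocs, NDIMS, psizes);
        MPI_Cart_create(MPI_COMM_WORLD, NDIMS, psizes, periods, 0, &cart);
        MPI_Cart_coords(cart, rank, NDIMS, coords);
        for (i = 0; i < NDIMS; i++) {
            global[i] = local[i] * psizes[i];
            start[i]  = local[i] * coords[i];
        }

        /* Describe the subarray this rank owns within the global array. */
        MPI_Type_create_subarray(NDIMS, global, local, start,
                                 MPI_ORDER_C, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        /* Striping hints; whether they are honored depends on the ADIO driver. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "32");
        MPI_Info_set(info, "striping_unit", "1048576");

        nelems = local[0] * local[1] * local[2];
        buf = (int *)malloc(nelems * sizeof(int));
        for (i = 0; i < nelems; i++) buf[i] = rank;

        MPI_File_open(cart, "testfile.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_File_write_all(fh, buf, nelems, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Info_free(&info);
        MPI_Type_free(&filetype);
        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }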

Thanks for the details.  I did mean stripe count (-c), not stripe size
(-s).  The default stripe size is 1 MB.

I am not sure what my I/O patterns are (I know I should be, but I am trying to
act like a normal user and not dig into these things).  I am using WRF,
which passes its I/O to PnetCDF and then on to ROMIO.  My concern is how
to get WRF and PnetCDF to do anything useful.
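
The hook that PnetCDF exposes for this is the MPI_Info argument of
ncmpi_create()/ncmpi_open(), which PnetCDF forwards to MPI_File_open()
underneath; getting WRF to fill in that argument is what the WRF_MPIIO_HINTS
patch discussed further down does. A minimal sketch, with the file name and
hint values as placeholders:

    /* Minimal sketch of passing MPI-IO hints through PnetCDF: the MPI_Info
     * argument of ncmpi_create() is forwarded to MPI_File_open() underneath.
     * The file name and hint values are placeholders. */
    #include <stdio.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int err, ncid;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "64");      /* Lustre stripe count */
        MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripe size    */

        err = ncmpi_create(MPI_COMM_WORLD, "wrfout_test.nc",
                           NC_CLOBBER | NC_64BIT_OFFSET, info, &ncid);
        if (err != NC_NOERR)
            fprintf(stderr, "ncmpi_create: %s\n", ncmpi_strerror(err));
        else
            ncmpi_close(ncid);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }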

Craig


> Wei-keng
>
> On Sep 7, 2010, at 12:37 PM, Craig Tierney wrote:
>
>> On 9/6/10 10:36 AM, Wei-keng Liao wrote:
>>> Gerry,
>>>
>>> I ran a 1024-PE job yesterday on Ranger using a stripe count of 32 without a problem.
>>> Lustre should not have any problem simply because a large stripe count is
>>> used. Do you use pnetcdf independent APIs in your program?
>>> If you are using collective APIs only, do you access variables partially
>>> (i.e., subarrays) or always entire variables? A large number of
>>> noncontiguous file accesses may flood the I/O servers and slow down the I/O
>>> performance, but that still should not shut down Lustre. Maybe Ranger's
>>> admins have a better answer on this.
>>>
>>> Wei-keng
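
To make the collective-versus-independent distinction above concrete, here is a
rough sketch of the two PnetCDF call styles for a subarray write. The dimension
size, names, and output file are made up, and error checking is omitted.

    /* Sketch contrasting collective and independent PnetCDF writes of a
     * subarray of one variable.  Names and sizes are illustrative only. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimid, varid, i;
        MPI_Offset start[1], count[1];
        float *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank owns a contiguous 100-element slice of a 1D variable. */
        count[0] = 100;
        start[0] = (MPI_Offset)rank * count[0];

        ncmpi_create(MPI_COMM_WORLD, "pattern_test.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * count[0], &dimid);
        ncmpi_def_var(ncid, "data", NC_FLOAT, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        buf = (float *)malloc(count[0] * sizeof(float));
        for (i = 0; i < count[0]; i++) buf[i] = (float)rank;

        /* Collective write: every rank participates, so ROMIO can merge the
         * per-rank requests into large contiguous accesses. */
        ncmpi_put_vara_float_all(ncid, varid, start, count, buf);

        /* Independent write of the same region: requires switching to
         * independent data mode; many small independent writes are the kind
         * of access that tends to hammer the I/O servers. */
        ncmpi_begin_indep_data(ncid);
        ncmpi_put_vara_float(ncid, varid, start, count, buf);
        ncmpi_end_indep_data(ncid);

        ncmpi_close(ncid);
        free(buf);
        MPI_Finalize();
        return 0;
    }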
>>>
>>
>> Wei-keng,
>>
>> Can you characterize how fast your IO is going when you are using pNetcdf?  I am now trying a stripe size of 16, and I am not seeing any improvement.
>>
>> Craig
>>
>>> On Sep 6, 2010, at 8:47 AM, Gerry Creager wrote:
>>>
>>>> Wei-keng
>>>>
>>>> Thanks. Useful information. I'll look at your ROMIO library later today (about to go into a meeting for the rest of the morning).  Last time I set the stripe count to something above 16, the rsl files were also "taking advantage" of it and shut down the LFS. Have you seen this, or do you address it in ROMIO?
>>>>
>>>> gerry
>>>>
>>>> Wei-keng Liao wrote:
>>>>> Hi, Gerry and Craig,
>>>>> I would like to provide my experience on Ranger.
>>>>> First, I agree with Rob that the most recent optimizations for the Lustre
>>>>> ADIO driver might not yet be installed on Ranger, because in my
>>>>> experiments on Ranger the MPI collective write performance is poor.
>>>>> I have built a ROMIO library with the recent Lustre optimizations
>>>>> in my home directory, and you are welcome to give it a try. Below is
>>>>> a usage example for the library:
>>>>> %  mpif90 myprogram.o -L/share/home/00531/tg457823/ROMIO/lib -lmpio
>>>>> Please note that this library is built using mvapich2 on Ranger. Run the
>>>>> command below before compiling/linking your programs.
>>>>> %  module load mvapich2
>>>>> I usually set the Lustre striping configuration for the output directory
>>>>> before I run applications. I use a 1 MB stripe size, stripe counts of 32,
>>>>> 64, or 128, and a stripe offset of -1. Since by default in Lustre all files
>>>>> created under a directory inherit the striping configuration of
>>>>> that directory, and my ROMIO build detects these striping configurations
>>>>> automatically, there is no need for me to set ROMIO hints in my programs.
>>>>> You can verify the striping configuration of a newly created file by
>>>>> this command, for example:
>>>>> % lfs getstripe -v /scratch/00531/tg457823/FS_1M_32/testfile.dat  | grep stripe
>>>>>   lmm_stripe_count:   32
>>>>>   lmm_stripe_size:    1048576
>>>>>   lmm_stripe_pattern: 1
>>>>> If you use pnetcdf collective I/O, I recommend giving my ROMIO library a try.
>>>>> Wei-keng
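
Besides running lfs getstripe on the output file, a plain MPI-IO test program
(such as coll.c) can also ask ROMIO which hints actually took effect by
querying the open file handle. A small sketch; the file name and the set of
keys printed are arbitrary:

    /* Sketch that prints the hints actually in effect on an open MPI-IO
     * file, to confirm whether striping settings were picked up from the
     * directory or from MPI_Info. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info_used;
        char value[MPI_MAX_INFO_VAL + 1];
        char *keys[] = { "striping_factor", "striping_unit",
                         "romio_cb_write", "cb_buffer_size" };
        int i, rank, flag, nkeys = 4;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_get_info(fh, &info_used);

        if (rank == 0) {
            for (i = 0; i < nkeys; i++) {
                MPI_Info_get(info_used, keys[i], MPI_MAX_INFO_VAL, value, &flag);
                if (flag) printf("%-16s = %s\n", keys[i], value);
            }
        }

        MPI_Info_free(&info_used);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }
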
>>>>> On Sep 5, 2010, at 10:28 AM, Craig Tierney wrote:
>>>>>> On 9/4/10 8:25 PM, Gerry Creager wrote:
>>>>>>> Rob Latham wrote:
>>>>>>>> On Thu, Sep 02, 2010 at 06:23:42PM -0600, Craig Tierney wrote:
>>>>>>>>> I did try setting the hints myself by changing the code, and performance
>>>>>>>>> still stinks (or is no faster). I was just looking for a way to avoid
>>>>>>>>> having to modify WRF, or, more importantly, having every user modify WRF.
>>>>>>>> What's going slowly?
>>>>>>>> If wrf is slowly writing record variables, you might want to try
>>>>>>>> disabling collective I/O or carefully selecting the intermediate
>>>>>>>> buffer to be as big as one record.
>>>>>>>>
>>>>>>>> That's the first place I'd look for bad performance.
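
In hint terms, those two knobs are the standard ROMIO hints romio_cb_write and
cb_buffer_size. A small sketch of building an info object that could then be
passed to ncmpi_create()/ncmpi_open(); the helper name and the 64 MB value are
only illustrative, and the two settings are alternatives rather than something
to combine blindly:

    #include <mpi.h>

    /* Hypothetical helper: Rob's two suggestions expressed as ROMIO hints. */
    static MPI_Info record_write_hints(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* Option 1: turn collective buffering off for writes entirely. */
        MPI_Info_set(info, "romio_cb_write", "disable");
        /* Option 2: keep it on, but make the intermediate (collective
         * buffering) buffer large enough to hold one full record;
         * ROMIO's default is 16 MB. */
        MPI_Info_set(info, "cb_buffer_size", "67108864");   /* 64 MB, example */
        return info;
    }
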
>>>>>>> Ah, but I'm seeing the same thing on Ranger (UTexas). I'm likely going
>>>>>>> to have to modify the WRF pnetcdf code to identify a sufficiently large
>>>>>>> stripe count (Lustre file system) to see any sort of real improvement.
>>>>>>>
>>>>>>> More to the point, I see worse performance than with normal Lustre and
>>>>>>> regular netcdf. AND, there's no way to set MPI-IO-HINTS in the SGE as
>>>>>>> configured on Ranger. We've tried and their systems folk concur, so it's
>>>>>>> not just me saying it.
>>>>>>>
>>>>>> What do you mean you can't?  How would you set it in another batch system?
>>>>>>
>>>>>>> I will look at setting the hints file up, but I don't think that's going
>>>>>>> to give me the equivalent of a stripe count of 64, which looks like the
>>>>>>> sweet spot for the domain I'm testing on.
>>>>>>>
>>>>>> So what hints are you passing, and is the key then to increase the number
>>>>>> of stripes for the directory?
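
One route that avoids touching WRF or the MPI launch line: ROMIO can read
hints from a plain-text file named by the ROMIO_HINTS environment variable,
one "hint value" pair per line, and applies them to every file it opens.
Whether that helps depends on the variable actually reaching the MPI processes
under SGE (which may be exactly the problem described above) and on the
installed ADIO driver honoring striping_factor, but the file itself would look
something like this (values illustrative), with ROMIO_HINTS exported in the
job script as the full path to the file:

    striping_factor 64
    striping_unit 1048576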
>>>>>>
>>>>>>> Craig, once I have time to get back onto this, I think we can convince
>>>>>>> NCAR to add this in a bug-fix release. I also anticipate the tweak will be
>>>>>>> on the order of 4-5 lines.
>>>>>>>
>>>>>> I already wrote code so that if you set the environment variable WRF_MPIIO_HINTS and list all the hints you want to set (comma delimited), the code in external/io_pnetcdf/wrf_IO.F90 will set the hints for you.  Once
>>>>>> I see that any of this actually helps, I will send the patch in for future use.
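
For anyone curious what that patch amounts to, here is a rough C sketch of the
same idea. The real code is Fortran in external/io_pnetcdf/wrf_IO.F90, and the
exact WRF_MPIIO_HINTS syntax (assumed here to be comma-separated key=value
pairs) is whatever Craig's patch defines:

    /* Sketch: turn a comma-delimited hints string from the environment into
     * an MPI_Info object.  The "key=value,key=value" format is an assumption
     * for illustration only. */
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    static MPI_Info hints_from_env(const char *envvar)
    {
        MPI_Info info;
        char *spec, *pair, *eq;
        const char *val = getenv(envvar);

        MPI_Info_create(&info);
        if (val == NULL) return info;            /* no hints requested */

        spec = (char *)malloc(strlen(val) + 1);  /* strtok modifies its input */
        strcpy(spec, val);
        for (pair = strtok(spec, ","); pair != NULL; pair = strtok(NULL, ",")) {
            eq = strchr(pair, '=');
            if (eq == NULL) continue;            /* skip malformed entries */
            *eq = '\0';
            MPI_Info_set(info, pair, eq + 1);    /* key, value */
        }
        free(spec);
        return info;
    }

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_Init(&argc, &argv);
        /* e.g. WRF_MPIIO_HINTS="striping_factor=64,striping_unit=1048576" */
        info = hints_from_env("WRF_MPIIO_HINTS");
        /* ...pass info to ncmpi_create()/ncmpi_open() here... */
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }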
>>>>>>
>>>>>> Craig
>>>>>>
>>>>
>>>>
>>>> --
>>>> Gerry Creager -- gerry.creager at tamu.edu
>>>> Texas Mesonet -- AATLT, Texas A&M University
>>>> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
>>>> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
>>>
>>
>


