Hints on improving performance with WRF and Pnetcdf

Wei-keng Liao wkliao at ece.northwestern.edu
Tue Sep 7 13:18:24 CDT 2010


Hi, Craig,

Do you mean a stripe size of 16 or a stripe count of 16?
Can you describe your I/O patterns? Are you using collective APIs only?

I was not using pnetcdf but MPI-IO directly. My test program is a 3D-array
code with block-block-block partitioning, modified from ROMIO's coll_perf.c.
You can give it a try: /share/home/00531/tg457823/coll.c

When I ran it on 1024 processes, it completed in 31 seconds (see the output below),
whereas with Ranger's native mvapich2 the job did not finish and was killed after 10 minutes.

MPI hint: striping_factor            = 32
MPI hint: striping_unit              = 1048576
Local array size 100 x 100 x 100 integers, size = 3.81 MB
Global array size 1600 x 1600 x 400 integers, write size = 38.15 GB
 procs    Global array size  exec(sec)  write(MB/s)
-------  ------------------  ---------  -----------
 1024    1600 x 1600 x  400    31.04     1258.47
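
For reference, the core of the test is the usual pattern of setting the striping
hints on an MPI_Info, defining a subarray filetype, and calling MPI_File_write_all.
A trimmed-down sketch (not the actual coll.c; the decomposition details here are
only illustrative):

  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, i, nelems;
      int psizes[3] = {0, 0, 0}, gsizes[3], lsizes[3], starts[3];
      int *buf;
      MPI_Info info;
      MPI_Datatype filetype;
      MPI_File fh;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      MPI_Dims_create(nprocs, 3, psizes);     /* 3D process grid */

      for (i = 0; i < 3; i++) {               /* 100^3 ints per process */
          lsizes[i] = 100;
          gsizes[i] = lsizes[i] * psizes[i];
      }
      starts[0] = lsizes[0] * (rank / (psizes[1] * psizes[2]));
      starts[1] = lsizes[1] * ((rank / psizes[2]) % psizes[1]);
      starts[2] = lsizes[2] * (rank % psizes[2]);

      /* the two striping hints shown in the output above */
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "32");
      MPI_Info_set(info, "striping_unit", "1048576");

      MPI_Type_create_subarray(3, gsizes, lsizes, starts, MPI_ORDER_C,
                               MPI_INT, &filetype);
      MPI_Type_commit(&filetype);

      nelems = lsizes[0] * lsizes[1] * lsizes[2];
      buf = (int *) malloc(nelems * sizeof(int));
      for (i = 0; i < nelems; i++) buf[i] = rank;

      MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
      MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
      MPI_File_write_all(fh, buf, nelems, MPI_INT, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      free(buf);
      MPI_Type_free(&filetype);
      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }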


Wei-keng

On Sep 7, 2010, at 12:37 PM, Craig Tierney wrote:

> On 9/6/10 10:36 AM, Wei-keng Liao wrote:
>> Gerry,
>> 
>> I ran a 1024-PE job yesterday on Ranger with a stripe count of 32 without a problem.
>> Lustre should not have any problem simply because of the use of a large
>> stripe count. Do you use pnetcdf independent APIs in your program?
>> If you are using collective APIs only, do you access variables partially
>> (i.e. subarrays) or always entire variables? A large number of
>> noncontiguous file accesses may flood the I/O servers and slow down the I/O
>> performance, but that still should not shut Lustre down. Maybe Ranger's
>> system administrators have a better answer on this.
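>> 
>> To clarify what I mean by a collective partial (subarray) write versus an
>> independent one, here is a sketch (ncid, varid, start, count, and buf are
>> illustrative; they would come from your own ncmpi_* setup calls):
>> 
>> #include <pnetcdf.h>
>> 
>> /* sketch only: write this process's piece of a 3D float variable */
>> void write_piece(int ncid, int varid,
>>                  MPI_Offset start[3], MPI_Offset count[3], float *buf)
>> {
>>     /* collective partial write: every process must participate */
>>     ncmpi_put_vara_float_all(ncid, varid, start, count, buf);
>> 
>>     /* independent equivalent, valid only inside independent data mode:
>>        ncmpi_begin_indep_data(ncid);
>>        ncmpi_put_vara_float(ncid, varid, start, count, buf);
>>        ncmpi_end_indep_data(ncid);                                  */
>> }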
>> 
>> Wei-keng
>> 
> 
> Wei-keng,
> 
> Can you characterize how much faster your I/O is going when you are using
> pnetcdf?  I am now trying a stripe size of 16, and I am not seeing any improvement.
> 
> Craig
> 
>> On Sep 6, 2010, at 8:47 AM, Gerry Creager wrote:
>> 
>>> Wei-keng
>>> 
>>> Thanks. Useful information. I'll look at your ROMIO library later today
>>> (about to go into a meeting for the rest of the morning). Last time I set
>>> the stripe count to something above 16, the rsl files also "took advantage"
>>> of that and shut down the LFS. Have you seen this, or do you address it in ROMIO?
>>> 
>>> gerry
>>> 
>>> Wei-keng Liao wrote:
>>>> Hi, Gerry and Craig,
>>>>
>>>> I would like to share my experience on Ranger.
>>>>
>>>> First, I agree with Rob that the most recent optimizations for the Lustre
>>>> ADIO driver might not yet be installed on Ranger; in my experiments on
>>>> Ranger, the MPI collective write performance is poor. I have built a ROMIO
>>>> library with the recent Lustre optimizations in my home directory, and you
>>>> are welcome to give it a try. Here is an example of linking against the
>>>> library:
>>>>
>>>> %  mpif90 myprogram.o -L/share/home/00531/tg457823/ROMIO/lib -lmpio
>>>>
>>>> Please note that this library is built with mvapich2 on Ranger, so run the
>>>> command below before compiling/linking your programs:
>>>>
>>>> %  module load mvapich2
>>>>
>>>> I usually set the Lustre striping configuration for the output directory
>>>> before I run applications: a 1MB stripe size, a stripe count of 32, 64, or
>>>> 128, and a stripe offset of -1. Since by default all files created under a
>>>> directory inherit that directory's striping configuration, and my ROMIO
>>>> build detects the striping configuration automatically, there is no need
>>>> for me to set ROMIO hints in my programs.
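>>>>
>>>> For example, to configure one of those output directories I run something
>>>> like this (older lfs option syntax; change -c for 64 or 128 stripes):
>>>>
>>>> %  lfs setstripe -s 1048576 -c 32 -i -1 /scratch/00531/tg457823/FS_1M_32
>>>>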
>>>> You can verify the striping configuration of a newly created file by
>>>> this command, for example:
>>>> % lfs getstripe -v /scratch/00531/tg457823/FS_1M_32/testfile.dat  | grep stripe
>>>>  lmm_stripe_count:   32
>>>>  lmm_stripe_size:    1048576
>>>>  lmm_stripe_pattern: 1
>>>>
>>>> If you use pnetcdf collective I/O, I recommend giving my ROMIO library a try.
>>>>
>>>> Wei-keng
>>>> On Sep 5, 2010, at 10:28 AM, Craig Tierney wrote:
>>>>> On 9/4/10 8:25 PM, Gerry Creager wrote:
>>>>>> Rob Latham wrote:
>>>>>>> On Thu, Sep 02, 2010 at 06:23:42PM -0600, Craig Tierney wrote:
>>>>>>>> I did try setting the hints myself by changing the code, and performance
>>>>>>>> still stinks (or is no faster). I was just looking for a way to avoid
>>>>>>>> modifying WRF myself, or, more importantly, to avoid having every user modify WRF.
>>>>>>> What's going slowly?
>>>>>>> If wrf is slowly writing record variables, you might want to try
>>>>>>> disabling collective I/O or carefully selecting the intermediate
>>>>>>> buffer to be as big as one record.
>>>>>>> 
>>>>>>> That's the first place I'd look for bad performance.
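>>>>>>> 
>>>>>>> In hint terms, that is roughly the following (a fragment; the buffer
>>>>>>> size value is only an example and should be chosen to match one WRF
>>>>>>> record):
>>>>>>> 
>>>>>>>   MPI_Info info;
>>>>>>>   MPI_Info_create(&info);
>>>>>>>   /* either turn off two-phase collective buffering for writes ... */
>>>>>>>   MPI_Info_set(info, "romio_cb_write", "disable");
>>>>>>>   /* ... or leave it on and size the intermediate buffer instead:
>>>>>>>      MPI_Info_set(info, "cb_buffer_size", "16777216");          */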
>>>>>> Ah, but I'm seeing the same thing on Ranger (UTexas). I'm likely going
>>>>>> to have to modify the WRF pnetcdf code to identify a sufficiently large
>>>>>> stripe count (Lustre file system) to see any sort of real improvement.
>>>>>> 
>>>>>> More to the point, I see worse performance than with normal Lustre and
>>>>>> regular netcdf. AND, there's no way to set MPI-IO-HINTS in the SGE as
>>>>>> configured on Ranger. We've tried and their systems folk concur, so it's
>>>>>> not just me saying it.
>>>>>> 
>>>>> What do you mean you can't?  How would you set it in another batch system?
>>>>> 
>>>>>> I will look at setting up the hints file, but I don't think that's going
>>>>>> to give me the equivalent of a stripe count of 64, which looks like the
>>>>>> sweet spot for the domain I'm testing on.
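>>>>>> 
>>>>>> As far as I understand it, the hints file is just the file named by the
>>>>>> ROMIO_HINTS environment variable, with one "hint value" pair per line,
>>>>>> for example:
>>>>>> 
>>>>>>   striping_factor 64
>>>>>>   striping_unit 1048576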
>>>>>> 
>>>>> So what hints are you passing, and is the key then to increase the number
>>>>> of stripes for the directory?
>>>>> 
>>>>>> Craig, once I have time to get back onto this, I think we can convince
>>>>>> NCAR to add this in a bug-fix release. I also anticipate the tweak will
>>>>>> be on the order of 4-5 lines.
>>>>>> 
>>>>> I already wrote code so that if you set the environment variable
>>>>> WRF_MPIIO_HINTS to a comma-delimited list of the hints you want to set,
>>>>> the code in external/io_pnetcdf/wrf_IO.F90 will set the hints for you.
>>>>> When I see that any of this actually helps, I will send the patch in for
>>>>> future use.
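>>>>> 
>>>>> The idea, sketched in C here (the actual patch is Fortran in wrf_IO.F90,
>>>>> and the helper name is made up), is just to split the variable on commas
>>>>> and feed each key=value pair to MPI_Info_set:
>>>>> 
>>>>> #include <mpi.h>
>>>>> #include <stdlib.h>
>>>>> #include <string.h>
>>>>> 
>>>>> /* e.g. WRF_MPIIO_HINTS="striping_factor=64,striping_unit=1048576" */
>>>>> static void hints_from_env(MPI_Info info)
>>>>> {
>>>>>     char *env = getenv("WRF_MPIIO_HINTS"), *copy, *pair, *eq;
>>>>>     if (env == NULL) return;
>>>>>     copy = strdup(env);
>>>>>     for (pair = strtok(copy, ","); pair != NULL; pair = strtok(NULL, ",")) {
>>>>>         eq = strchr(pair, '=');
>>>>>         if (eq == NULL) continue;   /* skip malformed entries */
>>>>>         *eq = '\0';                 /* split "key=value" in place */
>>>>>         MPI_Info_set(info, pair, eq + 1);
>>>>>     }
>>>>>     free(copy);
>>>>> }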
>>>>> 
>>>>> Craig
>>>>> 
>>> 
>>> 
>>> --
>>> Gerry Creager -- gerry.creager at tamu.edu
>>> Texas Mesonet -- AATLT, Texas A&M University
>>> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
>>> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
>> 
> 


