Hints on improving performance with WRF and Pnetcdf
Wei-keng Liao
wkliao at ece.northwestern.edu
Tue Sep 7 13:18:24 CDT 2010
Hi, Craig,
Do you mean a stripe size of 16 or a stripe count of 16?
Can you describe your I/O patterns? Are you using collective APIs only?
I was not using pnetcdf, but MPI-IO directly. My test program writes a 3D array
with a block-block-block partitioning; it is modified from ROMIO's coll_perf.c.
You can give it a try: /share/home/00531/tg457823/coll.c
When I ran it on 1024 processes, it completed in 31 seconds (see the output below),
whereas with Ranger's native mvapich2 the job did not finish and was killed after 10 minutes.
MPI hint: striping_factor = 32
MPI hint: striping_unit = 1048576
Local array size 100 x 100 x 100 integers, size = 3.81 MB
Global array size 1600 x 1600 x 400 integers, write size = 38.15 GB
   procs    Global array size     exec(sec)   write(MB/s)
  -------  -------------------   ----------  ------------
    1024    1600 x 1600 x 400         31.04       1258.47
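
In case it helps, the pattern is roughly the sketch below (this is not the
actual coll.c; the file name is a placeholder, the sizes simply mirror the
run above, and error checking is omitted):

/* sketch: set Lustre striping hints and do one collective write of a
 * block-block-block partitioned 3D integer array (16 x 16 x 4 ranks,
 * a 100^3 local block per rank; run with 1024 MPI processes) */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");      /* stripe count */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MB stripes */

    int gsizes[3] = {1600, 1600, 400};   /* global array  */
    int lsizes[3] = { 100,  100, 100};   /* local block   */
    int psizes[3] = {  16,   16,   4};   /* process grid  */
    int coords[3] = { rank / (psizes[1] * psizes[2]),
                     (rank /  psizes[2]) % psizes[1],
                      rank %  psizes[2] };
    int starts[3] = { coords[0] * lsizes[0],
                      coords[1] * lsizes[1],
                      coords[2] * lsizes[2] };

    /* file view: this rank's block of the global array */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    int nlocal = lsizes[0] * lsizes[1] * lsizes[2];
    int *buf = malloc(sizeof(int) * nlocal);
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, nlocal, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Type_free(&filetype);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}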
Wei-keng
On Sep 7, 2010, at 12:37 PM, Craig Tierney wrote:
> On 9/6/10 10:36 AM, Wei-keng Liao wrote:
>> Gerry,
>>
>> I ran a 1024-process job yesterday on Ranger with a stripe count of 32 without a problem.
>> Lustre should not have any problem simply because a large stripe count is used.
>> Do you use pnetcdf independent APIs in your program?
>> If you are using collective APIs only, do you access variables partially
>> (i.e., subarrays) or always entire variables? A large number of
>> noncontiguous file accesses may flood the I/O servers and slow down I/O
>> performance, but even that should not shut down Lustre. Maybe Ranger's
>> system administrators have a better answer on this.
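>>
>> (By partial access I mean something like the rough sketch below, where
>> each process collectively writes only its own subarray of a variable;
>> the file, variable, and dimension names are just placeholders, and
>> error checking is omitted.)
>>
>> /* sketch: each rank writes its own slab of a 2D integer variable
>>  * with one collective pnetcdf call; assumes ny divides by nprocs */
>> #include <stdlib.h>
>> #include <mpi.h>
>> #include <pnetcdf.h>
>>
>> void write_slab(MPI_Comm comm, MPI_Offset ny, MPI_Offset nx)
>> {
>>     int rank, nprocs, ncid, dimids[2], varid;
>>     MPI_Comm_rank(comm, &rank);
>>     MPI_Comm_size(comm, &nprocs);
>>
>>     ncmpi_create(comm, "testfile.nc", NC_CLOBBER|NC_64BIT_OFFSET,
>>                  MPI_INFO_NULL, &ncid);
>>     ncmpi_def_dim(ncid, "y", ny, &dimids[0]);
>>     ncmpi_def_dim(ncid, "x", nx, &dimids[1]);
>>     ncmpi_def_var(ncid, "var", NC_INT, 2, dimids, &varid);
>>     ncmpi_enddef(ncid);
>>
>>     /* 1D decomposition along y: each rank owns ny/nprocs rows */
>>     MPI_Offset start[2] = { rank * (ny / nprocs), 0 };
>>     MPI_Offset count[2] = { ny / nprocs, nx };
>>     int *buf = malloc(sizeof(int) * count[0] * count[1]);
>>     for (MPI_Offset i = 0; i < count[0] * count[1]; i++) buf[i] = rank;
>>
>>     /* collective write of just this rank's subarray of the variable */
>>     ncmpi_put_vara_int_all(ncid, varid, start, count, buf);
>>
>>     ncmpi_close(ncid);
>>     free(buf);
>> }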
>>
>> Wei-keng
>>
>
> Wei-keng,
>
> Can you characterize how much faster your I/O is going when you are using pnetcdf? I am now trying a stripe size of 16, and I am not seeing any improvement.
>
> Craig
>
>> On Sep 6, 2010, at 8:47 AM, Gerry Creager wrote:
>>
>>> Wei-keng
>>>
>>> Thanks, useful information. I'll look at your ROMIO library later today (I'm about to go into a meeting for the rest of the morning). The last time I set the stripe count to something above 16, the rsl files were also "taking advantage" of that and shut down the LFS. Have you seen this, or do you address it in ROMIO?
>>>
>>> gerry
>>>
>>> Wei-keng Liao wrote:
>>>> Hi, Gerry and Craig,
>>>> I would like to provide my experience on Ranger.
>>>> First, I agree with Rob that the most recent optimizations for the Lustre
>>>> ADIO driver might not yet be installed on Ranger; in my experiments on
>>>> Ranger, the MPI collective write performance is poor.
>>>> I have built a ROMIO library with the recent Lustre optimizations in my
>>>> home directory, and you are welcome to give it a try. Below is a usage
>>>> example for the library:
>>>> % mpif90 myprogram.o -L/share/home/00531/tg457823/ROMIO/lib -lmpio
>>>> Please note that this library is built with mvapich2 on Ranger. Run the
>>>> command below before compiling/linking your programs:
>>>> % module load mvapich2
>>>> I usually set the Lustre striping configuration for the output directory
>>>> before I run applications. I use a 1 MB stripe size; a stripe count of 32,
>>>> 64, or 128; and a stripe offset of -1. Since, by default, all files
>>>> created under a Lustre directory inherit the striping configuration of
>>>> that directory, and my ROMIO build detects these striping settings
>>>> automatically, there is no need for me to set ROMIO hints in my programs.
>>>> You can verify the striping configuration of a newly created file with
>>>> this command, for example:
>>>> % lfs getstripe -v /scratch/00531/tg457823/FS_1M_32/testfile.dat | grep stripe
>>>> lmm_stripe_count: 32
>>>> lmm_stripe_size: 1048576
>>>> lmm_stripe_pattern: 1
>>>> If you use pnetcdf collective I/O, I recommend giving my ROMIO library a try.
>>>> Wei-keng
>>>> On Sep 5, 2010, at 10:28 AM, Craig Tierney wrote:
>>>>> On 9/4/10 8:25 PM, Gerry Creager wrote:
>>>>>> Rob Latham wrote:
>>>>>>> On Thu, Sep 02, 2010 at 06:23:42PM -0600, Craig Tierney wrote:
>>>>>>>> I did try setting the hints myself by changing the code, and performance
>>>>>>>> still stinks (or at least is no faster). I was just looking for a way to
>>>>>>>> avoid modifying WRF myself, or, more importantly, to avoid having every
>>>>>>>> user modify WRF.
>>>>>>> What's going slowly?
>>>>>>> If WRF is writing record variables slowly, you might want to try
>>>>>>> disabling collective I/O or carefully sizing the intermediate
>>>>>>> buffer to be as big as one record.
>>>>>>>
>>>>>>> That's the first place I'd look for bad performance.
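>>>>>>>
>>>>>>> Both of those are ROMIO hints that can be passed in when the file is
>>>>>>> created, roughly like the sketch below (the buffer size shown is only
>>>>>>> illustrative, not a tuned value):
>>>>>>>
>>>>>>> /* sketch: pass ROMIO hints to pnetcdf at file-create time */
>>>>>>> #include <mpi.h>
>>>>>>> #include <pnetcdf.h>
>>>>>>>
>>>>>>> int create_with_hints(MPI_Comm comm, const char *path, int *ncid)
>>>>>>> {
>>>>>>>     MPI_Info info;
>>>>>>>     MPI_Info_create(&info);
>>>>>>>     /* either disable two-phase collective buffering for writes ... */
>>>>>>>     MPI_Info_set(info, "romio_cb_write", "disable");
>>>>>>>     /* ... or keep it and enlarge the intermediate (collective)
>>>>>>>      * buffer so it holds about one record, e.g. 64 MB:
>>>>>>>      * MPI_Info_set(info, "cb_buffer_size", "67108864");          */
>>>>>>>     int err = ncmpi_create(comm, path, NC_CLOBBER, info, ncid);
>>>>>>>     MPI_Info_free(&info);
>>>>>>>     return err;
>>>>>>> }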
>>>>>> Ah, but I'm seeing the same thing on Ranger (UTexas). I'm likely going
>>>>>> to have to modify the WRF pnetcdf code to identify a sufficiently large
>>>>>> stripe count (Lustre file system) to see any sort of real improvement.
>>>>>>
>>>>>> More to the point, I see worse performance than with normal Lustre and
>>>>>> regular netcdf. And there's no way to set MPI-IO hints in SGE as
>>>>>> configured on Ranger. We've tried, and their systems folk concur, so it's
>>>>>> not just me saying it.
>>>>>>
>>>>> What do you mean you can't? How would you set it in another batch system?
>>>>>
>>>>>> I will look at setting up the hints file, but I don't think that's going
>>>>>> to give me the equivalent of a stripe count of 64, which looks like the
>>>>>> sweet spot for the domain I'm testing on.
>>>>>>
>>>>> So what hints are you passing, and is the key then to increase the number
>>>>> of stripes for the directory?
>>>>>
>>>>>> Craig, once I have time to get back to this, I think we can convince
>>>>>> NCAR to add this in a bug-fix release. I also anticipate the tweak will be
>>>>>> on the order of 4-5 lines.
>>>>>>
>>>>> I already wrote code so that if you set the environment variable
>>>>> WRF_MPIIO_HINTS to a comma-delimited list of the hints you want, the code
>>>>> in external/io_pnetcdf/wrf_IO.F90 will set the hints for you. Once I see
>>>>> that any of this actually helps, I will send the patch in for future use.
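>>>>>
>>>>> The actual change is Fortran in wrf_IO.F90, but the idea is roughly the
>>>>> following sketch (shown in C for brevity; the helper name is only for
>>>>> illustration):
>>>>>
>>>>> /* sketch: parse WRF_MPIIO_HINTS="key1=val1,key2=val2,..." into an
>>>>>  * MPI_Info object that is later passed to the file open/create call */
>>>>> #include <stdlib.h>
>>>>> #include <string.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> void hints_from_env(MPI_Info *info)
>>>>> {
>>>>>     MPI_Info_create(info);
>>>>>     const char *env = getenv("WRF_MPIIO_HINTS");
>>>>>     if (env == NULL) return;              /* no hints requested */
>>>>>
>>>>>     char *copy = strdup(env);             /* strtok modifies its input */
>>>>>     for (char *pair = strtok(copy, ","); pair != NULL;
>>>>>          pair = strtok(NULL, ",")) {
>>>>>         char *eq = strchr(pair, '=');
>>>>>         if (eq == NULL) continue;         /* skip malformed entries */
>>>>>         *eq = '\0';
>>>>>         MPI_Info_set(*info, pair, eq + 1);   /* key, value */
>>>>>     }
>>>>>     free(copy);
>>>>> }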
>>>>>
>>>>> Craig
>>>>>
>>>
>>>
>>> --
>>> Gerry Creager -- gerry.creager at tamu.edu
>>> Texas Mesonet -- AATLT, Texas A&M University
>>> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
>>> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
>>
>