Hints on improving performance with WRF and Pnetcdf

Craig Tierney Craig.Tierney at noaa.gov
Tue Sep 7 12:58:51 CDT 2010


On 9/6/10 9:46 AM, Gerald Creager wrote:
> Craig Tierney wrote:
>> On 9/6/10 4:55 AM, Gerry Creager wrote:
>>> Craig Tierney wrote:
>>>> On 9/4/10 8:25 PM, Gerry Creager wrote:
>>>>> Rob Latham wrote:
>>>>>> On Thu, Sep 02, 2010 at 06:23:42PM -0600, Craig Tierney wrote:
>>>>>>> I did try setting the hints myself by changing the code, and
>>>>>>> performance
>>>>>>> still stinks (or is no faster). I was just looking for a way to not
>>>>>>> have to modify WRF, or more importantly have every user modify WRF.
>>>>>>
>>>>>> What's going slowly?
>>>>>> If wrf is slowly writing record variables, you might want to try
>>>>>> disabling collective I/O or carefully selecting the intermediate
>>>>>> buffer to be as big as one record.
>>>>>>
>>>>>> That's the first place I'd look for bad performance.
>>>>>
>>>>> Ah, but I'm seeing the same thing on Ranger (UTexas). I'm likely going
>>>>> to have to modify the WRF pnetcdf code to identify a sufficiently
>>>>> large
>>>>> stripe count (Lustre file system) to see any sort of real improvement.
>>>>>
>>>>> More to the point, I see worse performance than with normal Lustre and
>>>>> regular netcdf. AND, there's no way to set MPI-IO-HINTS in the SGE as
>>>>> configured on Ranger. We've tried and their systems folk concur, so
>>>>> it's
>>>>> not just me saying it.
>>>>>
>>>>
>>>> What do you mean you can't? How would you set it in another batch
>>>> system?
>>>
>>> Pretty much that. In SGE as installed at TACC, it doesn't pass anything.
>>> That's not to say it won't work with SGE, but not with SGE as installed
>>> at TACC.
>>>
>>
>> Still not clear. What can you pass to make this work? What doesn't SGE
>> pass? Are you saying there is an environment variables which can be
>> used to pass hints to the application but TACC doesn't support it? Why
>> can't you use -v, or put it in your batch script and tell mpirun to pass
>> the variable or put it on the mpirun command line when you pass it.
>
> That's exactly it, though. When I tried to pass the variables using -v,
> it, well, didn't pass them, and their systems folks confirmed that it
> doesn't support that. It's also possible they're less familiar with how
> pneetcdf and Lstre interact, since they elected to support HDF5 (and
> netcdf4, by extension), and felt that pnetcdf was thus unnecessary.
>
> The environment variables and hints I was referring to were from the
> Cray documentation.
>
>>>>> I will look at setting the hints file up but I don't think that's
>>>>> going
>>>>> to give me the equivalent of 64 stripe counts, which looks like the
>>>>> sweet spot for the domain I'm testing on.
>>>>>
>>>>
>>>> So what Hints are you passing and is then the key to increase the
>>>> number
>>>> of stripes for the directory?
>>>
>>> The key is stripe-count. BUT only for the wrfout files. I've tried
>>> changing the stripe-count on the directory, and that did improve
>>> performance transiently... until they killed my job and rebooted Ranger
>>> because the rsl.* files were ALSO being written with stripe-count=64,
>>> which had crashed their Lustre file system. Unintended Consequences has
>>> not been repealed.
>>>
>>
>> Is stripe-count a hint, or are you just setting it with lfs stripe -c
>> <stripe-count>. Why is it only for the wrfout files? Does it not help
>> the wrfrst files?
>
> It will help the restart files, too. What will kill performance is the
> rsl.* files which are written by each node.
>
> stripe-count is set on a file at creation. If we could create the rsl's
> and then reset their stripe-count to 1 (or 0, which results in the
> default) then all would be OK. Alternately, if we could set stripe count
> (I really want to see your code, as I suspect it'd allow some tweaking
> for this) at file creation time for wrfout and wrfrst files, we'd be
> just fine.
>
>> Why I would do to get around this, is I knew what files were going to be
>> created, I would create a separate subdirectory, change the
>> stripe-count on that directory, then create links of the files to be
>> created into that directory. When WRF tries to create the wrfout files,
>> then they get written to the directory that has a different stripe-count.
>
> I've tried that but been relatively unsuccessful in writing them to that
> directory. I'm probably doing something wrong with the redirection
> command. Been busy with user support issued and a gluster hardware and
> software failure so I've not spent the several hours it'd take to sort
> this all out.
>
>>>>> Craig, one I have time to get back on to this, I think we can convince
>>>>> NCAR to add this as a bug release. I also anticipate the tweak will be
>>>>> on the order of 4-5 lines.
>>>>>
>>>>
>>>> I already wrote code so that if you set the variable WRF_MPIIO_HINTS,
>>>> and list all the hints you want to set (comma delimited), then the
>>>> code in external/io_pnetcdf/wrf_IO.F90 will set the hints for you. When
>>>> I see that any of this actually helps I will send the patch in for
>>>> future use.
>>>
>>> Care to share?
>>>
>>> Thanks, gerry
>>
>> I will post tomorrow.
>
> Cool, THANKS!
>
> gerry
>

Here is the patch.  The patch applies to WRF v3.0.1.1 but I have looked
at the routines in later versions and it should apply cleanly to v.3.2. 
  It compiles and seems to do the right thing
because error codes come back clean, but I haven't actually gotten
any better performance out of the patch.

You can set all the variables you want to use with WRF_MPIIO_HINTS.
For example:

setenv WRF_MPIIO_HINTS romio_ds_read=disable,ind_wr_buffer_size=16777216

Then use your favorite method to pass variables through mpirun to
your executable.

Craig

--- external/io_pnetcdf/wrf_io.F90.orig	2010-09-02 23:08:21.000000000 +0000
+++ external/io_pnetcdf/wrf_io.F90	2010-09-07 17:12:17.000000000 +0000
@@ -921,6 +921,7 @@
    integer                                :: NumVars
    integer                                :: i
    character (NF_MAX_NAME)                :: Name
+  integer                                :: mpiio_hints,set_mpiio_hints

    if(WrfIOnotInitialized) then
      Status = WRF_IO_NOT_INITIALIZED
@@ -934,7 +935,16 @@
      call wrf_debug ( WARN , TRIM(msg))
      return
    endif
+
+!
+! CCT Attempted patch to make WRF/Lustre/Pnetcdf go faster
+!
+
+  mpiio_hints=set_mpiio_hints()
    stat = NFMPI_OPEN(Comm, FileName, NF_NOWRITE, MPI_INFO_NULL, DH%NCID)
+  call clear_mpiio_hints(mpiio_hints)
+
+
    call netcdf_err(stat,Status)
    if(Status /= WRF_NO_ERR) then
      write(msg,*) 'NetCDF error in ',__FILE__,', line', __LINE__
@@ -1179,6 +1189,8 @@
    integer                           :: stat
    character (7)                     :: Buffer
    integer                           :: VDimIDs(2)
+! CCT Patch for Pnetcdf/Lustre
+  integer                           :: mpiio_hints,set_mpiio_hints

    if(WrfIOnotInitialized) then
      Status = WRF_IO_NOT_INITIALIZED
@@ -1194,8 +1206,11 @@
    endif
    DH%TimeIndex = 0
    DH%Times     = ZeroDate
-!  stat = NFMPI_CREATE(Comm, FileName, NF_CLOBBER, MPI_INFO_NULL, DH%NCID)
+
+   mpiio_hints=set_mpiio_hints()
     stat = NFMPI_CREATE(Comm, FileName, IOR(NF_CLOBBER, 
NF_64BIT_OFFSET), MPI_INFO_NULL, DH%NCID)
+   call clear_mpiio_hints(mpiio_hints)
+
    call netcdf_err(stat,Status)
    if(Status /= WRF_NO_ERR) then
      write(msg,*) 'NetCDF error in ext_pnc_open_for_write_begin 
',__FILE__,', line', __LINE__
@@ -3412,3 +3427,95 @@

    return
  end subroutine ext_pnc_error_str
+
+
+function set_mpiio_hints
+  use wrf_data_pnc
+  implicit none
+
+  integer                           :: ierror,info, set_mpiio_hints
+  character*256                     :: hintsstr,Hkey,Hvalue,kvstr
+  integer                           :: pos,l,i,mpos,kpos,divide_string
+
+  call getenv("WRF_MPIIO_HINTS", hintsstr)
+  if (len(hintsstr) .ne. 0) then
+    call mpi_info_create(info, ierror)
+    if (ierror .ne. 0) then
+       write(msg,*) 'Error, Unable to create info structure for MPIIO 
hints: ',ierror
+       call wrf_debug(WARN, TRIM(msg))
+       set_mpiio_hints=info
+       return
+    endif
+    pos=1
+    l=len(trim(hintsstr))
+    do while (pos .le. l)
+       mpos=divide_string(',',hintsstr(pos:len(trim(hintsstr))))
+       if (mpos .gt. 0) then
+           kvstr=hintsstr(pos:pos+mpos-2)
+       else
+           mpos=l
+           kvstr=hintsstr(pos:mpos)
+       endif
+       kpos=divide_string('=',kvstr)
+       if (kpos .eq. 0) then
+         write(msg,*) 'WARNING: MPI-IO Key/Value pair not set 
correctly: ', kvstr(1:len(trim(kvstr)))
+         call wrf_debug(WARN, TRIM(msg))
+       else
+         Hkey=kvstr(1:kpos-1)
+         Hvalue=kvstr(kpos+1:)
+         write(msg,*) 'INFO: Found MPI-IO Hint: ', 
Hkey(1:len(trim(Hkey))),'/',Hvalue(1:len(trim(Hvalue)))
+         call wrf_debug(WARN, TRIM(msg))
+
+         call 
mpi_info_set(info,Hkey(1:len(trim(Hkey))),Hvalue(1:len(trim(Hvalue))),ierror)
+         if (ierror .ne. 0) then
+           write(msg,*) 'WARNING: Unable to set MPI-IO Hint: 
',Hkey(1:len(trim(Hkey))),'/',Hvalue(1:len(trim(Hvalue)))
+         endif
+       endif
+       pos=pos+mpos
+    enddo
+  else
+    set_mpiio_hints=0
+  endif
+
+  set_mpiio_hints=info
+  return
+end function set_mpiio_hints
+
+subroutine clear_mpiio_hints(mpiio_hints)
+
+  implicit none
+  integer mpiio_hints,ierror
+
+   if (mpiio_hints .ne. 0) then
+      call mpi_info_free(mpiio_hints, ierror)
+   endif
+
+end subroutine clear_mpiio_hints
+
+function divide_string(marker,str)
+
+! Find where the marker is within the string
+! returns its postion
+! return 0 if it is not found
+     character    :: marker
+     character(*) :: str
+     integer      :: divide_string
+     integer      :: pos
+     integer      :: l
+
+     pos=1
+     l=len(trim(str))
+
+     do while (pos .le. l)
+       if (str(pos:pos) .eq. marker) then
+         divide_string=pos
+         return
+       endif
+       pos=pos+1
+     enddo
+
+     divide_string=0
+
+     return
+end function divide_string
+






More information about the parallel-netcdf mailing list