Hints on improving performance with WRF and Pnetcdf

Craig Tierney Craig.Tierney at noaa.gov
Wed Sep 8 08:45:17 CDT 2010


Don't use the patch below.  I found an obvious problem: it sets
the hints but never passes them to NFMPI_Create/Open.  With that
fixed I am now having other issues, and I am trying to resolve
them.

Craig


On 9/7/10 11:58 AM, Craig Tierney wrote:
> On 9/6/10 9:46 AM, Gerald Creager wrote:
>> Craig Tierney wrote:
>>> On 9/6/10 4:55 AM, Gerry Creager wrote:
>>>> Craig Tierney wrote:
>>>>> On 9/4/10 8:25 PM, Gerry Creager wrote:
>>>>>> Rob Latham wrote:
>>>>>>> On Thu, Sep 02, 2010 at 06:23:42PM -0600, Craig Tierney wrote:
>>>>>>>> I did try setting the hints myself by changing the code, and
>>>>>>>> performance still stinks (or is no faster). I was just looking for
>>>>>>>> a way to not have to modify WRF, or more importantly have every
>>>>>>>> user modify WRF.
>>>>>>>
>>>>>>> What's going slowly?
>>>>>>> If wrf is slowly writing record variables, you might want to try
>>>>>>> disabling collective I/O or carefully selecting the intermediate
>>>>>>> buffer to be as big as one record.
>>>>>>>
>>>>>>> That's the first place I'd look for bad performance.
>>>>>>
>>>>>> Ah, but I'm seeing the same thing on Ranger (UTexas). I'm likely
>>>>>> going to have to modify the WRF pnetcdf code to identify a
>>>>>> sufficiently large stripe count (Lustre file system) to see any sort
>>>>>> of real improvement.
>>>>>>
>>>>>> More to the point, I see worse performance than with normal Lustre
>>>>>> and regular netcdf. AND, there's no way to set MPI-IO-HINTS in the
>>>>>> SGE as configured on Ranger. We've tried and their systems folk
>>>>>> concur, so it's not just me saying it.
>>>>>>
>>>>>
>>>>> What do you mean you can't? How would you set it in another batch
>>>>> system?
>>>>
>>>> Pretty much that. In SGE as installed at TACC, it doesn't pass
>>>> anything. That's not to say it won't work with SGE, but not with SGE
>>>> as installed at TACC.
>>>>
>>>
>>> Still not clear. What can you pass to make this work? What doesn't SGE
>>> pass? Are you saying there is an environment variable which can be
>>> used to pass hints to the application but TACC doesn't support it? Why
>>> can't you use -v, or put it in your batch script and tell mpirun to pass
>>> the variable, or put it on the mpirun command line?
>>
>> That's exactly it, though. When I tried to pass the variables using -v,
>> it, well, didn't pass them, and their systems folks confirmed that it
>> doesn't support that. It's also possible they're less familiar with how
>> pnetcdf and Lustre interact, since they elected to support HDF5 (and
>> netcdf4, by extension), and felt that pnetcdf was thus unnecessary.
>>
>> The environment variables and hints I was referring to were from the
>> Cray documentation.
>>
>>>>>> I will look at setting the hints file up but I don't think that's
>>>>>> going to give me the equivalent of a stripe count of 64, which looks
>>>>>> like the sweet spot for the domain I'm testing on.
>>>>>>
>>>>>
>>>>> So what hints are you passing, and is the key then to increase the
>>>>> number of stripes for the directory?
>>>>
>>>> The key is stripe-count. BUT only for the wrfout files. I've tried
>>>> changing the stripe-count on the directory, and that did improve
>>>> performance transiently... until they killed my job and rebooted Ranger
>>>> because the rsl.* files were ALSO being written with stripe-count=64,
>>>> which had crashed their Lustre file system. Unintended Consequences has
>>>> not been repealed.
>>>>
>>>
>>> Is stripe-count a hint, or are you just setting it with
>>> lfs setstripe -c <stripe-count>? Why is it only for the wrfout files?
>>> Does it not help the wrfrst files?
>>
>> It will help the restart files, too. What will kill performance is the
>> rsl.* files which are written by each node.
>>
>> stripe-count is set on a file at creation. If we could create the rsl's
>> and then reset their stripe-count to 1 (or 0, which results in the
>> default) then all would be OK. Alternately, if we could set stripe count
>> (I really want to see your code, as I suspect it'd allow some tweaking
>> for this) at file creation time for wrfout and wrfrst files, we'd be
>> just fine.
>>
>>> What I would do to get around this, if I knew what files were going to
>>> be created, is create a separate subdirectory, change the stripe-count
>>> on that directory, then create links pointing into that directory for
>>> the files to be created. When WRF tries to create the wrfout files,
>>> they then get written to the directory that has a different
>>> stripe-count.
>>
>> I've tried that but been relatively unsuccessful in writing them to that
>> directory. I'm probably doing something wrong with the redirection
>> command. Been busy with user support issues and a gluster hardware and
>> software failure so I've not spent the several hours it'd take to sort
>> this all out.
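
A minimal sketch of the subdirectory-and-links workaround discussed above,
offered as an illustration rather than a tested recipe. The directory name
striped_out, the wrfout file name, and the stripe count of 64 are assumptions
chosen for the example; the only Lustre command used is lfs setstripe -c.
Whether the create call issued through WRF and Pnetcdf actually follows a
dangling link is not confirmed anywhere in this thread.

```shell
#!/bin/bash
# Sketch only: striped_out, the wrfout name and the count of 64 are
# illustrative assumptions, not values taken from WRF.
run_dir=${RUN_DIR:-.}
striped_dir="$run_dir/striped_out"
mkdir -p "$striped_dir"

# New files inherit the default striping of the directory they are created
# in, so only this subdirectory gets the wide stripe; rsl.* files written
# in run_dir keep the file system default.
if command -v lfs >/dev/null 2>&1; then
    lfs setstripe -c 64 "$striped_dir"
else
    echo "lfs not found; skipping setstripe" >&2
fi

# Pre-create dangling symlinks for the output files WRF is expected to
# write, so that creating them lands in the striped directory.
for f in wrfout_d01_2010-09-07_00:00:00; do
    ln -sfn "striped_out/$f" "$run_dir/$f"
done
```

The links are relative, so the run directory can be moved as a whole. The
list of file names has to be known before the run, which is the main
limitation of this approach.
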
>>
>>>>>> Craig, once I have time to get back on to this, I think we can
>>>>>> convince NCAR to add this as a bug release. I also anticipate the
>>>>>> tweak will be on the order of 4-5 lines.
>>>>>>
>>>>>
>>>>> I already wrote code so that if you set the variable WRF_MPIIO_HINTS,
>>>>> and list all the hints you want to set (comma delimited), then the
>>>>> code in external/io_pnetcdf/wrf_io.F90 will set the hints for you.
>>>>> When I see that any of this actually helps I will send the patch in
>>>>> for future use.
>>>>
>>>> Care to share?
>>>>
>>>> Thanks, gerry
>>>
>>> I will post tomorrow.
>>
>> Cool, THANKS!
>>
>> gerry
>>
>
> Here is the patch. It applies to WRF v3.0.1.1, but I have looked at the
> routines in later versions and it should apply cleanly to v3.2.
> It compiles and seems to do the right thing, because error codes come
> back clean, but I haven't actually gotten any better performance out
> of the patch.
>
> You can set all the variables you want to use with WRF_MPIIO_HINTS.
> For example:
>
> setenv WRF_MPIIO_HINTS romio_ds_read=disable,ind_wr_buffer_size=16777216
>
> Then use your favorite method to pass variables through mpirun to
> your executable.
>
> Craig
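
A minimal sketch of a pre-launch check for the comma-delimited hint list,
assuming a Bourne-style shell rather than the csh used in the setenv line
above. The helper name check_wrf_hints is hypothetical and is not part of WRF
or of the patch; it only mirrors the key=value splitting the patch performs,
so that a malformed entry can be caught before the job is submitted.

```shell
#!/bin/bash
# check_wrf_hints is a hypothetical helper, not part of WRF or the patch.
# It splits a comma-delimited list and requires each entry to be key=value.
check_wrf_hints() {
    local hints=$1 pair status=0
    local old_ifs=$IFS
    set -f              # no filename expansion on values such as "*:1"
    IFS=,
    for pair in $hints; do
        case $pair in
            ?*=?*) echo "hint: ${pair%%=*} = ${pair#*=}" ;;
            *)     echo "malformed hint: $pair" >&2; status=1 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return $status
}

WRF_MPIIO_HINTS="romio_ds_read=disable,ind_wr_buffer_size=16777216"
export WRF_MPIIO_HINTS
check_wrf_hints "$WRF_MPIIO_HINTS"
```

With Open MPI, for example, the exported variable can then be forwarded to
the ranks with mpirun -x WRF_MPIIO_HINTS; other launchers use different
options, so check the local documentation.
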
>
> --- external/io_pnetcdf/wrf_io.F90.orig 2010-09-02 23:08:21.000000000 +0000
> +++ external/io_pnetcdf/wrf_io.F90 2010-09-07 17:12:17.000000000 +0000
> @@ -921,6 +921,7 @@
> integer :: NumVars
> integer :: i
> character (NF_MAX_NAME) :: Name
> + integer :: mpiio_hints,set_mpiio_hints
>
> if(WrfIOnotInitialized) then
> Status = WRF_IO_NOT_INITIALIZED
> @@ -934,7 +935,16 @@
> call wrf_debug ( WARN , TRIM(msg))
> return
> endif
> +
> +!
> +! CCT Attempted patch to make WRF/Lustre/Pnetcdf go faster
> +!
> +
> + mpiio_hints=set_mpiio_hints()
> stat = NFMPI_OPEN(Comm, FileName, NF_NOWRITE, MPI_INFO_NULL, DH%NCID)
> + call clear_mpiio_hints(mpiio_hints)
> +
> +
> call netcdf_err(stat,Status)
> if(Status /= WRF_NO_ERR) then
> write(msg,*) 'NetCDF error in ',__FILE__,', line', __LINE__
> @@ -1179,6 +1189,8 @@
> integer :: stat
> character (7) :: Buffer
> integer :: VDimIDs(2)
> +! CCT Patch for Pnetcdf/Lustre
> + integer :: mpiio_hints,set_mpiio_hints
>
> if(WrfIOnotInitialized) then
> Status = WRF_IO_NOT_INITIALIZED
> @@ -1194,8 +1206,11 @@
> endif
> DH%TimeIndex = 0
> DH%Times = ZeroDate
> -! stat = NFMPI_CREATE(Comm, FileName, NF_CLOBBER, MPI_INFO_NULL, DH%NCID)
> +
> + mpiio_hints=set_mpiio_hints()
> stat = NFMPI_CREATE(Comm, FileName, IOR(NF_CLOBBER, NF_64BIT_OFFSET), MPI_INFO_NULL, DH%NCID)
> + call clear_mpiio_hints(mpiio_hints)
> +
> call netcdf_err(stat,Status)
> if(Status /= WRF_NO_ERR) then
> write(msg,*) 'NetCDF error in ext_pnc_open_for_write_begin ',__FILE__,', line', __LINE__
> @@ -3412,3 +3427,95 @@
>
> return
> end subroutine ext_pnc_error_str
> +
> +
> +function set_mpiio_hints
> + use wrf_data_pnc
> + implicit none
> +
> + integer :: ierror,info, set_mpiio_hints
> + character*256 :: hintsstr,Hkey,Hvalue,kvstr
> + integer :: pos,l,i,mpos,kpos,divide_string
> +
> + call getenv("WRF_MPIIO_HINTS", hintsstr)
> + if (len(hintsstr) .ne. 0) then
> + call mpi_info_create(info, ierror)
> + if (ierror .ne. 0) then
> + write(msg,*) 'Error, Unable to create info structure for MPIIO hints: ',ierror
> + call wrf_debug(WARN, TRIM(msg))
> + set_mpiio_hints=info
> + return
> + endif
> + pos=1
> + l=len(trim(hintsstr))
> + do while (pos .le. l)
> + mpos=divide_string(',',hintsstr(pos:len(trim(hintsstr))))
> + if (mpos .gt. 0) then
> + kvstr=hintsstr(pos:pos+mpos-2)
> + else
> + mpos=l
> + kvstr=hintsstr(pos:mpos)
> + endif
> + kpos=divide_string('=',kvstr)
> + if (kpos .eq. 0) then
> + write(msg,*) 'WARNING: MPI-IO Key/Value pair not set correctly: ', kvstr(1:len(trim(kvstr)))
> + call wrf_debug(WARN, TRIM(msg))
> + else
> + Hkey=kvstr(1:kpos-1)
> + Hvalue=kvstr(kpos+1:)
> + write(msg,*) 'INFO: Found MPI-IO Hint: ', Hkey(1:len(trim(Hkey))),'/',Hvalue(1:len(trim(Hvalue)))
> + call wrf_debug(WARN, TRIM(msg))
> +
> + call mpi_info_set(info,Hkey(1:len(trim(Hkey))),Hvalue(1:len(trim(Hvalue))),ierror)
> + if (ierror .ne. 0) then
> + write(msg,*) 'WARNING: Unable to set MPI-IO Hint: ',Hkey(1:len(trim(Hkey))),'/',Hvalue(1:len(trim(Hvalue)))
> + endif
> + endif
> + pos=pos+mpos
> + enddo
> + else
> + set_mpiio_hints=0
> + endif
> +
> + set_mpiio_hints=info
> + return
> +end function set_mpiio_hints
> +
> +subroutine clear_mpiio_hints(mpiio_hints)
> +
> + implicit none
> + integer mpiio_hints,ierror
> +
> + if (mpiio_hints .ne. 0) then
> + call mpi_info_free(mpiio_hints, ierror)
> + endif
> +
> +end subroutine clear_mpiio_hints
> +
> +function divide_string(marker,str)
> +
> +! Find where the marker is within the string
> +! returns its position
> +! return 0 if it is not found
> + character :: marker
> + character(*) :: str
> + integer :: divide_string
> + integer :: pos
> + integer :: l
> +
> + pos=1
> + l=len(trim(str))
> +
> + do while (pos .le. l)
> + if (str(pos:pos) .eq. marker) then
> + divide_string=pos
> + return
> + endif
> + pos=pos+1
> + enddo
> +
> + divide_string=0
> +
> + return
> +end function divide_string
> +
>
>
>
>


