File locking failed in ADIOI_Set_lock

John Michalakes john at michalakes.us
Mon Sep 27 15:17:49 CDT 2010


  Thanks, all, for your responses and suggestions.  I have 
re-engineered the code to use only collective I/O, but I am still seeing 
the error I first wrote about:

File locking failed in ADIOI_Set_lock(fd 16,cmd F_SETLKW/7,type 
F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that 
the lockd daemon is running on all the machines, and mount the directory 
with the 'noac' option (no attribute caching).
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 65492, length 1980

So this probably isn't an effect of using the independent API after 
all (drat!).  I will try some of the other suggestions now, the first of 
which will be to upgrade to the latest pNetCDF on this machine.  I'm not 
sure I'll be able to switch over to MPICH2/ROMIO, but I'll look into that 
as well.
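
For anyone who finds this thread later, the re-engineering amounted to 
replacing the independent pnetcdf calls with their collective 
counterparts -- roughly the following, with made-up names rather than 
the actual WRF/PIO code:

    /* sketch only: declarations and error checking omitted */

    /* before: independent write, one rank at a time (requires
       switching into independent data mode first) */
    ncmpi_begin_indep_data(ncid);
    ncmpi_put_vara_float(ncid, varid, start, count, buf);
    ncmpi_end_indep_data(ncid);

    /* after: collective write in the default (collective) data
       mode; every rank that opened the file must make this call */
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);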

Thanks,

John


On 9/24/2010 9:53 AM, Wei-keng Liao wrote:
> Hi, John,
>
> Rob is right: turning off data sieving just avoids the error messages, and
> it may significantly slow down performance (if your I/O is non-contiguous
> and uses only non-collective calls).
>
> On Lustre, you need collective I/O to get high I/O bandwidth. Even if the
> write request from each process is already large and contiguous, non-collective
> writes will still give you poor results. This is not the case on other file
> systems, e.g. GPFS.
>
> Is there an option to turn on collective I/O in WRF and PIO?
>
> If non-collective I/O is the only option (due to the irregular data
> distribution), then non-blocking I/O is another solution. In
> pnetcdf 1.2.0, non-blocking I/O can aggregate multiple
> non-collective requests into a single collective one. However, this
> approach requires changes to the pnetcdf calls in the I/O library
> used by WRF and PIO. The changes should be very simple, though.
> In general, I would suggest that Lustre users seek every opportunity
> to use collective I/O.
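
[Translating the non-blocking suggestion above into code, for the 
archives -- a sketch of the pnetcdf 1.2.0 API, not the actual change 
PIO would need; the variable names are made up:]

    int reqs[3], statuses[3];
    /* post the requests; no file I/O happens yet */
    ncmpi_iput_vara_float(ncid, varid1, start1, count1, buf1, &reqs[0]);
    ncmpi_iput_vara_float(ncid, varid2, start2, count2, buf2, &reqs[1]);
    ncmpi_iput_vara_float(ncid, varid3, start3, count3, buf3, &reqs[2]);
    /* flush everything at once; ncmpi_wait_all is collective, so the
       library can service the queued requests as one collective write */
    ncmpi_wait_all(ncid, 3, reqs, statuses);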
>
> Wei-keng
>
> On Sep 24, 2010, at 9:44 AM, Rob Latham wrote:
>
>> On Fri, Sep 24, 2010 at 07:49:06AM -0600, Mark Taylor wrote:
>>> Hi John,
>>>
>>> I had a very similar issue a while ago on several older Lustre
>>> file systems at Sandia, and I can confirm that setting those hints did
>>> allow the code to run
>> If you turn off data sieving, then there will be no more lock calls.
>> Depending on how your application partitions the arrays, that could be
>> fine, or it could result in a billion 8-byte operations.
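
[For the archives: the hints are set on an MPI_Info object and passed 
in at create/open time. A minimal sketch, assuming the standard ROMIO 
hint names:]

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read",  "disable");
    MPI_Info_set(info, "romio_ds_write", "disable");
    /* hand the hints to pnetcdf when the file is created */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, info, &ncid);
    MPI_Info_free(&info);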
>>
>>> (but I could never get pnetcdf to be any faster
>>> than netcdf).
>> Unsurprising, honestly.  If you are dealing with Lustre, then you must
>> both use an updated ROMIO and use collective I/O.
>>
>> Here is the current list of MPI-IO implementations that work well with
>> Lustre:
>>
>> - Cray MPT 3.2 or newer
>> - MPICH2-1.3.0a1 or newer
>> - and that's it.
>>
>> I think the OpenMPI community is working on a re-sync with MPICH2's
>> ROMIO.  I also think we can stitch together a patch against OpenMPI if
>> you really need the improved Lustre driver.  I'm not really in
>> patch-generating mode right now, but maybe once I'm back in the office
>> I can see how tricky it will be.
>>
>>> This was with CAM, with pnetcdf being called by PIO, and
>>> PIO has a compiler option to turn this on, -DPIO_LUSTRE_HINTS.
>>>
>>> However, on Sandia's redsky (more-or-less identical to RedMesa), I just
>>> tried these hints and I am also getting those same error messages you
>>> are seeing. So please let me know if you get this resolved.
>> I can't think of any other code paths that use locking, unless your
>> system for some reason presents itself as NFS.
>>
>> That's why Rajeev suggested prefixing the file name with "lustre:".
>> Unfortunately, that won't help here: it has only been since March of this
>> year that (with community support) the Lustre driver in MPICH2 passed all
>> the ROMIO tests, and now we need to get that into OpenMPI.
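
[For the archives: the prefix is simply prepended to the file name so 
that ROMIO selects its Lustre driver explicitly instead of probing the 
file system type -- the path here is hypothetical:]

    ncmpi_create(MPI_COMM_WORLD, "lustre:/scratch/run/out.nc",
                 NC_CLOBBER, MPI_INFO_NULL, &ncid);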
>>
>> ==rob
>>
>> -- 
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>>
>
>

-- 
John Michalakes
National Renewable Energy Laboratory
1617 Cole Blvd.
Golden, Colorado 80401
Phone: 303-275-4297
Fax: 303-275-4091
John.Michalakes at nrel.gov



