File locking failed in ADIOI_Set_lock

John Michalakes john at michalakes.us
Fri Sep 24 07:59:36 CDT 2010


  Hi Wei-keng,

Thanks for the response.   By "independent" do you mean non-collective?  
If so, then yes, this version of the code is using the forms of the API 
that do not have "_ALL" at the end of the routine names.  The Sandia 
RedMesa system is not open to the public, unfortunately. It uses 
InfiniBand and runs Open MPI. I received the following response to 
Rajeev's suggestion from the RedMesa admins yesterday:

    Running a global "flock" mount option has the potential to
    introduce file system instability, and if this is a mandatory
    requirement for this code to run, I would recommend that we do
    this as part of the tail end of a system time.  If things look OK,
    then we can enable this as part of your production environment
    (it's a fairly easy change).  None of our core MPI codes use this
    feature, so we have little experience with it, but we are always
    willing to learn.

So they're willing to try this, but it will have to wait until the next 
"system time" (I assume that means the next maintenance window).  Does 
anyone in the pNetCDF group have experience using this option on 
Lustre?  Could it impact stability?
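For context, "flock" is a client-side Lustre mount option, so enabling it means remounting on the client nodes; a sketch of what the admins would run (server name and mount point below are placeholders, not taken from this thread):

```shell
# Mount a Lustre client with POSIX advisory locking (flock) enabled.
# "mds-server@o2ib:/scratch" and "/mnt/lustre" are hypothetical.
mount -t lustre -o flock mds-server@o2ib:/scratch /mnt/lustre

# Or, on an already-mounted client, remount in place:
mount -o remount,flock /mnt/lustre
```

With "flock" enabled, fcntl() byte-range locks (which ROMIO's ADIOI_Set_lock uses) are coherent across clients; without it they return ENOSYS, which is the errno 26 / "Function not implemented" seen in the original error.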

I will try the suggested MPI-IO hints. Thanks very much.
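For reference, a minimal sketch of how those hints (together with Rajeev's "lustre:" prefix) would be passed at file-create time; the output path is a placeholder, and this assumes an MPI + pnetcdf build environment:

```c
#include <mpi.h>
#include <pnetcdf.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int ncid, err;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);

    /* Wei-keng's suggested hints: disable ROMIO data sieving,
       which is what acquires the byte-range file locks that fail
       when Lustre is mounted without 'flock'. */
    MPI_Info_set(info, "romio_ds_write", "disable");
    MPI_Info_set(info, "romio_ds_read",  "disable");

    /* Rajeev's suggestion: a "lustre:" prefix selects ROMIO's
       Lustre driver explicitly.  The path is hypothetical. */
    err = ncmpi_create(MPI_COMM_WORLD, "lustre:/scratch/test.nc",
                       NC_CLOBBER, info, &ncid);
    if (err != NC_NOERR)
        fprintf(stderr, "ncmpi_create: %s\n", ncmpi_strerror(err));
    else
        ncmpi_close(ncid);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

In WRF the hints would be set wherever the info object passed to the pnetcdf open/create call is built, rather than in a standalone program like this.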

John

On 9/23/2010 7:31 PM, Wei-keng Liao wrote:
> Hi, John,
>
> In addition to the suggestion from Rajeev, could you try using the following MPI-IO hints?
>
> MPI_Info_set(info, "romio_ds_write", "disable");
> MPI_Info_set(info, "romio_ds_read",  "disable");
>
> One question, does your WRF call pnetcdf independent APIs? I suspect this error
> occurs in an independent API. Can you verify that?
>
> I could not access redmesa.sandia.gov (is it open to public?).
> Is it running IB and hence mvapich?
>
> Wei-keng
>
> On Sep 23, 2010, at 3:43 PM, Rajeev Thakur wrote:
>
>> Also, try prefixing the file name with "lustre:" if you can, and see if it makes any difference.
>>
>> Rajeev
>>
>>
>> On Sep 23, 2010, at 3:40 PM, Rajeev Thakur wrote:
>>
>>> Based on a user's feedback, we have recently updated that error message to say
>>>
>>> "If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option."
>>>
>>> Rajeev
>>>
>>>
>>>
>>> On Sep 23, 2010, at 3:23 PM, John Michalakes wrote:
>>>
>>>> Hi List,
>>>>
>>>> I'm running WRF on a large cluster with a Lustre file system (redmesa.sandia.gov) and am encountering the following error on one of the processes doing a write through pNetCDF over OpenMPI:
>>>>
>>>> File locking failed in ADIOI_Set_lock(fd 18,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
>>>> If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
>>>> ADIOI_Set_lock:: Function not implemented
>>>> ADIOI_Set_lock:offset 74588, length 804
>>>>
>>>> I Googled this a little before writing to the list, and the suggested workaround seems to be upgrading to NFS v3, but that's outside my purview, and I don't know whether it even applies to this Lustre system.  Have you seen this before?  Any thoughts or suggestions? Thanks,
>>>>
>>>> John
>>>>
>>>>
>>>> -- 
>>>> John Michalakes
>>>> National Renewable Energy Laboratory
>>>> 1617 Cole Blvd.
>>>> Golden, Colorado 80401
>>>> Phone: 303-275-4297
>>>> Fax: 303-275-4091
>>>>
>>>> John.Michalakes at nrel.gov
>>>>
>>>>
>>>>
>
>

-- 
John Michalakes
National Renewable Energy Laboratory
1617 Cole Blvd.
Golden, Colorado 80401
Phone: 303-275-4297
Fax: 303-275-4091
John.Michalakes at nrel.gov




More information about the parallel-netcdf mailing list