File locking failed in ADIOI_Set_lock

Mark Taylor mataylo at sandia.gov
Fri Sep 24 08:49:06 CDT 2010


Hi John,

I had a very similar issue a while ago on several older Lustre
filesystems at Sandia, and I can confirm that setting those hints did
allow the code to run (though I could never get pnetcdf to be any
faster than netcdf).  This was with CAM, with pnetcdf being called by
PIO; PIO has a compile-time option to turn this on, -DPIO_LUSTRE_HINTS.
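
For what it's worth, when PIO is not in the picture the same kind of
Lustre hints can be handed to pnetcdf directly through an MPI_Info
object.  A minimal sketch (the striping values and file name are
placeholders, and not necessarily what -DPIO_LUSTRE_HINTS sets):

    MPI_Info info;
    int ncid, err;

    MPI_Info_create(&info);
    /* standard MPI-IO striping hints, honored by ROMIO's Lustre driver */
    MPI_Info_set(info, "striping_factor", "16");      /* number of OSTs  */
    MPI_Info_set(info, "striping_unit",   "1048576"); /* stripe size (B) */

    err = ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, info, &ncid);
    MPI_Info_free(&info);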

However, on Sandia's redsky (more or less identical to RedMesa), I just
tried these hints and I am also getting the same error messages you
are seeing. So please let me know if you get this resolved.

Mark

On Fri, 2010-09-24 at 06:59 -0600, John Michalakes wrote:
> Hi Wei-keng,
> 
> Thanks for the response.  By "independent" do you mean
> non-collective?  If so, then yes, this version of the code is using
> the forms of the API that do not have "_ALL" at the end of the routine
> names.  The Sandia RedMesa system is not open, unfortunately.  It is
> InfiniBand and running Open MPI.  I received the following response to
> Rajeev's suggestion from the RedMesa admins yesterday:
>         Running a global "flock" mount option has the potential to
>         introduce file system instability. If this is a mandatory
>         requirement for this code to run, I would recommend that we do
>         this as part of the tail end of a system time. If things look OK,
>         then we can enable this as part of your production environment
>         (it's a fairly easy change). None of our core MPI codes use this
>         feature, so we have little experience with it, but we are always
>         willing to learn.
> So they're willing to try this, but it'll have to wait until the next
> "system time" (I assume that means the next maintenance period).  Does
> anyone in the pNetCDF group have experience using this option on
> Lustre?  Could it impact stability?
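> 
> One quick check that doesn't need a remount: the error comes from
> ROMIO's ADIOI_Set_lock, which issues a plain fcntl() byte-range lock
> (note the F_SETLKW/F_WRLCK in the message), so a small test program run
> against the Lustre mount should show whether locks work at all.  A
> sketch, with a placeholder path:
> 
>     /* lock_test.c: attempt the same kind of fcntl() lock that
>      * ROMIO's ADIOI_Set_lock uses */
>     #include <stdio.h>
>     #include <string.h>
>     #include <errno.h>
>     #include <fcntl.h>
>     #include <unistd.h>
> 
>     int main(void)
>     {
>         struct flock lk;
>         int fd = open("/lustre/scratch/lock_test.tmp",  /* placeholder */
>                       O_RDWR | O_CREAT, 0600);
>         if (fd < 0) { perror("open"); return 1; }
> 
>         memset(&lk, 0, sizeof(lk));
>         lk.l_type   = F_WRLCK;    /* write lock, as in the error */
>         lk.l_whence = SEEK_SET;
>         lk.l_start  = 0;
>         lk.l_len    = 1;
> 
>         if (fcntl(fd, F_SETLKW, &lk) < 0)
>             printf("lock failed: %s (errno %d)\n",
>                    strerror(errno), errno);
>         else
>             printf("lock acquired; locking looks OK\n");
>         close(fd);
>         return 0;
>     }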
> 
> I will try the suggested MPI-IO hints. Thanks very much. 
> 
> John
> 
> On 9/23/2010 7:31 PM, Wei-keng Liao wrote: 
> > Hi, John,
> > 
> > In addition to the suggestion from Rajeev, could you try using the following MPI-IO hints?
> > 
> > MPI_Info_set(info, "romio_ds_write", "disable");
> > MPI_Info_set(info, "romio_ds_read",  "disable");
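> > 
> > In context, that might look like this (a sketch; the file name and
> > create mode are placeholders):
> > 
> >     MPI_Info info;
> >     int ncid, err;
> > 
> >     MPI_Info_create(&info);
> >     MPI_Info_set(info, "romio_ds_write", "disable"); /* no data sieving on writes */
> >     MPI_Info_set(info, "romio_ds_read",  "disable"); /* no data sieving on reads  */
> > 
> >     /* the hints are passed through to MPI-IO at file creation */
> >     err = ncmpi_create(MPI_COMM_WORLD, "wrfout.nc",  /* placeholder */
> >                        NC_CLOBBER, info, &ncid);
> >     MPI_Info_free(&info);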
> > 
> > One question: does your WRF code call the PnetCDF independent APIs? I suspect this error
> > occurs in an independent API. Can you verify that?
> > 
> > I could not access redmesa.sandia.gov (is it open to the public?).
> > Is it running InfiniBand and hence MVAPICH?
> > 
> > Wei-keng
> > 
> > On Sep 23, 2010, at 3:43 PM, Rajeev Thakur wrote:
> > 
> > > Also try prefixing the file name with "lustre:" if you can, and see if it makes any difference.
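> > > 
> > > For example (a sketch; the path is a placeholder):
> > > 
> > >     /* the "lustre:" prefix makes ROMIO select its Lustre driver
> > >      * directly instead of probing the file system type */
> > >     err = ncmpi_create(MPI_COMM_WORLD, "lustre:/scratch/wrfout.nc",
> > >                        NC_CLOBBER, MPI_INFO_NULL, &ncid);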
> > > 
> > > Rajeev
> > > 
> > > 
> > > On Sep 23, 2010, at 3:40 PM, Rajeev Thakur wrote:
> > > 
> > > > Based on a user's feedback, we have recently updated that error message to say
> > > > 
> > > > "If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option."
> > > > 
> > > > Rajeev
> > > > 
> > > > 
> > > > 
> > > > On Sep 23, 2010, at 3:23 PM, John Michalakes wrote:
> > > > 
> > > > > Hi List,
> > > > > 
> > > > > I'm running WRF on a large cluster with a Lustre file system (redmesa.sandia.gov) and am encountering the following error on one of the processes doing a write through pNetCDF over Open MPI:
> > > > > 
> > > > > File locking failed in ADIOI_Set_lock(fd 18,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
> > > > > If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
> > > > > ADIOI_Set_lock:: Function not implemented
> > > > > ADIOI_Set_lock:offset 74588, length 804
> > > > > 
> > > > > I Googled this a little before writing to this list, and the workaround seems to be to upgrade to NFS v3, but that's outside my purview, and I don't know whether that's applicable to this Lustre system anyway.  Have you seen this before?  Any thoughts or suggestions? Thanks,
> > > > > 
> > > > > John  
> > > > > 
> > > > > 
> > > > > -- 
> > > > > John Michalakes
> > > > > National Renewable Energy Laboratory
> > > > > 1617 Cole Blvd.
> > > > > Golden, Colorado 80401
> > > > > Phone: 303-275-4297
> > > > > Fax: 303-275-4091
> > > > > 
> > > > > John.Michalakes at nrel.gov
> > > > > 
> > > > > 
> > > > > 
> > 
> > 
> 
> -- 
> John Michalakes
> National Renewable Energy Laboratory
> 1617 Cole Blvd.
> Golden, Colorado 80401
> Phone: 303-275-4297
> Fax: 303-275-4091
> John.Michalakes at nrel.gov
> 
