[parallel-netcdf] #21: File system locking error in testing

Wei-keng Liao wkliao at eecs.northwestern.edu
Mon Oct 31 10:01:34 CDT 2016


Hi, Luke

If the output Lustre folder is the same for the runs built with
Intel MPI and with OpenMPI, then I would say most likely the
Intel MPI installation is not configured correctly. I suggest you
report this error to your system admin, along with the simple MPI
program I provided earlier.

If you like, you can also post this to the MPICH discuss mailing
list: <discuss at mpich.org>. Rob Latham is the lead developer
of ROMIO (MPICH's MPI-IO component). He and others on the MPICH
team may provide more information.
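
In case it helps, below is a minimal, self-contained sketch (it is not
part of the PnetCDF tests) that prints which MPI library each process
was actually linked against and launched with. Running it under both
the Intel mpiexec and the OpenMPI mpiexec is a quick way to confirm
whether each launcher is really paired with the library used at
compile time. MPI_Get_library_version needs an MPI-3 library, which
both of the MPI versions mentioned in this thread should provide.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int  rank, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Reports the MPI library the executable is actually using,
         * e.g. an Intel MPI or Open MPI version string. */
        MPI_Get_library_version(version, &len);
        if (rank == 0) printf("MPI library: %s\n", version);

        MPI_Finalize();
        return 0;
    }

If the string printed under the Intel mpiexec is not an Intel MPI
version string, that points to a mixed launcher/library environment
rather than a PnetCDF or file-system problem.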

Wei-keng

On Oct 31, 2016, at 9:45 AM, Luke Van Roekel wrote:

> Wei-keng,
>   You were right about the mismatch.  With the fix, I now get the same ADIOI_Set_lock error as in my first submission.  With OpenMPI the program runs fine.
> 
> Regards,
> Luke
> 
> On Sun, Oct 30, 2016 at 10:28 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> Hi, Luke
> 
> The error message could be caused by using an mpiexec/mpirun that is
> not from the same build as the mpicc used to compile the MPI program.
> Could you check the path of mpiexec/mpirun to see whether it is in the
> same folder as the Intel mpicc? However, this does not seem to be related
> to the ADIOI_Set_lock problem you first reported. Do let me know once
> you get the above mpirun issue resolved, and then we can look into the
> lock problem.
> 
> Wei-keng
> 
> On Oct 28, 2016, at 10:23 PM, Luke Van Roekel wrote:
> 
> > Hello Wei-Keng,
> >   Sorry for the slow turnaround on this test.  Our computing resources have been down all week and just came back.  OpenMPI succeeded, but Intel MPI failed with the following error.
> >
> > [proxy:0:0 at gr1224.localdomain] HYD_pmcd_pmi_args_to_tokens (../../pm/pmiserv/common.c:276): assert (*count * sizeof(struct HYD_pmcd_token)) failed
> > [proxy:0:0 at gr1224.localdomain] fn_job_getid (../../pm/pmiserv/pmip_pmi_v2.c:253): unable to convert args to tokens
> > [proxy:0:0 at gr1224.localdomain] pmi_cb (../../pm/pmiserv/pmip_cb.c:806): PMI handler returned error
> > [proxy:0:0 at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> > [proxy:0:0 at gr1224.localdomain] main (../../pm/pmiserv/pmip.c:507): demux engine error waiting for event
> > [mpiexec at gr1224.localdomain] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host gr1224 failed
> > [mpiexec at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> > [mpiexec at gr1224.localdomain] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
> > [mpiexec at gr1224.localdomain] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
> >
> > Does this mean that our Intel MPI installation has an issue?
> > Regards,
> > Luke
> >
> > On Mon, Oct 24, 2016 at 11:02 PM, parallel-netcdf <parallel-netcdf at mcs.anl.gov> wrote:
> > #21: File system locking error in testing
> > --------------------------------------+-------------------------------------
> >  Reporter:  luke.vanroekel@…          |       Owner:  robl
> >      Type:  test error                |      Status:  new
> >  Priority:  major                     |   Milestone:
> > Component:  parallel-netcdf           |     Version:  1.7.0
> >  Keywords:                            |
> > --------------------------------------+-------------------------------------
> >
> > Comment(by wkliao):
> >
> >  Hi, Luke
> >
> >  We just resolved an issue with the Trac notification email settings
> >  today. I believe that from now on any update to the ticket you
> >  created should reach you through email.
> >
> >  I assume you ran the PnetCDF tests using Intel MPI and OpenMPI on the
> >  same machine, accessing the same Lustre file system. If this is the
> >  case, I am also puzzled. If OpenMPI works, then it implies the Lustre
> >  directory is mounted with the 'flock' option, which should have
> >  worked fine with Intel MPI as well. I suggest you try the simple
> >  MPI-IO program below. If the same problem occurs, then it is an
> >  MPI-IO problem. Let me know.
> >
> >  {{{
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <mpi.h>
> >
> >  int main(int argc, char **argv) {
> >      int            buf = 0, err;
> >      MPI_File       fh;
> >      MPI_Status     status;
> >
> >      MPI_Init(&argc, &argv);
> >      if (argc != 2) {
> >          printf("Usage: %s filename\n", argv[0]);
> >          MPI_Finalize();
> >          return 1;
> >      }
> >      err = MPI_File_open(MPI_COMM_WORLD, argv[1],
> >                          MPI_MODE_CREATE | MPI_MODE_RDWR,
> >                          MPI_INFO_NULL, &fh);
> >      if (err != MPI_SUCCESS) printf("Error: MPI_File_open()\n");
> >
> >      err = MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);
> >      if (err != MPI_SUCCESS) printf("Error: MPI_File_write_all()\n");
> >
> >      MPI_File_close(&fh);
> >      MPI_Finalize();
> >      return 0;
> >  }
> >  }}}
> >
> >  Wei-keng
> >
> >
> >  Replying to [ticket:24 luke.vanroekel@…]:
> >  > In trying to respond to the question raised about my ticket #21, I am
> >  > unable to do so. I don't see any reply option or a way to modify the
> >  > ticket. Sorry for raising another ticket, but I cannot figure out how
> >  > to respond to the previous question.
> >  >
> >  > In regards to the question in ticket #21, the flag is not set for
> >  > locking. My confusion is why Intel MPI requires file locking while
> >  > OpenMPI does not. Our HPC staff will not change settings on the mount.
> >  > Is it possible to work around the file-lock error?
> >  >
> >  > Regards, Luke
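> >
> >  Regarding the question above about working around the file-lock error:
> >  one thing that is sometimes worth trying, as a sketch only and with no
> >  guarantee it applies to your system, is to pass ROMIO hints that turn
> >  off data sieving, since the data-sieving write path is what usually
> >  needs fcntl() byte-range locks. For example, the MPI_File_open call in
> >  the test program above could be replaced with:
> >
> >  {{{
> >      MPI_Info info;
> >      MPI_Info_create(&info);
> >      /* ROMIO-specific hints: disable data sieving on writes/reads,
> >       * which is the code path that normally acquires file locks. */
> >      MPI_Info_set(info, "romio_ds_write", "disable");
> >      MPI_Info_set(info, "romio_ds_read",  "disable");
> >
> >      err = MPI_File_open(MPI_COMM_WORLD, argv[1],
> >                          MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
> >      MPI_Info_free(&info);
> >  }}}
> >
> >  Whether this avoids ADIOI_Set_lock depends on the access pattern, so it
> >  may or may not help in your case. The same hints can be passed to
> >  PnetCDF through the MPI_Info argument of ncmpi_create(). Having the
> >  Lustre directory mounted with 'flock' remains the proper fix.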
> >
> >
> >  Replying to [ticket:21 luke.vanroekel@…]:
> >  > Hello,
> >  >   I've been attempting to build parallel-netcdf for our local cluster
> >  > with gcc, Intel MPI 5.1.3, and netcdf 4.3.2.  The code compiles fine,
> >  > but when I run 'make check', nc_test fails with the following error:
> >  >
> >  > {{{
> >  > This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 3,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
> >  > - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
> >  > - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
> >  > ADIOI_Set_lock:: Function not implemented
> >  > ADIOI_Set_lock:offset 0, length 6076
> >  > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> >  >
> >  > }}}
> >  >
> >  > I am running this test on a parallel file system (Lustre).  I have
> >  > tested this with versions 1.5.0 up to the most current.  Any thoughts?
> >  > I can compile and test just fine with OpenMPI 1.10.3.
> >  >
> >  > Regards,
> >  > Luke
> >
> > --
> > Ticket URL: <http://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/21#comment:2>
> > parallel-netcdf <http://trac.mcs.anl.gov/projects/parallel-netcdf>
> >
> >
> 
> 


