[parallel-netcdf] #21: File system locking error in testing

Luke Van Roekel luke.vanroekel at gmail.com
Fri Oct 28 22:23:33 CDT 2016


Hello Wei-Keng,
  Sorry for the slow turnaround on this test.  Our computing resources
have been down all week and only just came back.  OpenMPI succeeded, but
Intel MPI failed with the following error:

[proxy:0:0 at gr1224.localdomain] HYD_pmcd_pmi_args_to_tokens (../../pm/pmiserv/common.c:276): assert (*count * sizeof(struct HYD_pmcd_token)) failed
[proxy:0:0 at gr1224.localdomain] fn_job_getid (../../pm/pmiserv/pmip_pmi_v2.c:253): unable to convert args to tokens
[proxy:0:0 at gr1224.localdomain] pmi_cb (../../pm/pmiserv/pmip_cb.c:806): PMI handler returned error
[proxy:0:0 at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at gr1224.localdomain] main (../../pm/pmiserv/pmip.c:507): demux engine error waiting for event
[mpiexec at gr1224.localdomain] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host gr1224 failed
[mpiexec at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at gr1224.localdomain] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec at gr1224.localdomain] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion


Does this mean that our Intel MPI implementation has an issue?
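
Separately, on the file-lock workaround question quoted below: one thing I am
considering trying, once the launcher problem is sorted out, is disabling
ROMIO's data sieving through MPI-IO hints.  My understanding is that the
fcntl() locking reported by ADIOI_Set_lock comes from the data-sieving path,
so this may (or may not) sidestep it; the hint names below are ROMIO-specific
and this is only a rough sketch:

{{{
#include <stdio.h>
#include <mpi.h>

/* Sketch: open a file with ROMIO data sieving disabled via MPI-IO hints,
 * then do one collective write, mirroring the test program in the ticket. */
int main(int argc, char **argv) {
    int        rank, buf;
    MPI_File   fh;
    MPI_Info   info;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (argc != 2) {
        if (rank == 0) printf("Usage: %s filename\n", argv[0]);
        MPI_Finalize();
        return 1;
    }

    /* ROMIO-specific hints; other MPI-IO layers may silently ignore them. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");
    MPI_Info_set(info, "romio_ds_read",  "disable");

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    buf = rank;
    MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
}}}

Since ncmpi_create() and ncmpi_open() also take an MPI_Info argument, the same
hints could be forwarded to PnetCDF if this turns out to help.  I have not yet
confirmed that it avoids the lock on our system.
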
Regards,
Luke
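
P.S. To double-check whether our Lustre mount actually supports the POSIX
locks that ADIOI_Set_lock asks for (the original error shows F_SETLKW with
F_WRLCK), I may also run a plain fcntl() test outside of MPI.  A rough
sketch, nothing more:

{{{
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take a whole-file write lock with fcntl(), the same operation
 * reported as failing in the ADIOI_Set_lock error message. */
int main(int argc, char **argv) {
    struct flock lock;
    int fd;

    if (argc != 2) {
        printf("Usage: %s filename\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    memset(&lock, 0, sizeof(lock));
    lock.l_type   = F_WRLCK;   /* exclusive write lock       */
    lock.l_whence = SEEK_SET;
    lock.l_start  = 0;
    lock.l_len    = 0;         /* length 0 = whole file      */

    if (fcntl(fd, F_SETLKW, &lock) == -1)
        printf("fcntl(F_SETLKW) failed: %s\n", strerror(errno));
    else
        printf("fcntl() write lock acquired; locking appears to work\n");

    lock.l_type = F_UNLCK;     /* release the lock before closing */
    fcntl(fd, F_SETLK, &lock);
    close(fd);
    return 0;
}
}}}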

On Mon, Oct 24, 2016 at 11:02 PM, parallel-netcdf <parallel-netcdf at mcs.anl.gov> wrote:

> #21: File system locking error in testing
> --------------------------------------+-------------------------------------
>  Reporter:  luke.vanroekel@…          |       Owner:  robl
>      Type:  test error                |      Status:  new
>  Priority:  major                     |   Milestone:
> Component:  parallel-netcdf           |     Version:  1.7.0
>  Keywords:                            |
> --------------------------------------+-------------------------------------
>
> Comment(by wkliao):
>
>  Hi, Luke
>
>  We just resolved an issue with the Trac notification email settings today.
>  I believe that from now on any update to the ticket you created should
>  reach you through email.
>
>  I assume you ran the PnetCDF tests using Intel MPI and OpenMPI on the same
>  machine, accessing the same Lustre file system. If that is the case, I am
>  also puzzled: if OpenMPI works, it implies the Lustre directory is mounted
>  with the 'flock' option, which should have worked fine with Intel MPI as
>  well. I suggest you try the simple MPI-IO program below. If the same
>  problem occurs, then it is an MPI-IO problem. Let me know.
>
>  {{{
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <mpi.h>
>
>  /* Minimal MPI-IO test: open a file and perform one collective write. */
>  int main(int argc, char **argv) {
>      int            buf = 0, err;
>      MPI_File       fh;
>      MPI_Status     status;
>
>      MPI_Init(&argc, &argv);
>      if (argc != 2) {
>          printf("Usage: %s filename\n", argv[0]);
>          MPI_Finalize();
>          return 1;
>      }
>
>      err = MPI_File_open(MPI_COMM_WORLD, argv[1],
>                          MPI_MODE_CREATE | MPI_MODE_RDWR,
>                          MPI_INFO_NULL, &fh);
>      if (err != MPI_SUCCESS) printf("Error: MPI_File_open()\n");
>
>      err = MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);
>      if (err != MPI_SUCCESS) printf("Error: MPI_File_write_all()\n");
>
>      MPI_File_close(&fh);
>      MPI_Finalize();
>      return 0;
>  }
>  }}}
>
>  Wei-keng
>
>
>  Replying to [ticket:24 luke.vanroekel@…]:
>  > In trying to respond to the question raised about my ticket #21, I am
>  > unable to do so: I don't see any reply or modify-ticket option. Sorry
>  > for raising another ticket, but I could not figure out how to respond
>  > to the previous question.
>  >
>  > Regarding the question in ticket #21, the flag is not set for locking.
>  > My confusion is why Intel MPI requires file locking while OpenMPI does
>  > not. Our HPC staff will not change settings on the mount. Is it
>  > possible to work around the file-lock error?
>  >
>  > Regards, Luke
>
>
>  Replying to [ticket:21 luke.vanroekel@…]:
>  > Hello,
>  >   I've been attempting to build parallel-netcdf for our local cluster
>  > with gcc, Intel MPI 5.1.3, and netcdf 4.3.2.  The code compiles fine,
>  > but when I run the "make check" tests, nc_test fails with the following
>  > error:
>  >
>  > {{{
>  > This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 3,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
>  > - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
>  > - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
>  > ADIOI_Set_lock:: Function not implemented
>  > ADIOI_Set_lock:offset 0, length 6076
>  > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>  > }}}
>  >
>  > I am running this test on a parallel file system (Lustre).  I have
>  > tested versions 1.5.0 up to the most current.  Any thoughts?  I can
>  > compile and test just fine with OpenMPI 1.10.3.
>  >
>  > Regards,
>  > Luke
>
> --
> Ticket URL: <http://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/21#comment:2>
> parallel-netcdf <http://trac.mcs.anl.gov/projects/parallel-netcdf>
>
>