[parallel-netcdf] #21: File system locking error in testing

Sun Oct 30 23:28:47 CDT 2016

Hi, Luke

The error message could be caused by using an mpiexec/mpirun that is
not from the same build as the mpicc used to compile the MPI program.
Could you check the path of mpiexec/mpirun to see whether it is in the
same folder as the Intel mpicc? However, this does not seem to be related
to the ADIOI_Set_lock problem you first reported. Please let me know
once the mpirun issue above is resolved, and then we can look at the lock
problem.

Wei-keng

On Oct 28, 2016, at 10:23 PM, Luke Van Roekel wrote:

> Hello Wei-Keng,
>   Sorry for the slow turnaround on this test.  Our computing resources have been down all week and just came back.  Openmpi succeeded, but intel-mpi failed with the following error.
> 
> [proxy:0:0 at gr1224.localdomain] HYD_pmcd_pmi_args_to_tokens (../../pm/pmiserv/common.c:276): assert (*count * sizeof(struct HYD_pmcd_token)) failed
> [proxy:0:0 at gr1224.localdomain] fn_job_getid (../../pm/pmiserv/pmip_pmi_v2.c:253): unable to convert args to tokens
> [proxy:0:0 at gr1224.localdomain] pmi_cb (../../pm/pmiserv/pmip_cb.c:806): PMI handler returned error
> [proxy:0:0 at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at gr1224.localdomain] main (../../pm/pmiserv/pmip.c:507): demux engine error waiting for event
> [mpiexec at gr1224.localdomain] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host gr1224 failed
> [mpiexec at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec at gr1224.localdomain] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
> [mpiexec at gr1224.localdomain] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
> 
> Does this mean that our intel-mpi implementation has an issue(s)?
> Regards,
> Luke
> 
> On Mon, Oct 24, 2016 at 11:02 PM, parallel-netcdf <parallel-netcdf at mcs.anl.gov> wrote:
> #21: File system locking error in testing
> --------------------------------------+-------------------------------------
>  Reporter:  luke.vanroekel@…          |       Owner:  robl
>      Type:  test error                |      Status:  new
>  Priority:  major                     |   Milestone:
> Component:  parallel-netcdf           |     Version:  1.7.0
>  Keywords:                            |
> --------------------------------------+-------------------------------------
> 
> Comment(by wkliao):
> 
>  Hi, Luke
> 
>  We just resolved an issue with the trac notification email settings today.
>  I believe that from now on any update to the ticket you created should
>  reach you through email.
> 
>  I assume you ran the PnetCDF tests using Intel MPI and OpenMPI on the same
>  machine, accessing the same Lustre file system. If this is the case, I am
>  also puzzled. If OpenMPI works, then it implies the Lustre directory is
>  mounted with the 'flock' option, which should have worked fine with Intel
>  MPI. I would suggest you try the simple MPI-IO program below. If the same
>  problem occurs, then it is an MPI-IO problem. Let me know.
> 
>  {{{
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <mpi.h>
> 
>  int main(int argc, char **argv) {
>      int            buf = 0, err;   /* initialize the value to be written */
>      MPI_File       fh;
>      MPI_Status     status;
> 
>      MPI_Init(&argc, &argv);
>      if (argc != 2) {
>          printf("Usage: %s filename\n", argv[0]);
>          MPI_Finalize();
>          return 1;
>      }
>      /* collectively create/open the file for reading and writing */
>      err = MPI_File_open(MPI_COMM_WORLD, argv[1],
>                          MPI_MODE_CREATE | MPI_MODE_RDWR,
>                          MPI_INFO_NULL, &fh);
>      if (err != MPI_SUCCESS) {
>          /* do not continue with an invalid file handle */
>          printf("Error: MPI_File_open()\n");
>          MPI_Finalize();
>          return 1;
>      }
> 
>      /* every process writes one int at offset 0 */
>      err = MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);
>      if (err != MPI_SUCCESS) printf("Error: MPI_File_write_all()\n");
> 
>      MPI_File_close(&fh);
>      MPI_Finalize();
>      return 0;
>  }
>  }}}
> 
>  Wei-keng
> 
> 
>  Replying to [ticket:24 luke.vanroekel@…]:
>  > In trying to respond to the question raised about my ticket #21, I am
>  > unable to do so. I don't see any reply option or modify ticket. Sorry for
>  > raising another ticket, but I cannot figure out how to respond to the
>  > previous question.
>  >
>  > In regards to the question in Ticket 21, the flag is not set for locking.
>  > My confusion is why intel mpi requires file locking while openmpi does
>  > not. Our hpc staff will not change settings on the mount. Is it possible
>  > to work around the file-lock error?
>  >
>  > Regards, Luke
> 
> 
>  Replying to [ticket:21 luke.vanroekel@…]:
>  > Hello,
>  >   I've been attempting to build parallel-netcdf for our local cluster
>  > with gcc and intel-mpi 5.1.3 and netcdf 4.3.2.  The code compiles fine,
>  > but when I run make check testing, nc_test fails with the following error
>  >
>  >
>  > {{{
>  > This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 3,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
>  > - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
>  > - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
>  > ADIOI_Set_lock:: Function not implemented
>  > ADIOI_Set_lock:offset 0, length 6076
>  > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>  >
>  > }}}
>  >
>  > I am running this test on a parallel file system (lustre).  I have
>  > tested this in versions 1.5.0 up to the most current.  Any thoughts?  I
>  > can compile and test just fine with openmpi 1.10.3.
>  >
>  > Regards,
>  > Luke
> 
> --
> Ticket URL: <http://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/21#comment:2>
> parallel-netcdf <http://trac.mcs.anl.gov/projects/parallel-netcdf>
> 
>