[parallel-netcdf] #21: File system locking error in testing
Wei-keng Liao
wkliao at eecs.northwestern.edu
Mon Oct 31 10:01:34 CDT 2016
Hi, Luke
If the output Lustre folder is the same for both runs (the Intel MPI
build and the OpenMPI build), then I would say most likely the Intel
MPI configuration is not done correctly. I suggest you report this
error to your system admin, along with the simple MPI program I
provided earlier.
If you like, you can also post this to the MPICH discuss mailing
list: <discuss at mpich.org>. Rob Latham is the lead developer
of ROMIO (MPICH's MPI-IO component). He and others on the MPICH team
may provide more information.
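If it helps when talking to the admin, one quick check is whether the Lustre mount actually carries the 'flock' option that ADIOI's byte-range locking needs. A sketch (the mount point and device names in the comments are examples, not your site's actual paths):

```shell
# Hypothetical check: list Lustre entries in /proc/mounts and look for
# the 'flock' mount option. A line might look like:
#   mds@tcp:/lfs /lustre lustre rw,flock,lazystatfs 0 0
grep ' lustre ' /proc/mounts || true
if grep ' lustre ' /proc/mounts | grep -q flock; then
    echo "flock is set"
else
    echo "flock is NOT set (ADIOI byte-range locking will fail)"
fi
```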
Wei-keng
On Oct 31, 2016, at 9:45 AM, Luke Van Roekel wrote:
> Wei-keng,
> You were right about the mismatch. With the fix, I now get the same ADIOI_Set_lock error as in my first submission. With OpenMPI the program runs fine.
>
> Regards,
> Luke
>
> On Sun, Oct 30, 2016 at 10:28 PM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:
> Hi, Luke
>
> The error message could be caused by using an mpiexec/mpirun that is
> not from the same build as the mpicc used to compile the MPI program.
> Could you check the path of mpiexec/mpirun to see whether it is in the
> same folder as the Intel mpicc? However, this does not seem to be
> related to the ADIOI_Set_lock problem you first reported. Do let me
> know once you get the mpirun issue resolved, and then we can look at
> the lock problem.
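One way to check this (a sketch; the Intel install paths in the test comments are hypothetical) is to compare the directories the two wrappers resolve from:

```shell
# Compare the install directories of mpicc and mpiexec; with a consistent
# MPI setup both should resolve inside the same bin/ directory.
# (The paths printed depend entirely on your environment.)
mpicc_path=$(command -v mpicc || echo /not/found/mpicc)
mpiexec_path=$(command -v mpiexec || echo /not/found/mpiexec)
echo "mpicc:   $mpicc_path"
echo "mpiexec: $mpiexec_path"
if [ "$(dirname "$mpicc_path")" = "$(dirname "$mpiexec_path")" ]; then
    echo "same directory"
else
    echo "DIFFERENT directories -- likely a mixed MPI installation"
fi
```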
>
> Wei-keng
>
> On Oct 28, 2016, at 10:23 PM, Luke Van Roekel wrote:
>
> > Hello Wei-Keng,
> > Sorry for the slow turnaround on this test. Our computing resources have been down all week and just came back. OpenMPI succeeded, but Intel MPI failed with the following error.
> >
> > [proxy:0:0 at gr1224.localdomain] HYD_pmcd_pmi_args_to_tokens (../../pm/pmiserv/common.c:276): assert (*count * sizeof(struct HYD_pmcd_token)) failed
> > [proxy:0:0 at gr1224.localdomain] fn_job_getid (../../pm/pmiserv/pmip_pmi_v2.c:253): unable to convert args to tokens
> > [proxy:0:0 at gr1224.localdomain] pmi_cb (../../pm/pmiserv/pmip_cb.c:806): PMI handler returned error
> > [proxy:0:0 at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> > [proxy:0:0 at gr1224.localdomain] main (../../pm/pmiserv/pmip.c:507): demux engine error waiting for event
> > [mpiexec at gr1224.localdomain] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host gr1224 failed
> > [mpiexec at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> > [mpiexec at gr1224.localdomain] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
> > [mpiexec at gr1224.localdomain] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
> >
> > Does this mean that our Intel MPI installation has an issue?
> > Regards,
> > Luke
> >
> > On Mon, Oct 24, 2016 at 11:02 PM, parallel-netcdf <parallel-netcdf at mcs.anl.gov> wrote:
> > #21: File system locking error in testing
> > --------------------------------------+-------------------------------------
> > Reporter: luke.vanroekel@… | Owner: robl
> > Type: test error | Status: new
> > Priority: major | Milestone:
> > Component: parallel-netcdf | Version: 1.7.0
> > Keywords: |
> > --------------------------------------+-------------------------------------
> >
> > Comment(by wkliao):
> >
> > Hi, Luke
> >
> > We just resolved an issue with the trac notification email settings
> > today. I believe that from now on any update to the ticket you
> > created should reach you through email.
> >
> > I assume you ran the PnetCDF tests using Intel MPI and OpenMPI on the
> > same machine, accessing the same Lustre file system. If this is the
> > case, I am also puzzled. If OpenMPI works, then it implies the Lustre
> > directory is mounted with the 'flock' option, which should have
> > worked fine with Intel MPI too. I suggest you try the simple MPI-IO
> > program below. If the same problem occurs, then it is an MPI-IO
> > problem. Let me know.
> >
> > {{{
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> >
> > int main(int argc, char **argv) {
> >     int buf = 0, err;  /* initialize buf so a defined value is written */
> >     MPI_File fh;
> >     MPI_Status status;
> >
> >     MPI_Init(&argc, &argv);
> >     if (argc != 2) {
> >         printf("Usage: %s filename\n", argv[0]);
> >         MPI_Finalize();
> >         return 1;
> >     }
> >     err = MPI_File_open(MPI_COMM_WORLD, argv[1],
> >                         MPI_MODE_CREATE | MPI_MODE_RDWR,
> >                         MPI_INFO_NULL, &fh);
> >     if (err != MPI_SUCCESS) printf("Error: MPI_File_open()\n");
> >
> >     err = MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);
> >     if (err != MPI_SUCCESS) printf("Error: MPI_File_write_all()\n");
> >
> >     MPI_File_close(&fh);
> >     MPI_Finalize();
> >     return 0;
> > }
> > }}}
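For reference, a build-and-run sequence for the program above might look like the following. The file name and Lustre path are assumptions, and the guard only avoids an error on machines without an MPI compiler in PATH:

```shell
# Hypothetical usage: save the program above as mpiio_lock_test.c, then
# build it with the Intel wrappers and run it on the Lustre directory
# used for the PnetCDF tests.
if command -v mpicc >/dev/null 2>&1; then
    mpicc -o mpiio_lock_test mpiio_lock_test.c
    mpiexec -n 4 ./mpiio_lock_test /lustre/scratch/testfile
else
    echo "mpicc not found in PATH"
fi
```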
> >
> > Wei-keng
> >
> >
> > Replying to [ticket:24 luke.vanroekel@…]:
> > > In trying to respond to the question raised about my ticket #21, I
> > > am unable to do so. I don't see any reply or modify-ticket option.
> > > Sorry for raising another ticket, but I cannot figure out how to
> > > respond to the previous question.
> > >
> > > In regards to the question in Ticket 21, the flag is not set for
> > > locking. My confusion is why Intel MPI requires file locking while
> > > OpenMPI does not. Our HPC staff will not change settings on the
> > > mount. Is it possible to work around the file-lock error?
> > >
> > > Regards, Luke
> >
> >
> > Replying to [ticket:21 luke.vanroekel@…]:
> > > Hello,
> > > I've been attempting to build parallel-netcdf for our local cluster
> > > with gcc, intel-mpi 5.1.3, and netcdf 4.3.2. The code compiles
> > > fine, but when I run the "make check" tests, nc_test fails with the
> > > following error:
> > >
> > > {{{
> > > This requires fcntl(2) to be implemented. As of 8/25/2011 it is not.
> > > Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 3,cmd
> > > F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and
> > > errno 26.
> > > - If the file system is NFS, you need to use NFS version 3, ensure
> > > that the lockd daemon is running on all the machines, and mount the
> > > directory with the 'noac' option (no attribute caching).
> > > - If the file system is LUSTRE, ensure that the directory is mounted
> > > with the 'flock' option.
> > > ADIOI_Set_lock:: Function not implemented
> > > ADIOI_Set_lock:offset 0, length 6076
> > > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > > }}}
> > >
> > > I am running this test on a parallel file system (Lustre). I have
> > > tested versions 1.5.0 up to the most current. Any thoughts? I can
> > > compile and test just fine with OpenMPI 1.10.3.
> > >
> > > Regards,
> > > Luke
> >
> > --
> > Ticket URL: <http://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/21#comment:2>
> > parallel-netcdf <http://trac.mcs.anl.gov/projects/parallel-netcdf>
> >
More information about the parallel-netcdf mailing list