[parallel-netcdf] #21: File system locking error in testing
Wei-keng Liao
wkliao at eecs.northwestern.edu
Sun Oct 30 23:28:47 CDT 2016
Hi, Luke
The error message could be caused by using an mpiexec/mpirun that is
not from the same build as the mpicc used to compile the MPI program.
Could you check the path of mpiexec/mpirun to see whether it is in the
same folder as the Intel mpicc? However, this does not seem to be related
to the ADIOI_Set_lock problem you first reported. Do let me know
if you get the above mpirun issue resolved, and then we can check the lock
problem afterwards.
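
For the lock problem itself, a minimal sketch in plain C (no MPI needed) may help narrow things down. It tries the same kind of call named in the ADIOI_Set_lock message, fcntl() with F_SETLKW and F_WRLCK, on a file you choose. The helper name probe_fcntl_lock and the PROBE_MAIN guard are made up for this illustration and are not part of ROMIO or PnetCDF. A success here does not guarantee that MPI-IO will work, but a failure with "Function not implemented" would point at the mount options rather than at the MPI library.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of ROMIO or PnetCDF): take and release a
 * blocking write lock on the first byte of `path`.
 * Returns 0 on success, or the errno value of the call that failed. */
int probe_fcntl_lock(const char *path)
{
    struct flock lk;
    int fd, saved = 0;

    fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return errno;

    memset(&lk, 0, sizeof(lk));
    lk.l_type   = F_WRLCK;   /* write lock, as in the reported message */
    lk.l_whence = SEEK_SET;  /* offsets are relative to the file start */
    lk.l_start  = 0;
    lk.l_len    = 1;

    if (fcntl(fd, F_SETLKW, &lk) == -1) {
        saved = errno;       /* e.g. ENOSYS if locking is unsupported */
    } else {
        lk.l_type = F_UNLCK; /* release the lock again */
        if (fcntl(fd, F_SETLK, &lk) == -1) saved = errno;
    }
    close(fd);
    return saved;
}

#ifdef PROBE_MAIN
int main(int argc, char **argv)
{
    int rc;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s filename\n", argv[0]);
        return 2;
    }
    rc = probe_fcntl_lock(argv[1]);
    if (rc == 0)
        printf("fcntl lock succeeded on %s\n", argv[1]);
    else
        printf("fcntl lock failed on %s: %s (errno %d)\n",
               argv[1], strerror(rc), rc);
    return rc != 0;
}
#endif
```

To build it as a standalone program, compile with -DPROBE_MAIN (for example, cc -DPROBE_MAIN probe.c -o probe, where probe.c is whatever you name the file) and pass it a file name inside the Lustre directory used for the PnetCDF tests.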
Wei-keng
On Oct 28, 2016, at 10:23 PM, Luke Van Roekel wrote:
> Hello Wei-Keng,
> Sorry for the slow turnaround on this test. Our computing resources were down all week and have just come back. OpenMPI succeeded, but Intel MPI failed with the following error.
>
> [proxy:0:0 at gr1224.localdomain] HYD_pmcd_pmi_args_to_tokens (../../pm/pmiserv/common.c:276): assert (*count * sizeof(struct HYD_pmcd_token)) failed
> [proxy:0:0 at gr1224.localdomain] fn_job_getid (../../pm/pmiserv/pmip_pmi_v2.c:253): unable to convert args to tokens
> [proxy:0:0 at gr1224.localdomain] pmi_cb (../../pm/pmiserv/pmip_cb.c:806): PMI handler returned error
> [proxy:0:0 at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at gr1224.localdomain] main (../../pm/pmiserv/pmip.c:507): demux engine error waiting for event
> [mpiexec at gr1224.localdomain] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 0 at host gr1224 failed
> [mpiexec at gr1224.localdomain] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec at gr1224.localdomain] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
> [mpiexec at gr1224.localdomain] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
>
> Does this mean that our intel-mpi implementation has an issue(s)?
> Regards,
> Luke
>
> On Mon, Oct 24, 2016 at 11:02 PM, parallel-netcdf <parallel-netcdf at mcs.anl.gov> wrote:
> #21: File system locking error in testing
> --------------------------------------+-------------------------------------
> Reporter: luke.vanroekel@… | Owner: robl
> Type: test error | Status: new
> Priority: major | Milestone:
> Component: parallel-netcdf | Version: 1.7.0
> Keywords: |
> --------------------------------------+-------------------------------------
>
> Comment(by wkliao):
>
> Hi, Luke
>
> We just resolved an issue with the Trac notification email settings today.
> I believe that from now on any update to the ticket you created should
> reach you through email.
>
> I assume you ran the PnetCDF tests using Intel MPI and OpenMPI on the same
> machine, accessing the same Lustre file system. If this is the case, I am
> also puzzled. If OpenMPI works, then it implies the Lustre directory is
> mounted with the 'flock' option, which should have worked fine with
> Intel MPI. I would suggest you try the simple MPI-IO program below. If the
> same problem occurs, then it is an MPI-IO problem. Let me know.
>
> {{{
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv) {
>     int buf = 0, err;   /* initialize the value that gets written */
>     MPI_File fh;
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     if (argc != 2) {
>         printf("Usage: %s filename\n", argv[0]);
>         MPI_Finalize();
>         return 1;
>     }
>     err = MPI_File_open(MPI_COMM_WORLD, argv[1],
>                         MPI_MODE_CREATE | MPI_MODE_RDWR,
>                         MPI_INFO_NULL, &fh);
>     if (err != MPI_SUCCESS) {
>         /* fh is not valid if the open failed, so stop here */
>         printf("Error: MPI_File_open()\n");
>         MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>
>     err = MPI_File_write_all(fh, &buf, 1, MPI_INT, &status);
>     if (err != MPI_SUCCESS) printf("Error: MPI_File_write_all()\n");
>
>     MPI_File_close(&fh);
>     MPI_Finalize();
>     return 0;
> }
> }}}
>
> Wei-keng
>
>
> Replying to [ticket:24 luke.vanroekel@…]:
> > I am unable to respond to the question raised about my ticket #21. I
> > don't see any option to reply to or modify the ticket. Sorry for raising
> > another ticket, but I cannot figure out how to respond to the previous
> > question.
> >
> > Regarding the question in ticket #21, the flag is not set for locking.
> > My confusion is why Intel MPI requires file locking while OpenMPI does
> > not. Our HPC staff will not change the settings on the mount. Is it
> > possible to work around the file-lock error?
> >
> > Regards, Luke
>
>
> Replying to [ticket:21 luke.vanroekel@…]:
> > Hello,
> > I've been attempting to build parallel-netcdf for our local cluster
> > with gcc and intel-mpi 5.1.3 and netcdf 4.3.2. The code compiles fine,
> > but when I run make check testing, nc_test fails with the following error
> >
> >
> > {{{
> > This requires fcntl(2) to be implemented. As of 8/25/2011 it is not.
> > Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 3,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
> > - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
> > - If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
> > ADIOI_Set_lock:: Function not implemented
> > ADIOI_Set_lock:offset 0, length 6076
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> >
> > }}}
> >
> > I am running this test on a parallel file system (Lustre). I have
> > tested this in versions 1.5.0 up to the most current. Any thoughts? I
> > can compile and test just fine with OpenMPI 1.10.3.
> >
> > Regards,
> > Luke
>
> --
> Ticket URL: <http://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/21#comment:2>
> parallel-netcdf <http://trac.mcs.anl.gov/projects/parallel-netcdf>
>
>