Problem opening a file under OpenMPI
Wei-keng Liao
wkliao at eecs.northwestern.edu
Fri Jul 10 14:00:13 CDT 2015
Hi, John
Can you try the following Fortran MPI program to see if you can create a file?
Please build it with the same OpenMPI 1.6.3 compiler wrappers, and maybe also try a different file path.
% cat mpi_open.f
      program mpi_open
      implicit none
      include "mpif.h"
      character(LEN=MPI_MAX_ERROR_STRING) err_string
      integer err, ierr, err_len, errorclass, fp, omode

      call MPI_INIT(err)

!     create the file directly with MPI-IO, bypassing PnetCDF
      omode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      call MPI_File_open(MPI_COMM_WORLD, 'testfile_d01', omode,
     +                   MPI_INFO_NULL, fp, err)
      if (err .NE. MPI_SUCCESS) then
          call MPI_Error_class(err, errorclass, ierr)
          call MPI_Error_string(err, err_string, err_len, ierr)
          print*, 'Error: MPI_File_open() ', trim(err_string)
      else
          call MPI_File_close(fp, err)
      endif

      call MPI_Finalize(err)
      end
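To build and run it on two separate nodes, something along these lines should work (adjust the wrapper name and the node list for your system and scheduler; a811/a817 are just the node names from your output):
% mpif90 mpi_open.f -o mpi_open
% mpirun -np 2 --host a811,a817 ./mpi_open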
PnetCDF 1.3.1 is old; it was released 3 years ago.
Error code -208 in 1.3.1 means "file open/creation failed".
When this error code is returned, OpenMPI should also report a separate error message that provides more information. OpenMPI 1.6.3 is likewise about 3 years old.
If the above test program runs without errors, could you try the latest PnetCDF, 1.6.1, on that machine?
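In case it helps, here is a minimal sketch (not part of WRF; the file name pnc_err.f is just an example) that turns a PnetCDF error code such as -208 into its text using nfmpi_strerror() from the Fortran 77 API, so you can print the message right next to the failing NFMPI_CREATE call:
% cat pnc_err.f
      program pnc_err
      implicit none
      include "mpif.h"
      include "pnetcdf.inc"
      integer ierr
      call MPI_INIT(ierr)
!     -208 is the value NFMPI_CREATE returned in your 2-node run
      print*, 'PnetCDF error -208: ', trim(nfmpi_strerror(-208))
      call MPI_Finalize(ierr)
      end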
Wei-keng
On Jul 10, 2015, at 1:27 PM, John Michalakes wrote:
> Hi,
>
> Having a problem where an MPI Fortran program (WRF) can open a file for writing using NFMPI_CREATE when all tasks are on one node, but fails if the tasks are spread over multiple nodes. Have isolated to a small test program:
>
> Program hello
>   implicit none
>   include "mpif.h"
> #include "pnetcdf.inc"
>   integer :: stat, Status
>   integer :: info, ierr
>   integer Comm
>   integer ncid
> 
>   CALL MPI_INIT( ierr )
>   Comm = MPI_COMM_WORLD
>   call mpi_info_create( info, ierr )
>   CALL mpi_info_set(info,"romio_ds_write","disable", ierr) ;
>   write(0,*)'mpi_info_set write returns ',ierr
>   CALL mpi_info_set(info,"romio_ds_read","disable", ierr) ;
>   write(0,*)'mpi_info_set read returns ',ierr
>   stat = NFMPI_CREATE(Comm, 'testfile_d01', IOR(NF_CLOBBER, NF_64BIT_OFFSET), info, NCID)
>   write(0,*)'after NFMPI_CREATE ', stat
>   call mpi_info_free( info, ierr )
>   stat = NFMPI_CLOSE(NCID)
>   write(0,*)'after NFMPI_CLOSE ', stat
>   CALL MPI_FINALIZE( ierr )
>   STOP
> End Program hello
> Running with two tasks on a single node this generates:
>
> a515
> a515
> mpi_info_set write returns 0
> mpi_info_set read returns 0
> mpi_info_set write returns 0
> mpi_info_set read returns 0
> after NFMPI_CREATE 0
> after NFMPI_CREATE 0
> after NFMPI_CLOSE 0
> after NFMPI_CLOSE 0
>
> But running with 2 tasks, each on a separate node:
>
> a811
> a817
> mpi_info_set write returns 0
> mpi_info_set read returns 0
> mpi_info_set write returns 0
> mpi_info_set read returns 0
> after NFMPI_CREATE -208 <<<<<<<<<<<<<<
> after NFMPI_CLOSE -33
> after NFMPI_CREATE -208 <<<<<<<<<<<<<<
> after NFMPI_CLOSE -33
>
> I have tested the program on other systems such as NCAR’s Yellowstone and it works fine on any combination of nodes. This target system is a user’s system running openmpi/1.6.3 compiled for Intel. The version of PnetCDF is 1.3.1. I’m pretty sure it’s a Lustre file system (but I will have to follow up with the user and their support staff to be sure).
>
> I’m assuming there’s some misconfiguration or installation problem with MPI or PnetCDF on the user’s system, but I need some help with how to proceed. Thanks,
>
> John
>
> John Michalakes
> Scientific Programmer/Analyst
> National Centers for Environmental Prediction
> john.michalakes at noaa.gov
> 301-683-3847
>