Problem opening a file under OpenMPI

John Michalakes john.michalakes at noaa.gov
Fri Jul 10 15:14:39 CDT 2015


Hi Wei-keng and Jim,

The MPI-IO test program you sent works both on one node and on more than
one node.  I had to add a call to MPI_File_close before the MPI_Finalize to
get rid of some extraneous errors related to shutdown:

   [a514:22054] *** An error occurred in MPI_File_set_errhandler
   [a514:22054] *** on a NULL communicator
   [a514:22054] *** Unknown error
   [a514:22054] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

But once I did that, the output from the program was clean, with no error
messages.
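
For reference, here is a minimal sketch of the change (a stripped-down
variant of your test program, quoted below; error checking omitted, and the
comment marks the added line):

      program mpi_open_close
      implicit none
      include "mpif.h"
      integer err, fp, omode

      call MPI_Init(err)
      omode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      call MPI_File_open(MPI_COMM_WORLD, 'testfile_d01', omode,
     +                   MPI_INFO_NULL, fp, err)
c     closing the file handle before MPI_Finalize is what stopped the
c     MPI_File_set_errhandler errors at shutdown
      call MPI_File_close(fp, err)
      call MPI_Finalize(err)
      end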

I then downloaded and installed pNetCDF 1.6.1 on the user's machine and
tried my Fortran code again. Success!  

So whatever the problem was, upgrading to pNetCDF 1.6.1 seems to have fixed
things.  Thanks for your help.

John 





-----Original Message-----
From: Wei-keng Liao [mailto:wkliao at eecs.northwestern.edu] 
Sent: Friday, July 10, 2015 1:00 PM
To: John Michalakes
Cc: parallel-netcdf at lists.mcs.anl.gov
Subject: Re: Problem opening a file under OpenMPI

Hi, John

Can you try the following Fortran MPI program to see if you can create a
file?  Please test with the same OpenMPI 1.6.3 compiler, and maybe try a
different file path.

% cat mpi_open.f
      program mpi_open
      implicit none
      include "mpif.h"

      character(LEN=MPI_MAX_ERROR_STRING) err_string
      integer err, ierr, err_len, errorclass, fp, omode

      call MPI_INIT(err)
      
      omode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      call MPI_File_open(MPI_COMM_WORLD, 'testfile_d01', omode,
     +                   MPI_INFO_NULL, fp, err)
      if (err .NE. MPI_SUCCESS) then
          call MPI_Error_class(err, errorclass, ierr)
          call MPI_Error_string(err, err_string, err_len, ierr)
          print*,'Error: MPI_File_open() ' , trim(err_string)
      endif

      call MPI_Finalize(err)
      end
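
To build and run it, something like this should work (assuming the OpenMPI
compiler wrappers are in your PATH; the wrapper and launcher names may
differ on your install):

% mpif77 mpi_open.f -o mpi_open
% mpirun -np 2 ./mpi_open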

PnetCDF version 1.3.1 is old; it was released 3 years ago.  In 1.3.1,
error code -208 means "file open/creation failed".
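
If it helps, the PnetCDF Fortran API provides nfmpi_strerror to translate a
status code into its message text; a minimal sketch (assuming pnetcdf.inc
is on the include path):

      program perr
      implicit none
      include "pnetcdf.inc"
c     print the message string for PnetCDF status code -208
      print*, 'status -208: ', nfmpi_strerror(-208)
      end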

When PnetCDF returns this error code, OpenMPI should also report another
error message that provides more information.  OpenMPI 1.6.3 is also fairly
old, about 3 years.

If the above test program runs without errors, could you try the latest
PnetCDF, 1.6.1, on that machine?
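
Building it is the usual autoconf sequence; roughly (a sketch only: the
prefix path is just an example, and the MPI wrapper names may differ on
your system):

% ./configure --prefix=/usr/local/pnetcdf-1.6.1 MPICC=mpicc \
              MPIF77=mpif77 MPIF90=mpif90
% make
% make install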


Wei-keng

On Jul 10, 2015, at 1:27 PM, John Michalakes wrote:

> Hi,
>  
> Having a problem where an MPI Fortran program (WRF) can open a file for
> writing using NFMPI_CREATE when all tasks are on one node, but fails if
> the tasks are spread over multiple nodes.  Have isolated it to a small
> test program:
>  
> Program hello
>   implicit none
>   include "mpif.h"
> #include "pnetcdf.inc"
>   integer                           :: stat,Status
>   integer                           :: info, ierr
>   integer Comm
>   integer ncid
> 
>   CALL MPI_INIT( ierr )
>   Comm = MPI_COMM_WORLD
>   call mpi_info_create( info, ierr )
>   CALL mpi_info_set(info,"romio_ds_write","disable", ierr) ; write(0,*)'mpi_info_set write returns ',ierr
>   CALL mpi_info_set(info,"romio_ds_read","disable", ierr) ; write(0,*)'mpi_info_set read returns ',ierr
>   stat = NFMPI_CREATE(Comm, 'testfile_d01', IOR(NF_CLOBBER, NF_64BIT_OFFSET), info, NCID)
>   write(0,*)'after NFMPI_CREATE ', stat
>   call mpi_info_free( info, ierr )
>   stat = NFMPI_CLOSE(NCID)
>   write(0,*)'after NFMPI_CLOSE ', stat
>   CALL MPI_FINALIZE( ierr )
>   STOP
> End Program hello
> Running with two tasks on a single node this generates:
>  
> a515
> a515
> mpi_info_set write returns            0
> mpi_info_set read returns            0
> mpi_info_set write returns            0
> mpi_info_set read returns            0
> after NFMPI_CREATE            0
> after NFMPI_CREATE            0
> after NFMPI_CLOSE            0
> after NFMPI_CLOSE            0
>  
> But running with 2 tasks, each on a separate node:
>  
> a811
> a817
> mpi_info_set write returns            0
> mpi_info_set read returns            0
> mpi_info_set write returns            0
> mpi_info_set read returns            0
> after NFMPI_CREATE         -208   <<<<<<<<<<<<<<
> after NFMPI_CLOSE          -33
> after NFMPI_CREATE         -208  <<<<<<<<<<<<<<
> after NFMPI_CLOSE          -33
>  
> I have tested the program on other systems such as NCAR's Yellowstone,
> and it works fine on any combination of nodes.  This target system is a
> user's system running openmpi/1.6.3 compiled for Intel.  The version of
> pnetcdf is 1.3.1.  I'm pretty sure it's a Lustre file system (but I will
> have to follow up with the user and their support staff to be sure).
>  
> I'm assuming there's a misconfiguration or installation problem with MPI
> or pNetCDF on the user's system, but I need some help with how to
> proceed.  Thanks,
>  
> John
>  
> John Michalakes
> Scientific Programmer/Analyst
> National Centers for Environmental Prediction
> john.michalakes at noaa.gov
> 301-683-3847
>  


