MPI Failure at line 839 of nonblocking.c (MPI_File_write_all : MPI_ERR_IO: input/output error)

Wei-keng Liao wkliao at ece.northwestern.edu
Mon Sep 24 16:30:29 CDT 2012


Hi, Rob,

Jim also mentioned that his program hung after this MPI-IO error message
appeared. In the current pnetcdf, any MPI error at line 839 of nonblocking.c
makes the calling function return immediately. A few lines below, at line
850, is a call to MPI_File_set_view(), which is collective: a process that
returns early never reaches it while the other processes wait in it, and
hence Jim's program hangs. Maybe we should remove the return statement from
line 41 of macro.h, so the program can continue after a non-fatal MPI error
such as this one.
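
A rough sketch of the idea, not the actual pnetcdf code: every name below
(sketch_write_all, sketch_set_view, CHECK_RETURN, CHECK_DEFER, the SKETCH_*
codes) is a made-up stand-in, and no real MPI call is made. It only contrasts
returning on the first error with recording the error and still entering the
later collective call.

```c
/* Hypothetical stand-ins; these are NOT PnetCDF or MPI identifiers. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERR_IO  1

/* Counts how many "collective" calls this process actually entered. */
static int collective_calls_entered = 0;

/* Stand-in for a collective write that reports an I/O error when fail != 0. */
static int sketch_write_all(int fail) {
    collective_calls_entered++;
    return fail ? SKETCH_ERR_IO : SKETCH_SUCCESS;
}

/* Stand-in for a later collective call (e.g. resetting the file view).
 * In real MPI code every process must reach such a call, or the others wait. */
static int sketch_set_view(void) {
    collective_calls_entered++;
    return SKETCH_SUCCESS;
}

/* Early-return style: leave the calling function on the first error. */
#define CHECK_RETURN(err) \
    do { if ((err) != SKETCH_SUCCESS) return (err); } while (0)

/* Deferred style: remember the first error, but keep executing. */
#define CHECK_DEFER(err, status) \
    do { if ((err) != SKETCH_SUCCESS && (status) == SKETCH_SUCCESS) \
             (status) = (err); } while (0)

/* Skips sketch_set_view() when the write fails. */
static int flush_early_return(int fail) {
    int err = sketch_write_all(fail);
    CHECK_RETURN(err);
    err = sketch_set_view();
    CHECK_RETURN(err);
    return SKETCH_SUCCESS;
}

/* Always enters sketch_set_view(), then reports the first error seen. */
static int flush_deferred(int fail) {
    int status = SKETCH_SUCCESS;
    int err = sketch_write_all(fail);
    CHECK_DEFER(err, status);
    err = sketch_set_view();
    CHECK_DEFER(err, status);
    return status;
}
```

With a failing write, flush_early_return() enters only one of the two
collective calls, while flush_deferred() enters both and still returns the
error to its caller.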


Wei-keng

On Sep 24, 2012, at 2:53 PM, Rob Latham wrote:

> On Wed, Aug 08, 2012 at 02:17:19PM -0600, Jim Edwards wrote:
>> I am getting this error from parallel-netcdf using openmpi 1.4.5 and intel
>> 12.1.4 and a lustre filesystem. Because this is non-blocking I am having
>> a lot of difficulty pinpointing the issue, do you have any suggestions?
>> I buffer multiple variables before calling the nfmpi_wait_all and if I
>> turn off this buffering functionality it appears to work fine. All of
>> this functionality works on several other systems so I think that it
>> must be an issue lower in the software stack.
> 
> Hi Jim. Sorry to resurrect this old thread, especially when there's
> not a lot of new information for you.
> 
> Openmpi-1.5.2 (I think) contains a big ROMIO re-sync, including some
> Lustre collective I/O improvements. Your hunch that the problem lies
> with a lower level in the software stack (the MPI-IO library) is
> entirely consistent with that observation.
> 
> ==rob
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA


