pnetcdf and mvapich2 2.2

Rob Latham robl at mcs.anl.gov
Thu Feb 9 13:23:21 CST 2017



On 02/09/2017 01:07 PM, Wei-keng Liao wrote:
> Hi, Rob
>
> I have filed a bug report with mvapich, and they provided a patch the next day (impressive!).
> Mark will give the patch a try and get back to them.
> Please see the bug report in the discussion thread on the mvapich list:
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2017-February/006300.html

Great! Looks like this is indeed the patch they picked up.

==rob

>
> Wei-keng
>
> On Feb 9, 2017, at 12:58 PM, Rob Latham wrote:
>
>>
>>
>> On 02/03/2017 05:21 PM, Wei-keng Liao wrote:
>>> Hi, Mark
>>>
>>> The Lustre driver in mvapich2 appears to add O_CREAT to the open mode
>>> (line 50, in file src/mpi/romio/adio/ad_lustre/ad_lustre_open.c), even
>>> if the file is opened read-only. This is the root cause of one of the
>>> error messages you are seeing:
>>>   "expect error code NC_ENOENT but got NC_ENOTNC"
>>>
>>> Attached is a small MPI program to verify this error. Could you please
>>> give it a try on your mvapich2 build and Lustre?
>>> Compile command:
>>>    mpicc test_open_no_such_file.c -o test_open_no_such_file
>>> Run command:
>>>    mpiexec -n 1 ./test_open_no_such_file /lustre/path/non-exist-file
>>>
>>> If it is indeed an internal issue in mvapich, I can file a bug report with them.
>>> Thanks
>>
>> This bug affected both Panasas and Lustre.  I wanted to confirm that I fixed this in ROMIO's latest:
>>
>> $ ./test_open_no_such_file  /mnt/lustre/robl/no_such_file
>> non-exiting file "/mnt/lustre/robl/no_such_file"
>> MPI error string: File does not exist, error stack:
>> ADIOI_LUSTRE_OPEN(42): File /mnt/lustre/robl/no_such_file does not exist
>> Error class = MPI_ERR_NO_SUCH_FILE
>> [robl at centos6 ~]$ ./test_open_no_such_file  no_such_file
>> non-exiting file "no_such_file"
>> MPI error string: File does not exist, error stack:
>> ADIOI_UFS_OPEN(42): File no_such_file does not exist
>> Error class = MPI_ERR_NO_SUCH_FILE
>>
>> If you report this to mvapich, you might want to point them to this not-so-recent change: http://git.mpich.org/mpich.git/commit/92f1c69f0de87f93
>>
>> It might have taken a few commits for me to get this 100% correct.  I hope they just take latest ROMIO/MPICH.
>>
>> ==rob
>>
>>
>>>
>>> Wei-keng
>>>
>>> On Feb 3, 2017, at 12:25 PM, Wei-keng Liao wrote:
>>>
>>>> Hi, Mark
>>>>
>>>> For running "make check" on Lustre, could you please set the environment
>>>> variable PNETCDF_HINTS to "nc_header_align_size=512;nc_var_align_size=1"
>>>> and run "make check" again? I think it should pass make check. Do let me
>>>> know. These errors only occur for file systems whose striping size is
>>>> larger than 1. So, ext4 is not affected. I am working on a fix for that
>>>> test program. Please note this is a bug in the test program. The PnetCDF
>>>> library itself is intact.
>>>>
>>>> When running "make check", I suggest not setting the environment variable
>>>> PNETCDF_VERBOSE_DEBUG_MODE, as many of the tests trigger errors on
>>>> purpose. Those debugging messages can easily mask the true errors. That
>>>> environment variable is designed for testing one program at a time.
>>>>
>>>> As for the errors from mvapich2, I do not have access to a machine with
>>>> InfiniBand and thus could not give it a try. However, the errors look like
>>>> a similar issue that was discovered in OpenMPI recently: failure to
>>>> return the correct MPI error codes. I will look into the mvapich2 source
>>>> code to confirm.
>>>>
>>>> Thanks for trying various compilers and reporting the problem!
>>>>
>>>> Wei-keng
>>>>
>>>
>

