[EXTERNAL] Strange behavior in ncmpio_file_set_view

Wei-Keng Liao wkliao at northwestern.edu
Fri Oct 29 12:20:23 CDT 2021


I also noticed these comments in your test program.
    /*
     * Variable `attrib1` is of size 4 x 7
     * Rank 0 is reading all values of one `row`
     * Rank 1 is reading no values of the row
     */
I assume 'row' here is the Fortran row (column in C)
and thus a 'row' is of size 4 elements.

If this is the case, you do not need to call the vars API.
Calling the vara API is sufficient. Just FYI.
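To make the suggestion concrete, here is a minimal pure-C sketch (it does not call PnetCDF; the variable name, the chosen column, and the helper are illustrative only) of filling start/count for reading one Fortran "row" (a C column) of the 4 x 7 variable. With a count-1 selection in the strided dimension, the hypothetical call would be ncmpi_get_vara_double_all(ncid, varid, start, count, buf), with no stride argument at all:

```c
#include <assert.h>

/* Sketch only, not real PnetCDF code: fill start/count for reading one
 * Fortran "row" (a C column) of a 4 x 7 variable.  Rank 0 reads the
 * whole 4-element column; every other rank passes zero counts so it
 * still participates in the collective vara call. */
void row_selection(int rank, long long col,
                   long long start[2], long long count[2])
{
    if (rank == 0) {
        start[0] = 0;   count[0] = 4;   /* all 4 elements of the column */
        start[1] = col; count[1] = 1;   /* one column: count 1, no stride */
    } else {
        start[0] = start[1] = 0;        /* zero-sized selection */
        count[0] = count[1] = 0;
    }
}
```

Because the selection in the second dimension has count 1, a stride would never be applied, which is why vara suffices here.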


Wei-keng

On Oct 29, 2021, at 12:08 PM, Wei-Keng Liao <wkliao at northwestern.edu> wrote:

Hi, Greg

I could not reproduce the error. Here is what I am using to compile.
clang 11.0.0
OpenMPI 4.0.5
PnetCDF 1.12.1

Could you try changing the stride argument in the call to
ncmpi_get_vars_double_all() to NULL?
This can help determine whether the problem is limited to vars APIs.

Did you get a core dump? I wonder what line of code in OpenMPI
causes the error.

Wei-keng

On Oct 29, 2021, at 9:46 AM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:

Attached is a small program and file that seems to reproduce the issue I am seeing. I think the program is correct...
..Greg

On 10/28/21, 4:16 PM, "parallel-netcdf on behalf of Sjaardema, Gregory D" <parallel-netcdf-bounces at lists.mcs.anl.gov on behalf of gdsjaar at sandia.gov> wrote:

   By changing the variable at the netCDF level from NC_INDEPENDENT to NC_COLLECTIVE, it ended up hitting the `NC_REQ_COLL` test in get_varm.

   I downloaded and compiled the crash_mpiio.txt file and it runs correctly. I then recompiled using openmpi-4.0.1 and I do get the crash, so I am fairly confident that the openmpi I am using has the fix applied.  I will try to create a short program fragment to reproduce the crash...

   ..Greg

   On 10/28/21, 4:06 PM, "Wei-Keng Liao" <wkliao at northwestern.edu> wrote:


       Both constants NC_REQ_INDEP and NC_REQ_COLL are defined internally in PnetCDF.
       They are not visible to user applications. I wonder how you switch them.
       In PnetCDF, NC_REQ_INDEP is used when the program is in independent mode
       and NC_REQ_COLL in collective mode. Users can switch mode by calling
       ncmpi_begin_indep_data() and ncmpi_end_indep_data().

       Googling keyword "cost_calc" leads me to this page.
        https://github.com/open-mpi/ompi/issues/6758
        It has a test program that can reproduce the error: crash_mpiio.txt
       Maybe you can give it a try to see if the openmpi-4.0.5 you are using has
       incorporated the fix?

       If that is not the case, could you provide me a short program or a code
       fragment?

       Wei-keng

On Oct 28, 2021, at 4:24 PM, Sjaardema, Gregory D <gdsjaar at sandia.gov> wrote:

I am getting a floating point exception core dump below `ncmpio_file_set_view` with certain compilers…

I’ve been trying to track it down, but am confused by the code in `get_varm` for the case where one rank has no items to read and the other rank has items to read with a non-unit stride (7 in this case).
This is from a code using netCDF to call down into PnetCDF.

Originally, the variable being read was NC_REQ_INDEP, so the rank with zero items to read would return from `get_varm` at line 464 while the other rank would continue.  It would eventually end up in `ncmpio_file_set_view`, call down the stack, and finally throw the floating point exception.

Since there is a comment that `MPI_File_set_view` is collective, I figured the issue might be that only one rank was calling down that path, so I changed the variable to NC_REQ_COLL.  Both ranks now call down into `ncmpio_file_set_view`, but inside that routine the rank with zero bytes to read falls into the first if block, `if (filetype == MPI_BYTE)`, and the second rank goes down further and hits the next if block, `if (rank == 0)`.

Both ranks end up calling MPI_File_set_view, but with different types.  The end result is that I still get a floating point exception on the rank that does have bytes to read.  The exception seems to be in `cost_calc`.
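To summarize the divergence in one place, here is a rough pure-C mock of the branch behavior just described (the names and structure are mine, not the real `ncmpio_file_set_view` code): the zero-byte rank takes the `MPI_BYTE` early path while the rank with data falls through to the `rank == 0` path, so the two ranks reach `MPI_File_set_view` with different types.

```c
#include <string.h>

typedef enum { MOCK_BYTE, MOCK_SUBARRAY } mock_filetype;

/* Mock of the control flow described above (assumed names, not PnetCDF
 * internals): a rank whose filetype is MPI_BYTE takes the first if
 * block, while the rank with data to read reaches the rank-0 special
 * case, so the two ranks set different file views. */
const char *mock_set_view_path(int rank, mock_filetype ftype)
{
    if (ftype == MOCK_BYTE)      /* zero bytes to read: first if block */
        return "byte-view";
    if (rank == 0)               /* second if block in the description */
        return "rank0-view";
    return "other-view";
}
```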

This is with pnetcdf-1.12.1, clang-12.0.0 (also with clang-10.0.0) and openmpi-4.0.5.

I’m basically looking for guidance at this point on whether the calling paths look correct, or where to look in more depth…   Any help appreciated.

..Greg



<attrib.nc><attrib-test.c>



