error in enddef

Jim Edwards jedwards at ucar.edu
Tue Jun 28 11:38:21 CDT 2022


I haven't looked at the pattern for this case, but I suspect that it does.
It's an MPAS hexagonal mesh grid. I'll look into whether this ROMIO fix is
in a more recent impi version - I'm currently using 19.0.9.

On Tue, Jun 28, 2022 at 9:52 AM Wei-Keng Liao <wkliao at northwestern.edu>
wrote:

> Hi, Jim
>
> Thanks for the update.
>
> I am wondering if your I/O pattern produces a large number of
> noncontiguous file access requests in each MPI process.
> Because ROMIO uses MPI tags in its implementation of two-phase I/O,
> such a pattern can result in a large number of MPI Isend/Irecv calls,
> each using a unique MPI tag. The latest ROMIO has fixed this for
> Lustre (https://github.com/pmodels/mpich/pull/5660).
>
> Wei-keng
>
> On Jun 28, 2022, at 9:41 AM, Jim Edwards <jedwards at ucar.edu> wrote:
>
> Hi Wei-Keng,
>
> I found the issue with help from TACC user support:
>
> https://www.intel.com/content/www/us/en/developer/articles/technical/large-mpi-tags-with-the-intel-mpi.html
>
> I set the environment variables
>     MPIR_CVAR_CH4_OFI_RANK_BITS=15
>     MPIR_CVAR_CH4_OFI_TAG_BITS=24
> and added a print statement:
> cam_restart.F90     123 Maximum tag value queried   8388607
> This appears to be working.
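>
> For reference, here is a minimal C sketch of how the maximum tag value can
> be queried via the MPI_TAG_UB attribute (the actual print statement in
> cam_restart.F90 is Fortran and is not shown here):
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv) {
>       int flag;
>       int *tag_ub;  /* MPI_TAG_UB is returned as a pointer to int in C */
>
>       MPI_Init(&argc, &argv);
>       /* Query the largest tag value this MPI implementation supports. */
>       MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
>       if (flag)
>           printf("Maximum tag value queried %d\n", *tag_ub);
>       MPI_Finalize();
>       return 0;
>   }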
>
>
> On Tue, Jun 21, 2022 at 7:25 PM Wei-Keng Liao <wkliao at northwestern.edu>
> wrote:
>
>> Hi, Jim
>>
>> Is the ncmpi_enddef the first enddef call after the file creation,
>> or one after an ncmpi_redef?
>>
>> In the former case, there is no MPI communication in PnetCDF except
>> for an MPI_Barrier. In the latter case, if the file header size expands,
>> existing variables need to be moved to higher offsets, which requires
>> PnetCDF to call MPI collective reads and writes and thus leads to
>> MPI_Issend.
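>>
>> For illustration, a minimal C sketch of the two cases (file and variable
>> names are made up, and error checking is omitted):
>>
>>   #include <mpi.h>
>>   #include <pnetcdf.h>
>>
>>   int main(int argc, char **argv) {
>>       int ncid, dimid, varid1, varid2;
>>
>>       MPI_Init(&argc, &argv);
>>       ncmpi_create(MPI_COMM_WORLD, "demo.nc", NC_CLOBBER,
>>                    MPI_INFO_NULL, &ncid);
>>       ncmpi_def_dim(ncid, "n", 1024, &dimid);
>>       ncmpi_def_var(ncid, "var1", NC_DOUBLE, 1, &dimid, &varid1);
>>       ncmpi_enddef(ncid);   /* former case: only an MPI_Barrier inside */
>>
>>       /* ... write var1 here ... */
>>
>>       ncmpi_redef(ncid);    /* re-enter define mode */
>>       ncmpi_def_var(ncid, "var2", NC_DOUBLE, 1, &dimid, &varid2);
>>       ncmpi_enddef(ncid);   /* latter case: if the header grows, existing
>>                                variables are moved via collective I/O */
>>       ncmpi_close(ncid);
>>       MPI_Finalize();
>>       return 0;
>>   }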
>>
>> Can you try to get a core dump so we can trace the call stacks?
>>
>> You can also enable PnetCDF's safe mode, which makes additional MPI
>> communication calls for debugging purposes. Sometimes it helps narrow
>> down the cause of the problem. It can be enabled by setting the
>> environment variable PNETCDF_SAFE_MODE to 1.
>>
>> Wei-keng
>>
>> On Jun 21, 2022, at 5:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:
>>
>> I am using pnetcdf 1.12.3 and getting an error when it is compiled with
>> intel/19.1.1 and impi/19.0.9 on the TACC Frontera system.
>> I am getting very little information to guide me in debugging the error.
>>
>>
>> [785] Abort(634628) on node 785 (rank 785 in comm 0): Fatal error in
>> PMPI_Issend: Invalid tag, error stack:
>> [785] PMPI_Issend(156): MPI_Issend(buf=0x2b5c81edf40f, count=1025120,
>> MPI_BYTE, dest=0, tag=1048814, comm=0xc40000d7, request=0x7f2002783540)
>> failed
>> [785] PMPI_Issend(95).: Invalid tag, value is 1048814
>> TACC:  MPI job exited with code: 4
>> TACC:  Shutdown complete. Exiting.
>>
>>
>> I can tell that I am in a call to ncmpi_enddef, but I am not getting
>> anything beyond that - any ideas?
>>
>> --
>> Jim Edwards
>>
>> CESM Software Engineer
>> National Center for Atmospheric Research
>> Boulder, CO
>>
>>
>>
>
> --
> Jim Edwards
>
> CESM Software Engineer
> National Center for Atmospheric Research
> Boulder, CO
>
>
>

-- 
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO

