error in enddef

Wei-Keng Liao wkliao at northwestern.edu
Tue Jun 28 10:51:57 CDT 2022


Hi, Jim

Thanks for the update.

I am wondering if your I/O pattern produces a large number
of noncontiguous file access requests in each MPI process.
Because ROMIO uses MPI tags in its implementation of two-phase I/O,
this pattern can result in a large number of MPI_Isend/MPI_Irecv
calls, each using a unique MPI tag. The latest ROMIO has fixed this
for Lustre (https://github.com/pmodels/mpich/pull/5660).
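To illustrate the failure mode (a simplified sketch, not ROMIO's
actual code): when each outstanding request gets its own tag, the tag
value grows with the number of noncontiguous requests and can exceed
the limit the implementation advertises via the MPI_TAG_UB attribute.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, flag, *tag_ub;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* query the largest legal tag value for this MPI library */
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);

        /* hypothetical per-request tag assignment: a base tag plus a
           counter that grows with the number of pending requests */
        int base_tag = 100, nreqs = 2000000;   /* made-up numbers */
        for (int i = 0; i < nreqs; i++) {
            int tag = base_tag + i;
            if (flag && tag > *tag_ub) {
                if (rank == 0)
                    printf("tag %d exceeds MPI_TAG_UB %d\n",
                           tag, *tag_ub);
                break;   /* where "Invalid tag" aborts would appear */
            }
            /* MPI_Issend(..., dest, tag, comm, &req[i]); goes here */
        }

        MPI_Finalize();
        return 0;
    }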

Wei-keng

On Jun 28, 2022, at 9:41 AM, Jim Edwards <jedwards at ucar.edu> wrote:

Hi Wei-Keng,

I found the issue with help from TACC user support:
https://www.intel.com/content/www/us/en/developer/articles/technical/large-mpi-tags-with-the-intel-mpi.html

I set the environment variables
   MPIR_CVAR_CH4_OFI_RANK_BITS=15
   MPIR_CVAR_CH4_OFI_TAG_BITS=24
and added a print statement:
cam_restart.F90     123 Maximum tag value queried   8388607
This appears to be working.
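For anyone else who hits this: the queried maximum, 8388607, is
2^23 - 1, so with MPIR_CVAR_CH4_OFI_TAG_BITS=24 one of the tag bits
appears to be reserved internally. Jim's print statement is in
Fortran (cam_restart.F90); an equivalent standalone C query of
MPI_TAG_UB might look like this sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int flag, *tag_ub;
        MPI_Init(&argc, &argv);
        /* MPI_TAG_UB is a predefined attribute on MPI_COMM_WORLD
           holding the maximum tag value the implementation accepts */
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
        if (flag)
            printf("Maximum tag value queried %d\n", *tag_ub);
        MPI_Finalize();
        return 0;
    }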


On Tue, Jun 21, 2022 at 7:25 PM Wei-Keng Liao <wkliao at northwestern.edu> wrote:
Hi, Jim

Is the ncmpi_enddef the first enddef call after file creation,
or does it follow an ncmpi_redef?

In the former case, there is no MPI communication in PnetCDF, except
for an MPI_Barrier. In the latter case, if the file header size
expands, existing variables need to be moved to higher file offsets,
which requires PnetCDF to call MPI collective reads and writes and
thus leads to the MPI_Issend calls.
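For concreteness, a minimal sketch of the two cases (file and
variable names are made up; error checking omitted):

    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int ncid, dimid, varid;
        MPI_Init(&argc, &argv);

        /* case 1: first enddef after creation -- no data movement,
           only an MPI_Barrier inside PnetCDF */
        ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", 1024, &dimid);
        ncmpi_def_var(ncid, "var", NC_DOUBLE, 1, &dimid, &varid);
        ncmpi_enddef(ncid);

        /* case 2: enddef after a redef -- if the new attribute grows
           the header past its reserved space, existing variables must
           be moved to higher offsets via collective reads/writes */
        ncmpi_redef(ncid);
        ncmpi_put_att_text(ncid, NC_GLOBAL, "history", 5, "hello");
        ncmpi_enddef(ncid);

        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }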

Can you try to get a core dump so we can trace the call stack?

You can also enable PnetCDF safe mode, which makes additional MPI
communication calls for debugging purposes. Sometimes it helps narrow
down the cause of the problem. It can be enabled by setting the
environment variable PNETCDF_SAFE_MODE to 1.
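In a job script that is just "export PNETCDF_SAFE_MODE=1". If it is
more convenient to do from inside the program, a sketch like the
following should work, assuming PnetCDF reads the variable when the
file is created or opened:

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv) {
        int ncid;
        /* same effect as "export PNETCDF_SAFE_MODE=1" in the job
           script; set it before any file is created or opened */
        setenv("PNETCDF_SAFE_MODE", "1", 1);

        MPI_Init(&argc, &argv);
        ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER,
                     MPI_INFO_NULL, &ncid);
        /* ... define mode, enddef, writes ... */
        ncmpi_close(ncid);
        MPI_Finalize();
        return 0;
    }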

Wei-keng

On Jun 21, 2022, at 5:03 PM, Jim Edwards <jedwards at ucar.edu> wrote:

I am using PnetCDF 1.12.3 and getting an error when compiled with
intel/19.1.1 and impi/19.0.9 on the TACC Frontera system. I am
getting very little information to guide me in debugging the error.

[785] Abort(634628) on node 785 (rank 785 in comm 0): Fatal error in PMPI_Issend: Invalid tag, error stack:
[785] PMPI_Issend(156): MPI_Issend(buf=0x2b5c81edf40f, count=1025120, MPI_BYTE, dest=0, tag=1048814, comm=0xc40000d7, request=0x7f2002783540) failed
[785] PMPI_Issend(95).: Invalid tag, value is 1048814
TACC:  MPI job exited with code: 4
TACC:  Shutdown complete. Exiting.


I can tell that I am in a call to ncmpi_enddef but not getting anything beyond that - any ideas?

--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO



--
Jim Edwards

CESM Software Engineer
National Center for Atmospheric Research
Boulder, CO
