PnetCDF: MPI error with large number of processes

Lukas Umek lukas.umek at gmail.com
Thu Jul 7 07:17:02 CDT 2022


Hi,
I am using PnetCDF v1.12.2 to read and write large netCDF files (64-bit offset
and CDF-5 formats, > 10 GB per file) with the WRF model. This works fine up to
a certain number of MPI processes: a run on 4080 MPI processes works, but a
job with 4200 MPI processes fails during I/O. An example of the error message
I get is below:

Invalid error code (-1) (error ring index 127 invalid)
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in
MPIDI_NM_mpi_allgather:202
Abort(873534479) on node 1450 (rank 1450 in comm 0): Fatal error in
PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(401)..........................:
MPI_Allgather(sbuf=0x7ffc94b87a48, scount=1, MPI_LONG_LONG_INT,
rbuf=0xd1bba70, rcount=1, datatype=MPI_LONG_LONG_INT, comm=comm=0xc400001a)
failed
MPIDI_Allgather_intra_composition_alpha(1844):
MPIDI_NM_mpi_allgather(202)..................:

This happens with Intel MPI 2019.9 and 2021.2. With MVAPICH2 2.3.5 I am able
to write files with PnetCDF using more MPI processes (I tried up to 5760 and
that worked). However, performance with MVAPICH2 is much worse, so it is not
really an option: the time for writing to disk more than triples compared to
jobs using Intel MPI with the same core count and data.

My problem sounds similar to some threads I found online:
- https://lists.mcs.anl.gov/pipermail/parallel-netcdf/2013-August/001519.html
- https://lists.mcs.anl.gov/pipermail/parallel-netcdf/2010-October/001143.html
(Setting MPI_TYPE_MAX as suggested in the second post did not help in my case.)

Is anybody aware of any limitations Intel MPI imposes when used with PnetCDF?
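
In case it helps, a minimal standalone test along these lines could be used to
check whether the problem reproduces outside WRF at the failing process count.
This is only a sketch, not the WRF I/O code; the file name, variable layout,
and sizes are invented, but it goes through the same collective CDF-5 write
path of the standard PnetCDF C API:

/* repro.c - hypothetical minimal test (not the actual WRF I/O code):
 * every rank writes one row of a 2-D variable to a CDF-5 file with a
 * collective call. File name, variable name, and sizes are made up. */
#include <stdio.h>
#include <mpi.h>
#include <pnetcdf.h>

#define CHECK(err) do { if ((err) != NC_NOERR) { \
    fprintf(stderr, "PnetCDF error: %s\n", ncmpi_strerror(err)); \
    MPI_Abort(MPI_COMM_WORLD, 1); } } while (0)

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid[2], varid, err, i;
    MPI_Offset start[2], count[2];
    double buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* NC_64BIT_DATA selects CDF-5, matching the failing file format */
    err = ncmpi_create(MPI_COMM_WORLD, "test_cdf5.nc",
                       NC_CLOBBER | NC_64BIT_DATA, MPI_INFO_NULL, &ncid);
    CHECK(err);

    err = ncmpi_def_dim(ncid, "nproc", (MPI_Offset)nprocs, &dimid[0]); CHECK(err);
    err = ncmpi_def_dim(ncid, "ncol", 1024, &dimid[1]);                CHECK(err);
    err = ncmpi_def_var(ncid, "data", NC_DOUBLE, 2, dimid, &varid);    CHECK(err);
    err = ncmpi_enddef(ncid); CHECK(err);

    for (i = 0; i < 1024; i++) buf[i] = (double)rank;

    /* collective write: one row per rank */
    start[0] = rank;  start[1] = 0;
    count[0] = 1;     count[1] = 1024;
    err = ncmpi_put_vara_double_all(ncid, varid, start, count, buf); CHECK(err);

    err = ncmpi_close(ncid); CHECK(err);
    MPI_Finalize();
    return 0;
}

Built with something like "mpicc repro.c -lpnetcdf" (assuming the PnetCDF
include and library paths are set) and run at 4200 processes, this should show
whether the Allgather failure is specific to the WRF output pattern or hits
any collective PnetCDF write under Intel MPI.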

cheers,
Lukas