[Darshan-users] COMM Split Error

Hassan Asghar haxxanasghar at gmail.com
Wed Jun 16 03:25:35 CDT 2021


I get the following MPI_Comm_split error when running MADbench2 with Darshan preloaded. Please help.

[haxxanasghar@gpuserver2 ~]$ mpiexec -n 16 -f machinefile -env LD_PRELOAD=/home/haxxanasghar/darshan/darshan-runtime/lib/libdarshan.so ./MADbench2 640 80 4 8 8 4 4

MADbench 2.0 IO-mode
no_pe = 16  no_pix = 640  no_bin = 80  no_gang = 4  sblocksize = 8  fblocksize = 8  r_mod = 4  w_mod = 4
IOMETHOD = POSIX  IOMODE = SYNC  FILETYPE = UNIQUE  REMAP = CUSTOM

Fatal error in PMPI_Comm_split: A process has failed, error stack:
PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=2, new_comm=0x6073e0) failed
PMPI_Comm_split(456)......:
MPIR_Comm_split_impl(143).:
MPIR_Allgather_impl(807)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 3
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 6
Fatal error in PMPI_Comm_split: A process has failed, error stack:
PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=3, key=1, new_comm=0x6073e0) failed
PMPI_Comm_split(456)......:
MPIR_Comm_split_impl(143).:
MPIR_Allgather_impl(807)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 12
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 15
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 9

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[mpiexec@gpuserver2] HYDU_sock_read (./utils/sock/sock.c:243): read error (Bad file descriptor)
[mpiexec@gpuserver2] control_cb (./pm/pmiserv/pmiserv_cb.c:201): unable to read command from proxy
[mpiexec@gpuserver2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@gpuserver2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@gpuserver2] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[haxxanasghar@gpuserver2 ~]$