[Darshan-users] COMM Split Error

Snyder, Shane ssnyder at mcs.anl.gov
Thu Jun 17 16:24:16 CDT 2021


Hi Hassan,

I'm assuming you don't see this error if Darshan isn't preloaded?

I just built MADbench2 on my system, and it runs fine for me with Darshan preloaded and generates a log. I'm just running 16 processes on my laptop using MPICH 3.2.1.

Can you share more details about your setup? For starters, what MPI are you using, what version of Darshan are you using, and how have you configured Darshan? Off the top of my head I can't think of any reason Darshan would cause a crash in MPI_Comm_split, so we may need to find a way for me to reproduce the issue.
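
For reference, here's roughly how I'd gather those details (a sketch; the exact commands depend on your install, and the Darshan version is usually just the version of the source tree you built from):

```shell
# MPI implementation and version (MPICH's Hydra mpiexec supports --version).
mpiexec --version

# Which MPI library the application binary actually links against --
# relevant when LD_PRELOADing a Darshan built against a different MPI.
ldd ./MADbench2 | grep -i mpi
```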

Thanks,
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of Hassan Asghar <haxxanasghar at gmail.com>
Sent: Wednesday, June 16, 2021 3:25 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: [Darshan-users] COMM Split Error

I am facing the following issue; please help:

[haxxanasghar at gpuserver2 ~]$ mpiexec -n 16 -f machinefile -env LD_PRELOAD=/home/haxxanasghar/darshan/darshan-runtime/lib/libdarshan.so ./MADbench2 640 80 4 8 8 4 4

MADbench 2.0 IO-mode
no_pe = 16  no_pix = 640  no_bin = 80  no_gang = 4  sblocksize = 8  fblocksize = 8  r_mod = 4  w_mod = 4
IOMETHOD = POSIX  IOMODE = SYNC  FILETYPE = UNIQUE  REMAP = CUSTOM

Fatal error in PMPI_Comm_split: A process has failed, error stack:
PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=2, new_comm=0x6073e0) failed
PMPI_Comm_split(456)......:
MPIR_Comm_split_impl(143).:
MPIR_Allgather_impl(807)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 3
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 6
Fatal error in PMPI_Comm_split: A process has failed, error stack:
PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=3, key=1, new_comm=0x6073e0) failed
PMPI_Comm_split(456)......:
MPIR_Comm_split_impl(143).:
MPIR_Allgather_impl(807)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 12
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 15
MPIR_Allgather_intra(181).:
dequeue_and_set_error(888): Communication error with rank 9

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[mpiexec at gpuserver2] HYDU_sock_read (./utils/sock/sock.c:243): read error (Bad file descriptor)
[mpiexec at gpuserver2] control_cb (./pm/pmiserv/pmiserv_cb.c:201): unable to read command from proxy
[mpiexec at gpuserver2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at gpuserver2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec at gpuserver2] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
[haxxanasghar at gpuserver2 ~]$

