[Darshan-users] COMM Split Error

Hassan Asghar haxxanasghar at gmail.com
Thu Jun 17 17:16:33 CDT 2021


Thanks for your response.

My MPICH version is 3.0.

On Fri, 18 Jun 2021, 6:24 am Snyder, Shane, <ssnyder at mcs.anl.gov> wrote:

> Hi Hassan,
>
> I'm assuming you don't see this error if Darshan isn't preloaded?
>
> I just built MADbench2 on my system, and it runs fine for me with Darshan
> preloaded and generates a log. I'm just running 16 processes on my laptop
> using MPICH 3.2.1.
>
> Can you share more details about your setup? For starters, what MPI are
> you using, what version of Darshan are you using, and how have you
> configured Darshan? I can't really think of any reason Darshan would cause
> a crash in MPI_Comm_split off the top of my head, so we might need to find
> a way for me to reproduce the issue.
>
> Thanks,
> --Shane
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of Hassan Asghar <haxxanasghar at gmail.com>
> *Sent:* Wednesday, June 16, 2021 3:25 AM
> *To:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* [Darshan-users] COMM Split Error
>
> I am facing the following issue; please help:
>
> [haxxanasghar@gpuserver2 ~]$ mpiexec -n 16 -f machinefile -env
> LD_PRELOAD=/home/haxxanasghar/darshan/darshan-runtime/lib/libdarshan.so
> ./MADbench2 640 80 4 8 8 4 4
>
> MADbench 2.0 IO-mode
> no_pe = 16  no_pix = 640  no_bin = 80  no_gang = 4  sblocksize = 8
>  fblocksize = 8  r_mod = 4  w_mod = 4
> IOMETHOD = POSIX  IOMODE = SYNC  FILETYPE = UNIQUE  REMAP = CUSTOM
>
> Fatal error in PMPI_Comm_split: A process has failed, error stack:
> PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=2,
> new_comm=0x6073e0) failed
> PMPI_Comm_split(456)......:
> MPIR_Comm_split_impl(143).:
> MPIR_Allgather_impl(807)..:
> MPIR_Allgather(766).......:
> MPIR_Allgather_intra(181).:
> dequeue_and_set_error(888): Communication error with rank 3
> MPIR_Allgather_intra(181).:
> dequeue_and_set_error(888): Communication error with rank 6
> Fatal error in PMPI_Comm_split: A process has failed, error stack:
> PMPI_Comm_split(474)......: MPI_Comm_split(MPI_COMM_WORLD, color=3, key=1,
> new_comm=0x6073e0) failed
> PMPI_Comm_split(456)......:
> MPIR_Comm_split_impl(143).:
> MPIR_Allgather_impl(807)..:
> MPIR_Allgather(766).......:
> MPIR_Allgather_intra(181).:
> dequeue_and_set_error(888): Communication error with rank 12
> MPIR_Allgather_intra(181).:
> dequeue_and_set_error(888): Communication error with rank 15
> MPIR_Allgather_intra(181).:
> dequeue_and_set_error(888): Communication error with rank 9
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
> [mpiexec@gpuserver2] HYDU_sock_read (./utils/sock/sock.c:243): read error
> (Bad file descriptor)
> [mpiexec@gpuserver2] control_cb (./pm/pmiserv/pmiserv_cb.c:201): unable
> to read command from proxy
> [mpiexec@gpuserver2] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec@gpuserver2] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec@gpuserver2] main (./ui/mpich/mpiexec.c:331): process manager
> error waiting for completion
> [haxxanasghar@gpuserver2 ~]$
>
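
A minimal reproducer along the lines below may help narrow this down. It is a
sketch only: the file name comm_split_test.c, the gang count, and the rank
coloring are assumptions for illustration, not taken from MADbench2's source.
Every rank simply splits MPI_COMM_WORLD, mirroring the call shown in the error
stack, so running it over the same machinefile with and without the Darshan
LD_PRELOAD should show whether the failure follows Darshan or the underlying
MPICH setup.

/* comm_split_test.c -- minimal MPI_Comm_split test (hypothetical example). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, gang_rank;
    int no_gang = 4;          /* assumption: mirrors no_gang = 4 from the MADbench2 run */
    MPI_Comm gang_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the world communicator into no_gang sub-communicators. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % no_gang, rank, &gang_comm);
    MPI_Comm_rank(gang_comm, &gang_rank);

    printf("world rank %d of %d -> gang %d, gang rank %d\n",
           rank, size, rank % no_gang, gang_rank);

    MPI_Comm_free(&gang_comm);
    MPI_Finalize();
    return 0;
}

Building it with mpicc and launching it exactly as in the failing run
(mpiexec -n 16 -f machinefile, once plain and once with the same -env
LD_PRELOAD setting) keeps everything else constant, so a crash in both cases
would point away from Darshan and toward the MPI or network configuration.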