Date: Thu, 23 Dec 2004 04:25:20 -0600 (CST)
From: Jianwei Li
To: mpich2-maint@mcs.anl.gov
Subject: [MPICH2 Req #1174] ROMIO interleaving (0-bytes) write_all bug

Hi,

I just traced a ROMIO collective write bug in MPICH2-1.0. I noticed it while developing the Parallel netCDF code across multiple platforms with different MPI implementations. My code did not break with IBM MPI on an IBM SP machine, or on a Linux cluster with MPICH1 v1.2.5, but it breaks with my latest MPICH2 installation on a Linux machine.

System where the bug is detected:
    Intel Pentium III (Cascades) SMP-8 (1 node, 8 CPUs)
    Linux 2.4.9-e.40smp
    gcc 2.96
    MPICH2-1.0 built with the default settings (w/ ROMIO)

Bug Description:
When N processes participate in a collective write operation, where some have normal (offset, length) pairs but others have length == 0 with an offset that happens to fall within some other process's (offset, length) range, the collective operation crashes in most cases. (Quite a few cases survive, so one may need to be very careful and patient to track the bug. Try some odd/prime number of processes. :) Luckily, testing the bug is as simple as calling MPI_File_write_at_all(offset, length, MPI_BYTE) with MPI_INFO_NULL, e.g.:

    P0 ( off_0 = 0,          len_0 > 2 )
    P1 ( off_1 < len_0 - 1,  len_1 = 0 )
    P2 ( off_2 < len_0 - 1,  len_2 = 0 )
    P3 ( off_3 < len_0 - 1,  len_3 = 0 )
    ...

(A complete reproducer sketch appears after the "Application Affected" paragraph below.) Just running this example with various numbers of processes will produce a crash with an error message like the following:

Error Message:
##########################START OF ERR MSG#############################
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(375): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75):
MPIC_Sendrecv(161):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(614): connection failure (set=0,sock=3,errno=104:Connection reset by peer)

aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(375): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75):
MPIC_Sendrecv(161):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(614): connection failure (set=0,sock=1,errno=104:Connection reset by peer)

rank 2 in job 400 spiderbox.ece.northwestern.edu_54769 caused collective abort of all ranks
  exit status of rank 2: return code 13
rank 0 in job 400 spiderbox.ece.northwestern.edu_54769 caused collective abort of all ranks
  exit status of rank 0: killed by signal 11
##########################END OF ERR MSG#############################

Application Affected:
Generally, any parallel application in which one or more processes collectively write 0 elements at an offset that overlaps another process's write range may suffer from this bug. Typically, darray/subarray dynamically partitioned data or partitions over an odd number of processes can produce such access patterns.
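For reference, here is a minimal reproducer sketch of the access pattern described above (the file name "testfile", the 7-byte write on rank 0, and the offset 1 on the other ranks are arbitrary choices that satisfy the constraints; error checking is omitted for brevity):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[8] = "abcdefg";
        MPI_File fh;
        MPI_Offset offset;
        int count;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        if (rank == 0) {
            offset = 0;    /* off_0 = 0 */
            count  = 7;    /* len_0 > 2 */
        } else {
            offset = 1;    /* falls inside rank 0's range: off < len_0 - 1 */
            count  = 0;    /* zero-length write */
        }

        /* collective write: ranks > 0 contribute 0 bytes */
        MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Running this with, e.g., "mpiexec -n 4" (or various other process counts) on the affected build should trigger the crash shown above.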
First Bug Location (With Possible Solution):

mpich2-1.0/src/mpi/romio/adio/common/ad_write_coll.c, line 120 of 1013:

    if (st_offsets[i] < end_offsets[i-1]) interleave_count++;

When counting the "interleave_count", segments with length == 0 should not be counted, even if their starting offsets fall within the previous segment's range. To fix this, I modified the above line to:

    if (st_offsets[i] < end_offsets[i-1] &&
        st_offsets[i] <= end_offsets[i])
        interleave_count++;

There may be other, more direct or more serious causes of this bug, but for now the above change is the most straightforward way to address the issue, and it has worked fine for me so far.

Jianwei
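P.S. To illustrate why the extra test excludes zero-length segments: ROMIO records each process's access range as [st_offsets[i], end_offsets[i]], with the end offset computed as the start offset plus the access length minus one, so a zero-length access has end_offsets[i] == st_offsets[i] - 1 and fails the added test. The standalone sketch below (my own illustration, not ROMIO code; the offsets mirror the P0..P3 example above) compares the original and patched counts:

    #include <stdio.h>

    int main(void)
    {
        /* P0 writes 7 bytes at offset 0; P1..P3 write 0 bytes at offset 1.
           end_offsets[i] = st_offsets[i] + length - 1, so a zero-length
           access "ends" one byte before it starts. */
        long st_offsets[]  = { 0, 1, 1, 1 };
        long end_offsets[] = { 6, 0, 0, 0 };
        int  nprocs = 4, i;
        int  original_count = 0, patched_count = 0;

        for (i = 1; i < nprocs; i++) {
            /* original test at ad_write_coll.c line 120 */
            if (st_offsets[i] < end_offsets[i-1])
                original_count++;
            /* patched test: also require a non-empty segment */
            if (st_offsets[i] < end_offsets[i-1] &&
                st_offsets[i] <= end_offsets[i])
                patched_count++;
        }

        /* prints: interleave_count original=1 patched=0 */
        printf("interleave_count original=%d patched=%d\n",
               original_count, patched_count);
        return 0;
    }

With the original test, the zero-length access on P1 is counted as interleaving with P0's range; with the patched test it is ignored, which matches the intent of the interleaving check.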