Date: Thu, 23 Dec 2004 04:25:20 -0600 (CST)
From: Jianwei Li
To: mpich2-maint@mcs.anl.gov
Subject: [MPICH2 Req #1174] ROMIO interleaving (0-bytes) write_all bug

Hi,

I just traced a ROMIO collective write bug in MPICH2-1.0. I noticed it while developing the Parallel netCDF code across multiple platforms with different MPI implementations. My code did not break with IBM MPI on an IBM SP machine, or on a Linux cluster with MPICH1 v1.2.5, but it breaks with my latest MPICH2 installation on a Linux machine.

System where the bug is detected:
    Intel Pentium III (Cascades) SMP-8 (1 node, 8 CPUs)
    Linux 2.4.9-e.40smp
    gcc 2.96
    MPICH2-1.0 built with the default settings (w/ ROMIO)

Bug Description:
When N processes participate in a collective write operation, where some have normal (offset, length) pairs but others have length == 0 with an offset that happens to fall within some other process's (offset, length) range, the collective operation crashes in most cases. (Quite a few cases survive, so one may need to be very careful and patient to track the bug. Try some odd/prime number of processes. :) Luckily, testing the bug is as simple as calling MPI_File_write_at_all(offset, length, MPI_BYTE) with MPI_INFO_NULL, e.g.:

    P0 ( off_0 = 0,          len_0 > 2 )
    P1 ( off_1 < len_0 - 1,  len_1 = 0 )
    P2 ( off_2 < len_0 - 1,  len_2 = 0 )
    P3 ( off_3 < len_0 - 1,  len_3 = 0 )
    ...

(A complete reproducer sketch appears after the "Application Affected" paragraph below.) Just running this example with various numbers of processes will produce a crash with an error message like the following:

Error Message:
##########################START OF ERR MSG#############################
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(375): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75):
MPIC_Sendrecv(161):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(614): connection failure (set=0,sock=3,errno=104:Connection reset by peer)

aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(375): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75):
MPIC_Sendrecv(161):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(614): connection failure (set=0,sock=1,errno=104:Connection reset by peer)

rank 2 in job 400 spiderbox.ece.northwestern.edu_54769 caused collective abort of all ranks
  exit status of rank 2: return code 13
rank 0 in job 400 spiderbox.ece.northwestern.edu_54769 caused collective abort of all ranks
  exit status of rank 0: killed by signal 11
##########################END OF ERR MSG#############################

Application Affected:
Generally, any parallel application in which one or more processes collectively write 0 elements at an offset that overlaps another process's write range may suffer from this bug. Typically, darray/subarray dynamically partitioned data or partitions over an odd number of processes can produce such access patterns.
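For reference, here is a minimal reproducer sketch of the access pattern described above (the file name "testfile", the 7-byte write on rank 0, and the offset 1 on the other ranks are arbitrary choices that satisfy the constraints; error checking is omitted for brevity):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[8] = "abcdefg";
        MPI_File fh;
        MPI_Offset offset;
        int count;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        if (rank == 0) {
            offset = 0;    /* off_0 = 0 */
            count  = 7;    /* len_0 > 2 */
        } else {
            offset = 1;    /* falls inside rank 0's range: off < len_0 - 1 */
            count  = 0;    /* zero-length write */
        }

        /* collective write: ranks > 0 contribute 0 bytes */
        MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Running this with, e.g., "mpiexec -n 4" (or various other process counts) on the affected build should trigger the crash shown above.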
First Bug Location (With Possible Solution):

mpich2-1.0/src/mpi/romio/adio/common/ad_write_coll.c, line 120 of 1013:

    if (st_offsets[i] < end_offsets[i-1]) interleave_count++;

When counting the "interleave_count", segments with length == 0 should not be counted, even if their starting offsets fall within the previous segment's range. To fix this, I modified the above line to:

    if (st_offsets[i] < end_offsets[i-1] &&
        st_offsets[i] <= end_offsets[i])
        interleave_count++;

There may be other, more direct or more serious causes of this bug, but for now the above change is the most straightforward way to address the issue, and it has worked fine for me so far.

Jianwei
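P.S. To illustrate why the extra test excludes zero-length segments: ROMIO records each process's access range as [st_offsets[i], end_offsets[i]], with the end offset computed as the start offset plus the access length minus one, so a zero-length access has end_offsets[i] == st_offsets[i] - 1 and fails the added test. The standalone sketch below (my own illustration, not ROMIO code; the offsets mirror the P0..P3 example above) compares the original and patched counts:

    #include <stdio.h>

    int main(void)
    {
        /* P0 writes 7 bytes at offset 0; P1..P3 write 0 bytes at offset 1.
           end_offsets[i] = st_offsets[i] + length - 1, so a zero-length
           access "ends" one byte before it starts. */
        long st_offsets[]  = { 0, 1, 1, 1 };
        long end_offsets[] = { 6, 0, 0, 0 };
        int  nprocs = 4, i;
        int  original_count = 0, patched_count = 0;

        for (i = 1; i < nprocs; i++) {
            /* original test at ad_write_coll.c line 120 */
            if (st_offsets[i] < end_offsets[i-1])
                original_count++;
            /* patched test: also require a non-empty segment */
            if (st_offsets[i] < end_offsets[i-1] &&
                st_offsets[i] <= end_offsets[i])
                patched_count++;
        }

        /* prints: interleave_count original=1 patched=0 */
        printf("interleave_count original=%d patched=%d\n",
               original_count, patched_count);
        return 0;
    }

With the original test, the zero-length access on P1 is counted as interleaving with P0's range; with the patched test it is ignored, which matches the intent of the interleaving check.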