[MPICH] Parallel I/O problems on 64-bit machine (please help :-( )

Peter Diamessis pjd38 at cornell.edu
Fri May 26 16:33:47 CDT 2006


Hello back Yusong and Rajeev,

Indeed, I did try configuring MPICH2 (and MPICH1) to accommodate
8-byte integers. When I do that, I get errors from MPI_TYPE_CREATE_SUBARRAY
which state that the first element of the array of sizes is set to 0. Now,
my global array has no zero dimension, and that is confirmed by printing
out gsizes(i) (i=1,...,3). Hm?
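
For reference, the call in output.f follows the usual pattern; here is a
minimal sketch with the dimensions from my earlier test (variable names are
illustrative, and nprocs/myrank stand for the communicator size and rank):

      integer :: gsizes(3), lsizes(3), starts(3), filetype, ierr
      gsizes = (/ 64, 64, 175 /)               ! global sizes, all non-zero
      lsizes = (/ 64, 64, 175/nprocs /)        ! local slab, 1-D decomposition
      starts = (/ 0, 0, myrank*(175/nprocs) /) ! starts are zero-based
      call MPI_TYPE_CREATE_SUBARRAY(3, gsizes, lsizes, starts,
     &     MPI_ORDER_FORTRAN, MPI_REAL, filetype, ierr)
      call MPI_TYPE_COMMIT(filetype, ierr)

One guess as to how a non-zero size could be reported as 0: if the MPICH
library itself was built for 4-byte INTEGERs while the application is
compiled with -i8, the 8-byte array (/ 64, 64, 175 /) is read by the
library as the 4-byte sequence 64, 0, 64, 0, 175, 0 (on a little-endian
machine), so array_of_sizes[1] on the C side really does look like 0 even
though every gsizes(i) prints as non-zero.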

I really apologize if I'm troubling you guys with something totally simple.
Following Rajeev's request, I've attached a sample program. It consists of
three Fortran source codes:
a) main: the main driver.
b) mpi_setup: I had originally planned to use a 2-D domain decomposition
but I've ended up working with 1-D, so this is more or less superfluous.
It's only needed when setting up the local starting indices.
c) output: the routine which gives me problems.

I've attached the corresponding makefile to compile with Absoft mpif90.
The file dim.h simply specifies the dimensions of some arrays, in
particular the test array u(...,...,...), which is dumped out in single
precision.

When I run this simple code on my 32-bit machine it works w/o a problem.
When I run it on the 64-bit machine I get the same old SIGSEGV 11 error
from MPI_FILE_WRITE_ALL.
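
For completeness, the failing call boils down to this (a sketch; fh,
tempout and local_array_size are as in output.f, with an explicit status
array and error code added for debugging):

      integer :: status(MPI_STATUS_SIZE), ierr
      call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
     &     MPI_REAL, status, ierr)
      if (ierr .ne. MPI_SUCCESS) print *, 'write_all: ierr =', ierr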

Again, I hope I'm not being a hassle.

Any insight on this sample code would be greatly appreciated. I still owe 
you folks beers :-)

Cheers,

Peter

----- Original Message ----- 
From: "Yusong Wang" <ywang25 at aps.anl.gov>
To: "Peter Diamessis" <pjd38 at cornell.edu>
Cc: "Ashley Pittman" <ashley at quadrics.com>; <thakur at mcs.anl.gov>; 
<mpich-discuss at mcs.anl.gov>
Sent: Friday, May 26, 2006 5:04 PM
Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (please help :-( )


> According to the manual, the Fortran standard requires that INTEGER and
> REAL be the same size. The compiler you used doesn't conform to the
> standard, assuming the manual is still correct (i.e. the standard has
> not changed). For your case, it seems to me you may need to force the
> integers to be 8 bytes when configuring MPICH2. Furthermore, such a
> compiler would then be required to use 128 bits for DOUBLE PRECISION
> quantities. It shouldn't be hard to check the size of those variables
> with a small test at run time.
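>
> A minimal sketch of such a check in plain Fortran 90 (no MPI needed;
> bit_size is a standard intrinsic and the transfer trick counts bytes):
>
>       program checksizes
>       integer :: i
>       real :: r
>       double precision :: d
>       print *, 'INTEGER bits:', bit_size(i)
>       print *, 'REAL bytes:  ', size(transfer(r, (/'a'/)))
>       print *, 'DP bytes:    ', size(transfer(d, (/'a'/)))
>       end program checksizes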
>
> These are just some of my suggestions to try. I can't rule out the
> possibility that MPI_FILE_WRITE_ALL has a problem of its own.
>
> Yusong
>
>
> On Fri, 2006-05-26 at 15:16 -0400, Peter Diamessis wrote:
>> Hi folks,
>>
>> Well, I did read the specific question Yusong pointed out
>> in the MPICH2 manual. It seems that this issue is specific to
>> the GNU F95 compiler. The Absoft F90 compiler uses a default
>> 4-byte length for integers and 8 bytes for reals, i.e. there is no
>> such conflict. It seems to me that configuring MPICH2 with -i4
>> is pretty much superfluous.
>>
>> Nevertheless, I tried it on both MPICH2 and MPICH (v1.2.6 and
>> v1.2.7p1) and I get the same error. I even tried -i8 for the heck
>> of it and ran into a whole new suite of problems. I repeat, MPICH
>> v1.2.6 (including I/O) has worked beautifully for me on 32-bit
>> machines. If I don't call my MPI parallel I/O routines, and more
>> specifically comment out the calls to MPI_FILE_WRITE_ALL and
>> MPI_FILE_READ_ALL, the rest of the code works perfectly fine on
>> a 64-bit machine (including other MPI I/O calls).
>>
>> So is this what Ashley pointed out? A bug specific to
>> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?
>>
>> Any additional feedback would be very welcome.
>>
>> Many thanks in advance,
>>
>> Peter
>>
>>
>> ----- Original Message ----- 
>> From: "Ashley Pittman" <ashley at quadrics.com>
>> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
>> Sent: Wednesday, May 24, 2006 7:07 AM
>> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (please help :-( )
>>
>>
>> >
>> > The structf failure on 64-bit machines is a bug in the spec, not a bug
>> > in the compiler.  In effect, the spec itself isn't 64-bit safe.
>> > Following down the path of the structf error will lead to a dead end.
>> >
>> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently; I'll
>> > see if I can dig up my notes about it.
>> >
>> > Ashley,
>> >
>> >
>> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
>> >> Thanks a-many, Yusong,
>> >>
>> >> I'll contact the Absoft people to see if there is a similar issue
>> >> with their F90-95 compiler. I have to be on travel tomorrow
>> >> but I'll get back to this on Thursday.
>> >>
>> >> The pointer is much appreciated,
>> >>
>> >> Peter
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> Cc: <mpich-discuss at mcs.anl.gov>
>> >> Sent: Tuesday, May 23, 2006 5:53 PM
>> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (please help :-( )
>> >>
>> >>
>> >> > You might have read this from the manual. Just in case it could help.
>> >> >
>> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit platform,
>> >> > some of the tests fail
>> >> >
>> >> > A: The g95 compiler incorrectly defines the default Fortran integer
>> >> > as a 64-bit integer while defining Fortran reals as 32-bit values
>> >> > (the Fortran standard requires that INTEGER and REAL be the same
>> >> > size). This was apparently done to allow a Fortran INTEGER to hold
>> >> > the value of a pointer, rather than requiring the programmer to
>> >> > select an INTEGER of a suitable KIND. To force the g95 compiler to
>> >> > correctly implement the Fortran standard, use the -i4 flag. For
>> >> > example, set the environment variable F90FLAGS before configuring
>> >> > MPICH2: setenv F90FLAGS "-i4". G95 users should note that there (at
>> >> > this writing) are two distributions of g95 for 64-bit Linux
>> >> > platforms. One uses 32-bit integers and reals (and conforms to the
>> >> > Fortran standard) and one uses 64-bit integers and 32-bit reals. We
>> >> > recommend using the one that conforms to the standard (note that
>> >> > the standard specifies the ratio of sizes, not the absolute sizes,
>> >> > so a Fortran 95 compiler that used 64 bits for both INTEGER and
>> >> > REAL would also conform to the Fortran standard. However, such a
>> >> > compiler would need to use 128 bits for DOUBLE PRECISION
>> >> > quantities).
>> >> >
>> >> > Yusong
>> >> >
>> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
>> >> >> Hi again,
>> >> >>
>> >> >> I'm still obsessing as to why MPI I/O fails on my 64-bit machine.
>> >> >> I've decided to set MPICH2 aside and work with MPICH v1.2.6, which
>> >> >> is the one version that has worked reliably for me. Here is the
>> >> >> latest thing I've observed.
>> >> >>
>> >> >> I guessed that some integer argument must be getting passed
>> >> >> incorrectly when using a 64-bit machine. I recompiled the code
>> >> >> (I use Absoft Pro Fortran 10.0) and forced the default size of
>> >> >> integers to be 8 bytes. Lo and behold, my I/O routine crashes at
>> >> >> an earlier point with the following interesting message:
>> >> >>
>> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0
>> >> >>
>> >> >> Now, all the elements of the array of sizes should be non-zero
>> >> >> integers, e.g. 64, 64, 175. Is some information on integers
>> >> >> getting screwed up in the 64-bit layout?
>> >> >>
>> >> >> Note that after a few secs. of hanging I also get the following:
>> >> >>
>> >> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
>> >> >>
>> >> >> This is the exact same error I get when running 'make testing'
>> >> >> after having installed MPICH, i.e.:
>> >> >>
>> >> >> *** Testing Type_struct from Fortran ***
>> >> >> Differences in structf.out
>> >> >> 2,7c2
>> >> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does
>> >> >> not fit in Fortran integer
>> >> >> < [0]  Aborting program !
>> >> >> < [0] Aborting program!
>> >> >> < p0_25936:  p4_error: : 972
>> >> >> < Killed by signal 2.
>> >> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
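>> >> >>
>> >> >> (For what it's worth, this MPI_ADDRESS failure is a known 64-bit
>> >> >> limitation: a default Fortran INTEGER cannot hold an 8-byte
>> >> >> address. MPI-2 replaces it with MPI_GET_ADDRESS, which returns an
>> >> >> address-sized integer; a minimal sketch, assuming mpif.h defines
>> >> >> MPI_ADDRESS_KIND:
>> >> >>
>> >> >>       integer(kind=MPI_ADDRESS_KIND) :: disp
>> >> >>       integer :: ierr
>> >> >>       call MPI_GET_ADDRESS(tempout, disp, ierr)
>> >> >>
>> >> >> so the address no longer has to fit in a default INTEGER.)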
>> >> >>
>> >> >> Again, any help would be hugely appreciated. I'll buy you guys
>> >> >> beers!
>> >> >>
>> >> >> Many thanks,
>> >> >>
>> >> >> Peter
>> >> >>
>> >> >>
>> >> >> ----- Original Message ----- 
>> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> To: <mpich-discuss at mcs.anl.gov>
>> >> >> Sent: Monday, May 22, 2006 2:33 PM
>> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit machine (please
>> >> >> help :-( )
>> >> >>
>> >> >>
>> >> >> > Hello folks,
>> >> >> >
>> >> >> > I'm writing this note to ask for some help with running MPI on
>> >> >> > a dual-proc. 64-bit Linux box I just acquired. I've written a
>> >> >> > similar note to the mpi-bugs address but would appreciate any
>> >> >> > additional help from anyone else in the community.
>> >> >> >
>> >> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to work
>> >> >> > wonderfully with everything except for some specific parallel
>> >> >> > I/O calls.
>> >> >> >
>> >> >> > Specifically, whenever there is a call to MPI_FILE_WRITE_ALL
>> >> >> > or MPI_FILE_READ_ALL, a SIGSEGV error pops up. Note that
>> >> >> > these I/O dumps are part of a larger CFD code which has worked
>> >> >> > fine on both a 32-bit dual-proc. Linux workstation and the
>> >> >> > USC-HPCC Linux cluster (where I was a postdoc).
>> >> >> >
>> >> >> > In my message to mpi-bugs, I attached a variety of files that
>> >> >> > could provide additional insight. In this case I'm attaching
>> >> >> > only the Fortran source code; I can gladly provide more material
>> >> >> > to anyone who may be interested. The troublesome Fortran call is:
>> >> >> >
>> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
>> >> >> >  &                        MPI_REAL, MPI_STATUS_IGNORE)
>> >> >> >
>> >> >> > Upon calling this, the program crashes with a SIGSEGV 11 error.
>> >> >> > Evidently, some memory is being accessed out of bounds?
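>> >> >> >
>> >> >> > For reference, the MPI-2 Fortran binding is
>> >> >> >
>> >> >> >   MPI_FILE_WRITE_ALL(fh, buf, count, datatype, status, ierror)
>> >> >> >
>> >> >> > so the call as transcribed above is one argument (ierror) short.
>> >> >> > That may just be a slip in this email, but a missing ierror in a
>> >> >> > Fortran MPI call (which has no compile-time interface checking)
>> >> >> > is itself a classic source of stack corruption and segfaults.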
>> >> >> >
>> >> >> > Tempout is a single-precision (REAL with kind=4) 3-D array whose
>> >> >> > total local number of elements on each processor equals
>> >> >> > local_array_size. If I change MPI_STATUS_IGNORE to status_array,
>> >> >> > ierr (where status_array is appropriately dimensioned), I find
>> >> >> > that upon error, printing out the elements of status_array
>> >> >> > yields huge values. This error is always localized on processor
>> >> >> > (N+1)/2 (proc. numbering goes from 0 to N-1).
>> >> >> >
>> >> >> > I installed MPICH2, only to observe the same results.
>> >> >> > Calls to MPI_FILE_READ_ALL also produce identical effects.
>> >> >> > I'll reiterate that we've never had problems with this code
>> >> >> > on 32-bit machines.
>> >> >> >
>> >> >> > Note that uname -a returns:
>> >> >> >
>> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5
>> >> >> > 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> >
>> >> >> > Am I running into problems because I've got a 64-bit-configured
>> >> >> > Linux on a 64-bit machine?
>> >> >> >
>> >> >> > Any help would be HUGELY appreciated. The ability to use MPI-2
>> >> >> > parallel I/O on our workstation would greatly help us crunch
>> >> >> > through some existing large datafiles generated on 32-bit
>> >> >> > machines.
>> >> >> >
>> >> >> > Cheers,
>> >> >> >
>> >> >> > Peter
>> >> >> >
>> >> >> > -------------------------------------------------------------
>> >> >> > Peter Diamessis
>> >> >> > Assistant Professor
>> >> >> > Environmental Fluid Mechanics & Hydrology
>> >> >> > School of Civil and Environmental Engineering
>> >> >> > Cornell University
>> >> >> > Ithaca, NY 14853
>> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
>> >> >> > pjd38 at cornell.edu
>> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >
>>
>>
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makefile
Type: application/octet-stream
Size: 4588 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060526/92b924b5/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_setup.f
Type: application/octet-stream
Size: 1004 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060526/92b924b5/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.f
Type: application/octet-stream
Size: 4925 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060526/92b924b5/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dim.h
Type: application/octet-stream
Size: 2275 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060526/92b924b5/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.f
Type: application/octet-stream
Size: 3643 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20060526/92b924b5/attachment-0004.obj>

