[MPICH] Parallel I/O problems on 64-bit machine ( please help :-( )

Peter Diamessis pjd38 at cornell.edu
Fri May 26 14:16:04 CDT 2006


Hi folks,

Well, I did read the specific question pointed out by Yusong
in the MPICH2 manual. It seems that this issue is specific to
the GNU g95 compiler. The Absoft F90 compiler uses a default
4-byte length for integers and 8 bytes for reals, i.e. there is no
such conflict. It seems to me that configuring MPICH2 with -i4
is pretty much superfluous.

Nevertheless, I tried it on both MPICH2 and MPICH (v1.2.6 and
v1.2.7p1) and I get the same error. I even tried -i8 for the heck of
it and ran into a whole new suite of problems. I repeat, MPICH v1.2.6
(including I/O) has worked beautifully for me on 32-bit machines. If I
don't call my MPI parallel I/O routines, and more specifically if I
comment out the calls to MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL, the
rest of the code works perfectly fine on a 64-bit machine (including
other MPI I/O calls).

So is this what Ashley pointed out? A bug specific to
MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?

Any additional feedback would be very welcome.

Many thanks in advance,

Peter


----- Original Message ----- 
From: "Ashley Pittman" <ashley at quadrics.com>
To: "Peter Diamessis" <pjd38 at cornell.edu>
Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
Sent: Wednesday, May 24, 2006 7:07 AM
Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
( please help :-( )


>
> The structf failure on 64-bit machines is a bug in the spec, not a bug
> in the compiler.  In effect, the spec itself isn't 64-bit safe.
> Following down the path of the structf error will lead to a dead end.
>
> I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently; I'll
> see if I can dig up my notes about it.
>
> Ashley,
>
>
> On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
>> Thanks a-many, Yusong,
>>
>> I'll contact the Absoft people to see if there is a similar issue
>> with their F90-95 compiler. I have to be on travel tomorrow
>> but I'll get back to this on Thursday.
>>
>> The pointer is much appreciated,
>>
>> Peter
>>
>> ----- Original Message ----- 
>> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> Cc: <mpich-discuss at mcs.anl.gov>
>> Sent: Tuesday, May 23, 2006 5:53 PM
>> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine ( please
>> help:-( )
>>
>>
>> > You might have read this in the manual already, but just in case
>> > it could help:
>> >
>> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit platform,
>> > some of the tests fail
>> >
>> > A: The g95 compiler incorrectly defines the default Fortran integer
>> > as a 64-bit integer while defining Fortran reals as 32-bit values
>> > (the Fortran standard requires that INTEGER and REAL be the same
>> > size). This was apparently done to allow a Fortran INTEGER to hold
>> > the value of a pointer, rather than requiring the programmer to
>> > select an INTEGER of a suitable KIND. To force the g95 compiler to
>> > correctly implement the Fortran standard, use the -i4 flag. For
>> > example, set the environment variable F90FLAGS before configuring
>> > MPICH2: setenv F90FLAGS "-i4". G95 users should note that there are
>> > (at this writing) two distributions of g95 for 64-bit Linux
>> > platforms. One uses 32-bit integers and reals (and conforms to the
>> > Fortran standard) and one uses 32-bit integers and 64-bit reals. We
>> > recommend using the one that conforms to the standard (note that
>> > the standard specifies the ratio of sizes, not the absolute sizes,
>> > so a Fortran 95 compiler that used 64 bits for both INTEGER and
>> > REAL would also conform to the Fortran standard; however, such a
>> > compiler would need to use 128 bits for DOUBLE PRECISION
>> > quantities).
>> >
>> > Yusong
>> >
>> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
>> >> Hi again,
>> >>
>> >> I'm still obsessing over why MPI I/O fails on my 64-bit machine.
>> >> I've decided to set MPICH2 aside and work with MPICH v1.2.6, the
>> >> one version that has worked reliably for me. Here is the latest
>> >> I've observed.
>> >>
>> >> I guessed that some integer argument must be getting passed
>> >> incorrectly on a 64-bit machine. I recompiled the code (I use
>> >> Absoft Pro Fortran 10.0) and forced the default size of integers
>> >> to be 8 bytes. Lo and behold, my I/O routine crashes at an
>> >> earlier point with the following interesting message:
>> >>
>> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0 .
>> >>
>> >> Now, all the elements of the array of sizes should be non-zero
>> >> integers, e.g. 64, 64, 175. Is some information on integers being
>> >> screwed up in the 64-bit layout?
>> >>
>> >> Note that after a few secs. of hanging I also get the following:
>> >>
>> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
>> >>
>> >> This is the exact same error I get when running 'make testing'
>> >> after having installed MPICH, i.e.:
>> >>
>> >> *** Testing Type_struct from Fortran ***
>> >> Differences in structf.out
>> >> 2,7c2
>> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
>> >> < [0]  Aborting program !
>> >> < [0] Aborting program!
>> >> < p0_25936:  p4_error: : 972
>> >> < Killed by signal 2.
>> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
>> >>
>> >> Again, any help would be hugely appreciated. I'll buy you guys
>> >> beers!
>> >>
>> >> Many thanks,
>> >>
>> >> Peter
>> >>
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> To: <mpich-discuss at mcs.anl.gov>
>> >> Sent: Monday, May 22, 2006 2:33 PM
>> >> Subject: [MPICH] Parallel I/O problems on 64-bit machine ( please help
>> >> :-( )
>> >>
>> >>
>> >> > Hello folks,
>> >> >
>> >> > I'm writing this note to ask for some help with running MPI on
>> >> > a dual-processor 64-bit Linux box I just acquired. I've written
>> >> > a similar note to the mpi-bugs address but would appreciate any
>> >> > additional help from anyone else in the community.
>> >> >
>> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to work
>> >> > wonderfully with everything except for some specific parallel
>> >> > I/O calls.
>> >> >
>> >> > Specifically, whenever there is a call to MPI_FILE_WRITE_ALL
>> >> > or MPI_FILE_READ_ALL, a SIGSEGV error pops up. Note that these
>> >> > I/O dumps are part of a larger CFD code which has worked fine
>> >> > on both a 32-bit dual-processor Linux workstation and the
>> >> > USC-HPCC Linux cluster (where I was a postdoc).
>> >> >
>> >> > In my message to mpi-bugs, I did attach a variety of files
>> >> > that could provide additional insight. In this case I'm
>> >> > attaching only the Fortran source code; I can gladly provide
>> >> > more material to anyone who may be interested. The troublesome
>> >> > Fortran call is:
>> >> >
>> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
>> >> >                           MPI_REAL, MPI_STATUS_IGNORE)
>> >> >
>> >> > Upon calling this, the program crashes with a SIGSEGV 11 error.
>> >> > Evidently, some memory is being accessed out of bounds?
>> >> >
>> >> > Tempout is a single-precision (real with kind=4) 3-D array,
>> >> > whose total local number of elements on each processor equals
>> >> > local_array_size. If I change MPI_STATUS_IGNORE to
>> >> > status_array, ierr (where status_array is appropriately
>> >> > dimensioned), I find that upon error, printing out the elements
>> >> > of status_array yields huge values. This error is always
>> >> > localized on processor (N+1)/2 (proc. numbering goes from 0 to
>> >> > N-1).
>> >> >
>> >> > I installed MPICH2 only to observe the same results.
>> >> > Calls to MPI_FILE_READ_ALL will also produce identical effects.
>> >> > I'll reiterate that we've never had problems with this code on
>> >> > 32-bit machines.
>> >> >
>> >> > Note that uname -a returns:
>> >> >
>> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5
>> >> > 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
>> >> >
>> >> > Am I running into problems because I've got a 64-bit-configured
>> >> > Linux on a 64-bit machine?
>> >> >
>> >> > Any help would be HUGELY appreciated. The ability to use MPI-2
>> >> > parallel I/O on our workstation would greatly help us crunch
>> >> > through some existing large datafiles generated on 32-bit
>> >> > machines.
>> >> >
>> >> > Cheers,
>> >> >
>> >> > Peter
>> >> >
>> >> > -------------------------------------------------------------
>> >> > Peter Diamessis
>> >> > Assistant Professor
>> >> > Environmental Fluid Mechanics & Hydrology
>> >> > School of Civil and Environmental Engineering
>> >> > Cornell University
>> >> > Ithaca, NY 14853
>> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
>> >> > pjd38 at cornell.edu
>> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
>> >> >
>> >> >
>> >>
>> >>
>> >
>>
>>
> 




