[MPICH] Parallel I/O problems on 64-bit machine ( please help:-( )

Ashley Pittman ashley at quadrics.com
Wed May 24 06:07:36 CDT 2006


The structf failure on 64-bit machines is a bug in the spec, not a bug
in the compiler.  In effect, the spec itself isn't 64-bit safe:
MPI_ADDRESS has to return the address in a default Fortran INTEGER,
which cannot hold a 64-bit pointer.  Following the structf error any
further will lead to a dead end.
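
For reference, the MPI-2 replacement for MPI_ADDRESS uses an
address-sized integer kind and so is 64-bit safe.  A minimal sketch
(my own example, assuming an MPI-2 library such as MPICH2 is
available):

      program addr_demo
c     Minimal sketch: the MPI-1 call MPI_ADDRESS(buf, iaddr, ierr)
c     returns the address in a default INTEGER, which cannot hold a
c     64-bit address.  The MPI-2 form below uses an address-sized
c     integer kind instead.
      implicit none
      include 'mpif.h'
      integer ierr
      integer buf(10)
      integer(kind=MPI_ADDRESS_KIND) iaddr
      call MPI_INIT(ierr)
      call MPI_GET_ADDRESS(buf, iaddr, ierr)
      print *, 'address of buf = ', iaddr
      call MPI_FINALIZE(ierr)
      end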

I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently; I'll see
if I can dig up my notes about it.

Ashley,


On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
> Thanks a-many, Yusong,
> 
> I'll contact the Absoft people to see if there is a similar issue
> with their F90/F95 compiler. I have to travel tomorrow,
> but I'll get back to this on Thursday.
> 
> The pointer is much appreciated,
> 
> Peter
> 
> ----- Original Message ----- 
> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> To: "Peter Diamessis" <pjd38 at cornell.edu>
> Cc: <mpich-discuss at mcs.anl.gov>
> Sent: Tuesday, May 23, 2006 5:53 PM
> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine ( please 
> help:-( )
> 
> 
> > You might have already read this in the manual, but just in case it helps:
> >
> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit platform, some of
> > the tests fail
> >
> > A: The g95 compiler incorrectly defines the default Fortran integer as
> > a 64-bit integer while defining Fortran reals as 32-bit values (the
> > Fortran standard requires that INTEGER and REAL be the same size). This
> > was apparently done to allow a Fortran INTEGER to hold the value of a
> > pointer, rather than requiring the programmer to select an INTEGER of a
> > suitable KIND. To force the g95 compiler to correctly implement the
> > Fortran standard, use the -i4 flag. For example, set the environment
> > variable F90FLAGS before configuring MPICH2:
> >
> >     setenv F90FLAGS "-i4"
> >
> > G95 users should note that there are (at this writing) two
> > distributions of g95 for 64-bit Linux platforms. One uses 32-bit
> > integers and reals (and conforms to the Fortran standard) and one uses
> > 64-bit integers and 32-bit reals. We recommend using the one that
> > conforms to the standard. (Note that the standard specifies the ratio
> > of sizes, not the absolute sizes, so a Fortran 95 compiler that used 64
> > bits for both INTEGER and REAL would also conform to the Fortran
> > standard. However, such a compiler would need to use 128 bits for
> > DOUBLE PRECISION quantities.)
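> >
> > (Not from the manual, just an illustration.) A tiny check program can
> > show what a given compiler actually uses for the default kinds; on g95
> > and gfortran the kind number of the default types happens, I believe,
> > to equal the size in bytes, so with -i4 you should see kind 4 and 32
> > bits for INTEGER:
> >
> >       program kindcheck
> > c     Illustration only: print the kind and bit width of the default
> > c     INTEGER and the kind of the default REAL.  The Fortran standard
> > c     requires default INTEGER and REAL to occupy the same storage,
> > c     so a 64-bit default INTEGER next to a 32-bit default REAL is
> > c     not conforming.
> >       implicit none
> >       integer i
> >       real r
> >       print *, 'default INTEGER: kind', kind(i), bit_size(i), 'bits'
> >       print *, 'default REAL:    kind', kind(r)
> >       end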
> >
> > Yusong
> >
> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
> >> Hi again,
> >>
> >> I'm still obsessing over why MPI I/O fails on my 64-bit machine.
> >> I've decided to set MPICH2 aside and work with MPICH v1.2.6, which
> >> is the one version that has worked reliably for me. Here is the
> >> latest thing I've observed.
> >>
> >> I guessed that some integer argument must be passed incorrectly on
> >> a 64-bit machine. I recompiled the code (I use Absoft Pro Fortran
> >> 10.0) and forced the default size of integers to be 8 bytes. Lo and
> >> behold, my I/O routine now crashes at an earlier point with the
> >> following interesting message:
> >>
> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0 .
> >>
> >> Now, all the elements of the array of sizes should be non-zero
> >> integers, e.g. 64, 64, 175. Is some integer information being
> >> screwed up in the 64-bit layout?
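> >>
> >> For what it's worth, here is a minimal, self-contained sketch of the
> >> kind of call involved (a placeholder example, not our actual code;
> >> the 64 x 64 x 175 sizes and the slab decomposition are made up).
> >> Every integer argument is a default INTEGER, and MPICH's Fortran
> >> interface is normally built for 4-byte default INTEGERs, so an
> >> application compiled with 8-byte default integers will pass arguments
> >> the library misreads, e.g. array_of_sizes:
> >>
> >>       program subarray_demo
> >> c     Hypothetical sketch: build the filetype for a 64 x 64 x 175
> >> c     global REAL array split along the third dimension.
> >>       implicit none
> >>       include 'mpif.h'
> >>       integer sizes(3), subsizes(3), starts(3)
> >>       integer filetype, nprocs, myrank, ierr
> >>       call MPI_INIT(ierr)
> >>       call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
> >>       call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
> >>       sizes(1) = 64
> >>       sizes(2) = 64
> >>       sizes(3) = 175
> >>       subsizes(1) = 64
> >>       subsizes(2) = 64
> >> c     Simplified decomposition; a real code would give the last rank
> >> c     any remainder.
> >>       subsizes(3) = sizes(3) / nprocs
> >>       starts(1) = 0
> >>       starts(2) = 0
> >>       starts(3) = myrank * subsizes(3)
> >>       call MPI_TYPE_CREATE_SUBARRAY(3, sizes, subsizes, starts,
> >>      &     MPI_ORDER_FORTRAN, MPI_REAL, filetype, ierr)
> >>       call MPI_TYPE_COMMIT(filetype, ierr)
> >>       call MPI_TYPE_FREE(filetype, ierr)
> >>       call MPI_FINALIZE(ierr)
> >>       end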
> >>
> >> Note that after a few seconds of hanging I also get the following:
> >>
> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >>
> >> This is the exact same error I get when running 'make testing' after
> >> having installed MPICH, i.e.:
> >>
> >> *** Testing Type_struct from Fortran ***
> >> Differences in structf.out
> >> 2,7c2
> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
> >> < [0]  Aborting program !
> >> < [0] Aborting program!
> >> < p0_25936:  p4_error: : 972
> >> < Killed by signal 2.
> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >>
> >> Again, any help would be hugely appreciated. I'll buy you guys beers!
> >>
> >> Many thanks,
> >>
> >> Peter
> >>
> >>
> >> ----- Original Message ----- 
> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
> >> To: <mpich-discuss at mcs.anl.gov>
> >> Sent: Monday, May 22, 2006 2:33 PM
> >> Subject: [MPICH] Parallel I/O problems on 64-bit machine ( please help 
> >> :-( )
> >>
> >>
> >> > Hello folks,
> >> >
> >> > I'm writing this note to ask for some help with running MPI on
> >> > a dual-processor 64-bit Linux box I just acquired. I've sent a
> >> > similar note to the mpi-bugs address, but would appreciate any
> >> > additional help from anyone else in the community.
> >> >
> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to work
> >> > wonderfully with everything except for some specific parallel I/O
> >> > calls.
> >> >
> >> > Specifically, whenever there is a call to MPI_FILE_WRITE_ALL
> >> > or MPI_FILE_READ_ALL, a SIGSEGV error pops up. Note that
> >> > these I/O dumps are part of a larger CFD code which
> >> > has worked fine on either a 32-bit dual-processor Linux workstation
> >> > or the USC-HPCC Linux cluster (where I was a postdoc).
> >> >
> >> > In my message to mpi-bugs, I attached a variety of files that
> >> > could provide additional insight. Here I'm attaching only
> >> > the Fortran source code; I can gladly provide more material to
> >> > anyone who may be interested. The troublesome Fortran call is:
> >> >
> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
> >> >  &     MPI_REAL, MPI_STATUS_IGNORE)
> >> >
> >> > Upon calling this, the program crashes with a SIGSEGV (signal 11)
> >> > error. Evidently, some memory is accessed out of bounds?
> >> >
> >> > Tempout is a single-precision (REAL with kind=4) 3-D array, whose
> >> > total number of local elements on each processor equals
> >> > local_array_size. If I change MPI_STATUS_IGNORE to status_array,
> >> > ierr (where status_array is appropriately dimensioned), I find that
> >> > upon error, printing out the elements of status_array yields huge
> >> > values. This error is always localized on processor (N+1)/2
> >> > (processor numbering goes from 0 to N-1).
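> >> >
> >> > For reference, here is a minimal self-contained sketch of the same
> >> > kind of write with an explicit status array and error code (the
> >> > names, sizes and contiguous file layout are placeholders, not our
> >> > actual code). Note that the Fortran binding of MPI_FILE_WRITE_ALL
> >> > takes a status argument and a trailing ierror argument:
> >> >
> >> >       program writeall_demo
> >> > c     Hypothetical sketch: each rank writes its local block of
> >> > c     REALs at its own offset, with an explicit status array and
> >> > c     error code.
> >> >       implicit none
> >> >       include 'mpif.h'
> >> >       integer local_array_size
> >> >       parameter (local_array_size = 1000)
> >> >       real tempout(local_array_size)
> >> >       integer fh, ierr, myrank
> >> >       integer status(MPI_STATUS_SIZE)
> >> >       integer(kind=MPI_OFFSET_KIND) disp
> >> >       call MPI_INIT(ierr)
> >> >       call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
> >> >       tempout = real(myrank)
> >> > c     Contiguous slabs, one per rank (4 bytes per REAL); a real
> >> > c     code would set a subarray filetype in MPI_FILE_SET_VIEW.
> >> >       disp = myrank * local_array_size * 4
> >> >       call MPI_FILE_OPEN(MPI_COMM_WORLD, 'demo.dat',
> >> >      &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh,
> >> >      &     ierr)
> >> >       call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL, MPI_REAL,
> >> >      &     'native', MPI_INFO_NULL, ierr)
> >> >       call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
> >> >      &     MPI_REAL, status, ierr)
> >> >       call MPI_FILE_CLOSE(fh, ierr)
> >> >       call MPI_FINALIZE(ierr)
> >> >       end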
> >> >
> >> > I installed MPICH2 only to observe the same results.
> >> > Calls to MPI_FILE_READ_ALL will also produce identical effects.
> >> > I'll reiterate that we've never had problems with this code on 32-bit
> >> > machines.
> >> >
> >> > Note that uname -a returns:
> >> >
> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
> >> >
> >> > Am I running into problems because I've got a 64-bit-configured
> >> > Linux on a 64-bit machine?
> >> >
> >> > Any help would be HUGELY appreciated. The ability to use MPI-2
> >> > parallel I/O on our workstation would greatly help us crunch
> >> > through some existing large datafiles generated on 32-bit machines.
> >> >
> >> > Cheers,
> >> >
> >> > Peter
> >> >
> >> > -------------------------------------------------------------
> >> > Peter Diamessis
> >> > Assistant Professor
> >> > Environmental Fluid Mechanics & Hydrology
> >> > School of Civil and Environmental Engineering
> >> > Cornell University
> >> > Ithaca, NY 14853
> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
> >> > pjd38 at cornell.edu
> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
> >> >
> >> >
> >>
> >>
> > 
> 
> 



