[MPICH] Parallel I/O problems on 64-bit machine ( pleasehelp:-( )
Rajeev Thakur
thakur at mcs.anl.gov
Fri May 26 14:45:46 CDT 2006
Do you have a small test program you could send us?
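
Something small and self-contained would be ideal. As a rough sketch
(the file name, block size, and displacement math below are
placeholders), the sort of thing we could run directly:

  program writeall_test
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, fh
    integer :: status(MPI_STATUS_SIZE)
    integer(kind=MPI_OFFSET_KIND) :: disp
    real :: buf(1024)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    buf = real(rank)

    ! Each rank writes its own 1024-real block of a shared file.
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'test.out', &
         MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
    disp = rank
    disp = disp * 4096   ! 1024 reals x 4 bytes, held in an 8-byte offset
    call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL, MPI_REAL, 'native', &
         MPI_INFO_NULL, ierr)
    call MPI_FILE_WRITE_ALL(fh, buf, 1024, MPI_REAL, status, ierr)
    call MPI_FILE_CLOSE(fh, ierr)
    call MPI_FINALIZE(ierr)
  end program writeall_test
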
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Peter Diamessis
> Sent: Friday, May 26, 2006 2:16 PM
> To: Ashley Pittman
> Cc: ywang25 at aps.anl.gov; mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
> ( pleasehelp:-( )
>
> Hi folks,
>
> Well, I did read the specific question Yusong pointed out in the
> MPICH2 manual. It seems that this issue is specific to the GNU g95
> compiler. The Absoft F90 compiler uses a default 4-byte length for
> integers and 8 bytes for reals, i.e. there is no such conflict. It
> seems to me that configuring MPICH2 with -i4 is pretty much
> superfluous.
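>
> A quick way to double-check the defaults the compiler is really using
> is a throwaway program along these lines:
>
>   program kinds
>     implicit none
>     integer :: i
>     real :: r
>     ! bit_size() reports the width of the default INTEGER in bits;
>     ! kind() reports the kind numbers the compiler picks by default.
>     print *, 'integer bits:', bit_size(i), ' kinds:', kind(i), kind(r)
>   end program kinds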
>
> Nevertheless, I tried it on both MPICH2 and MPICH (v1.2.6 and
> v1.2.7p1) and I get the same error. I even tried -i8 for the heck
> of it and ran into a whole new suite of problems. I repeat:
> MPICH v1.2.6 (including I/O) has worked beautifully for me on
> 32-bit machines. If I don't call my MPI parallel I/O routines,
> and more specifically if I comment out the calls to
> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL, the rest of the code
> works perfectly fine on a 64-bit machine (including other MPI
> I/O calls).
>
> So is this what Ashley pointed out? A bug specific to
> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?
>
> Any additional feedback would be very welcome.
>
> Many thanks in advance,
>
> Peter
>
>
> ----- Original Message -----
> From: "Ashley Pittman" <ashley at quadrics.com>
> To: "Peter Diamessis" <pjd38 at cornell.edu>
> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
> Sent: Wednesday, May 24, 2006 7:07 AM
> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (
> pleasehelp:-( )
>
>
> >
> > The structf failure on 64-bit machines is a bug in the spec, not a
> > bug in the compiler. In effect, the spec itself isn't 64-bit safe.
> > Following down the path of the structf error will lead to a dead end.
> >
> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently; I'll
> > see if I can dig up my notes about it.
> >
> > Ashley,
> >
> >
> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
> >> Thanks a-many, Yusong,
> >>
> >> I'll contact the Absoft people to see if there is a similar issue
> >> with their F90-95 compiler. I have to be on travel tomorrow
> >> but I'll get back to this on Thursday.
> >>
> >> The pointer is much appreciated,
> >>
> >> Peter
> >>
> >> ----- Original Message -----
> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> Cc: <mpich-discuss at mcs.anl.gov>
> >> Sent: Tuesday, May 23, 2006 5:53 PM
> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
> >> ( please help:-( )
> >>
> >>
> >> > You might have read this in the manual. Just in case it could
> >> > help:
> >> >
> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit platform,
> >> > some of the tests fail
> >> >
> >> > A: The g95 compiler incorrectly defines the default Fortran
> >> > integer as a 64-bit integer while defining Fortran reals as
> >> > 32-bit values (the Fortran standard requires that INTEGER and
> >> > REAL be the same size). This was apparently done to allow a
> >> > Fortran INTEGER to hold the value of a pointer, rather than
> >> > requiring the programmer to select an INTEGER of a suitable
> >> > KIND. To force the g95 compiler to correctly implement the
> >> > Fortran standard, use the -i4 flag. For example, set the
> >> > environment variable F90FLAGS before configuring MPICH2:
> >> >
> >> >     setenv F90FLAGS "-i4"
> >> >
> >> > G95 users should note that there are (at this writing) two
> >> > distributions of g95 for 64-bit Linux platforms. One uses 32-bit
> >> > integers and reals (and conforms to the Fortran standard) and
> >> > one uses 32-bit integers and 64-bit reals. We recommend using
> >> > the one that conforms to the standard (note that the standard
> >> > specifies the ratio of sizes, not the absolute sizes, so a
> >> > Fortran 95 compiler that used 64 bits for both INTEGER and REAL
> >> > would also conform to the Fortran standard. However, such a
> >> > compiler would need to use 128 bits for DOUBLE PRECISION
> >> > quantities).
> >> >
> >> > Yusong
> >> >
> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
> >> >> Hi again,
> >> >>
> >> >> I'm still obsessing over why MPI I/O fails on my 64-bit machine.
> >> >> I've decided to set MPICH2 aside and work with MPICH v1.2.6,
> >> >> which is the one version that has worked reliably for me. Here
> >> >> is the latest thing I observed.
> >> >>
> >> >> I guessed that some integer argument must be getting passed
> >> >> wrong on a 64-bit machine. I recompiled the code (I use Absoft
> >> >> Pro Fortran 10.0) and forced the default size of integers to be
> >> >> 8 bytes. Lo and behold, my I/O routine crashes at an earlier
> >> >> point with the following interesting message:
> >> >>
> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0 .
> >> >>
> >> >> Now, all the elements of array_of_sizes should be non-zero
> >> >> integers, e.g. 64, 64, 175. Is some information on integers
> >> >> being screwed up in the 64-bit layout?
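> >> >>
> >> >> For reference, the call in question is shaped roughly like this
> >> >> (the subsizes and starts values here are made up; everything is
> >> >> a plain default INTEGER, which is presumably why forcing -i8
> >> >> breaks it):
> >> >>
> >> >>   integer :: sizes(3), subsizes(3), starts(3), newtype, ierr
> >> >>   sizes    = (/ 64, 64, 175 /)  ! global array shape
> >> >>   subsizes = (/ 64, 64, 88 /)   ! local piece on this processor
> >> >>   starts   = (/ 0, 0, 0 /)     ! offsets are 0-based
> >> >>   call MPI_TYPE_CREATE_SUBARRAY(3, sizes, subsizes, starts, &
> >> >>        MPI_ORDER_FORTRAN, MPI_REAL, newtype, ierr)
> >> >>   call MPI_TYPE_COMMIT(newtype, ierr)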
> >> >>
> >> >> Note that after a few secs of hanging I also get the following:
> >> >>
> >> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >> >>
> >> >> This is the exact same error I get when running 'make testing'
> >> >> after having installed MPICH, i.e.:
> >> >>
> >> >> *** Testing Type_struct from Fortran ***
> >> >> Differences in structf.out
> >> >> 2,7c2
> >> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS
> >> >> does not fit in Fortran integer
> >> >> < [0] Aborting program !
> >> >> < [0] Aborting program!
> >> >> < p0_25936: p4_error: : 972
> >> >> < Killed by signal 2.
> >> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >> >>
> >> >> Again, any help would be hugely appreciated. I'll buy you guys
> >> >> beers!
> >> >>
> >> >> Many thanks,
> >> >>
> >> >> Peter
> >> >>
> >> >>
> >> >> ----- Original Message -----
> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> To: <mpich-discuss at mcs.anl.gov>
> >> >> Sent: Monday, May 22, 2006 2:33 PM
> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit machine
> >> >> ( please help :-( )
> >> >>
> >> >>
> >> >> > Hello folks,
> >> >> >
> >> >> > I'm writing this note to ask for some help with running MPI on
> >> >> > a dual-proc. 64-bit Linux box I just acquired. I've written a
> >> >> > similar note to the mpi-bugs address but would appreciate any
> >> >> > additional help from anyone else in the community.
> >> >> >
> >> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to work
> >> >> > wonderfully with everything except for some specific parallel
> >> >> > I/O calls.
> >> >> >
> >> >> > Specifically, whenever there is a call to MPI_FILE_WRITE_ALL
> >> >> > or MPI_FILE_READ_ALL, a SIGSEGV error pops up. Note that
> >> >> > these I/O dumps are part of a larger CFD code which has worked
> >> >> > fine on both a 32-bit dual-proc. Linux workstation and the
> >> >> > USC-HPCC Linux cluster (where I was a postdoc).
> >> >> >
> >> >> > In my message to mpi-bugs, I did attach a variety of files
> >> >> > that could provide additional insight. In this case I'm
> >> >> > attaching only the Fortran source code; I can gladly provide
> >> >> > more material to anyone who may be interested. The troublesome
> >> >> > Fortran call is:
> >> >> >
> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, &
> >> >> >        MPI_REAL, MPI_STATUS_IGNORE)
> >> >> >
> >> >> > Upon calling this, the program crashes with a SIGSEGV 11
> >> >> > error. Evidently, some memory is accessed out of bounds?
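> >> >> >
> >> >> > (For reference, the full Fortran binding per the MPI standard
> >> >> > takes an error argument at the end as well:
> >> >> >
> >> >> >   call MPI_FILE_WRITE_ALL(fh, buf, count, datatype, status, ierror)
> >> >> >
> >> >> > so if ierror were missing in the actual call, that alone could
> >> >> > explain a crash.)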
> >> >> >
> >> >> > Tempout is a single-precision (real with kind=4) 3-D array,
> >> >> > which has a total local number of elements on each processor
> >> >> > equal to local_array_size. If I change MPI_STATUS_IGNORE to
> >> >> > status_array, ierr (where status_array is appropriately
> >> >> > dimensioned), I find that upon error, printing out the
> >> >> > elements of status_array yields huge values. This error is
> >> >> > always localized on processor (N+1)/2 (proc. numbering goes
> >> >> > from 0 to N-1).
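> >> >> >
> >> >> > (By "appropriately dimensioned" I mean the standard
> >> >> > declaration,
> >> >> >
> >> >> >   integer :: status_array(MPI_STATUS_SIZE)
> >> >> >
> >> >> > with MPI_STATUS_SIZE taken from mpif.h.)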
> >> >> >
> >> >> > I installed MPICH2 only to observe the same results. Calls to
> >> >> > MPI_FILE_READ_ALL also produce identical effects. I'll
> >> >> > reiterate that we've never had problems with this code on
> >> >> > 32-bit machines.
> >> >> >
> >> >> > Note that uname -a returns:
> >> >> >
> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5
> >> >> > 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
> >> >> >
> >> >> > Am I running into problems because I've got a 64-bit-configured
> >> >> > Linux on a 64-bit machine?
> >> >> >
> >> >> > Any help would be HUGELY appreciated. The ability to use MPI-2
> >> >> > parallel I/O on our workstation would greatly help us crunch
> >> >> > through some existing large datafiles generated on 32-bit
> >> >> > machines.
> >> >> >
> >> >> > Cheers,
> >> >> >
> >> >> > Peter
> >> >> >
> >> >> > -------------------------------------------------------------
> >> >> > Peter Diamessis
> >> >> > Assistant Professor
> >> >> > Environmental Fluid Mechanics & Hydrology
> >> >> > School of Civil and Environmental Engineering
> >> >> > Cornell University
> >> >> > Ithaca, NY 14853
> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
> >> >> > pjd38 at cornell.edu
> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> >>
> >
>
>
>