[MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems on 64-bit machine (pleasehelp:-( )

Rajeev Thakur thakur at mcs.anl.gov
Fri Jun 2 14:33:10 CDT 2006


Yes, integers are usually 4 bytes even on 64-bit machines. 64-bit means the
address space is 64 bits; sizeof(void *) in C would be 8 bytes.

Does your program work now on IA-64? I tested on a 64-bit Sun.

Rajeev
 

> -----Original Message-----
> From: Peter Diamessis [mailto:pjd38 at cornell.edu] 
> Sent: Friday, June 02, 2006 2:20 PM
> To: Rajeev Thakur; ywang25 at aps.anl.gov
> Cc: 'Ashley Pittman'; mpich-discuss at mcs.anl.gov; 
> mpich2-maint at mcs.anl.gov
> Subject: Re: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O 
> problems on 64-bit machine (pleasehelp:-( )
> 
> By the way Rajeev,
> 
> I find that using the -i8 option with Absoft Pro Fortran will only make
> things worse. Apparently, 4-byte integers are the default even in the
> 64-bit mode of this compiler.
> 
> Cheers,
> 
> Peter
> 
> ----- Original Message ----- 
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> To: "'Peter Diamessis'" <pjd38 at cornell.edu>; <ywang25 at aps.anl.gov>
> Cc: "'Ashley Pittman'" <ashley at quadrics.com>; 
> <mpich-discuss at mcs.anl.gov>; 
> <mpich2-maint at mcs.anl.gov>
> Sent: Wednesday, May 31, 2006 8:40 PM
> Subject: RE: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems on
> 64-bit machine (pleasehelp:-( )
> 
> 
> > Peter,
> >      The problem is a simple one, but it took me a very long time to
> > figure out. You forgot to add the "ierr" parameter to
> > MPI_File_write_all :-). It's a common mistake people make in Fortran
> > programs, and I should have been more vigilant, but I did all sorts
> > of debugging and stripped your code down until there was nothing left
> > in it before I noticed the missing parameter :-).
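In the Fortran bindings, every MPI routine takes a trailing INTEGER error argument; leaving it off shifts the actual-argument list, so the library reads past the arguments supplied and typically segfaults. A sketch of the corrected call, reusing the variable names from the attached code (ierr declared as a plain INTEGER):

```fortran
      integer :: ierr

      ! The trailing ierr argument is mandatory in the Fortran bindings.
      call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, MPI_REAL, &
                              MPI_STATUS_IGNORE, ierr)
```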
> >
> > Rajeev
> >
> >
> >> -----Original Message-----
> >> From: Peter Diamessis [mailto:pjd38 at cornell.edu]
> >> Sent: Friday, May 26, 2006 4:34 PM
> >> To: ywang25 at aps.anl.gov
> >> Cc: Ashley Pittman; thakur at mcs.anl.gov;
> >> mpich-discuss at mcs.anl.gov; mpich2-maint at mcs.anl.gov
> >> Subject: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems
> >> on 64-bit machine (pleasehelp:-( )
> >>
> >> Hello back YoSung and Rajeev,
> >>
> >> Indeed, I did try configuring MPICH2 (and MPICH1) to accommodate
> >> 8-byte integers. When I do that I get errors from
> >> MPI_TYPE_CREATE_SUBARRAY, which state that the first element of the
> >> array of sizes is set to 0. However, my global array has no zero
> >> dimension, as confirmed by printing out gsizes(i) (i=1,...,3). Hm?
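For reference, a sketch of the subarray construction under discussion, using the gsizes values quoted later in the thread (64, 64, 175) and a 1-D decomposition along the last dimension. The names nprocs and myrank are illustrative (they are not in the attached code), nprocs is assumed to divide the last dimension evenly, and default 4-byte INTEGERs are assumed so the arguments match an MPICH built without -i8:

```fortran
      integer :: gsizes(3), lsizes(3), starts(3), filetype, ierr

      gsizes = (/ 64, 64, 175 /)               ! global array dimensions
      lsizes = (/ 64, 64, 175 / nprocs /)      ! local block on this rank
      starts = (/ 0, 0, myrank * (175 / nprocs) /)  ! MPI starts are 0-based

      call MPI_TYPE_CREATE_SUBARRAY(3, gsizes, lsizes, starts, &
                                    MPI_ORDER_FORTRAN, MPI_REAL, &
                                    filetype, ierr)
      call MPI_TYPE_COMMIT(filetype, ierr)
```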
> >>
> >> I really apologize if I'm troubling you guys with something totally
> >> simple. Following Rajeev's request, I've attached a sample program.
> >> It consists of three Fortran source codes:
> >> a) Main: the main driver.
> >> b) mpi_setup: I had originally planned to use a 2-D domain
> >> decomposition but I've ended up working with 1-D, so this is more or
> >> less superfluous. It's only needed when setting up the local
> >> starting indices.
> >> c) output: the routine which gives me problems.
> >>
> >> I've attached the corresponding makefile to compile with Absoft
> >> mpif90. The file dim.h simply specifies the dimensions of some
> >> arrays, in particular the test array u(...,...,...), which is dumped
> >> out in single precision.
> >>
> >> When I run this simple code on my 32-bit machine it works without a
> >> problem. When I run it on the 64-bit machine I get the same old
> >> SIGSEGV 11 error from MPI_FILE_WRITE_ALL.
> >>
> >> Again, I hope I'm not being a hassle.
> >>
> >> Any insight on this sample code would be greatly appreciated.
> >> I still owe
> >> you folks beers :-)
> >>
> >> Cheers,
> >>
> >> Peter
> >>
> >> ----- Original Message ----- 
> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> Cc: "Ashley Pittman" <ashley at quadrics.com>; <thakur at mcs.anl.gov>;
> >> <mpich-discuss at mcs.anl.gov>
> >> Sent: Friday, May 26, 2006 5:04 PM
> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
> >> (pleasehelp:-( )
> >>
> >>
> >> > According to the manual, the Fortran standard requires that
> >> > INTEGER and REAL be the same size. The compiler you used doesn't
> >> > conform to the standard, assuming the manual is still right (i.e.,
> >> > the standard has not changed). For your case, it seems to me you
> >> > may need to force the integer to be 8 bytes when configuring
> >> > MPICH2. Furthermore, 128 bits for DOUBLE PRECISION quantities
> >> > would then be required for such a compiler. It shouldn't be hard
> >> > to check the size of those variables with a small test program.
> >> >
> >> > These are just some suggestions to try. I can't eliminate the
> >> > possibility that MPI_FILE_WRITE_ALL has its own problem.
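Such a size check is only a few lines with a compiler supporting the Fortran 2008 storage_size intrinsic (compilers of this thread's era may lack it; kind/range queries are the older alternative); a sketch:

```fortran
      program check_sizes
      implicit none
      integer :: i
      real :: r
      double precision :: d
      ! storage_size reports the storage size of a scalar in bits.
      print *, 'INTEGER bits          :', storage_size(i)
      print *, 'REAL bits             :', storage_size(r)
      print *, 'DOUBLE PRECISION bits :', storage_size(d)
      end program check_sizes
```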
> >> >
> >> > Yusong
> >> >
> >> >
> >> > On Fri, 2006-05-26 at 15:16 -0400, Peter Diamessis wrote:
> >> >> Hi folks,
> >> >>
> >> >> Well, I did read the specific question pointed out by Yusong
> >> >> in the MPICH2 manual. It seems that this is specific to the GNU
> >> >> F95 compiler. The Absoft F90 compiler uses a default 4-byte
> >> >> length for integers and 8 bytes for reals, i.e. there is no such
> >> >> conflict. It seems to me that configuring MPICH2 with -i4 is
> >> >> pretty much superfluous.
> >> >>
> >> >> Nevertheless, I tried it on both MPICH2 as well as MPICH (v1.2.6
> >> >> and v1.2.7p1) and I get the same error. I even tried -i8 for the
> >> >> heck of it and ran into a whole new suite of problems. I repeat,
> >> >> MPICH v1.2.6 (including I/O) has worked beautifully for me on
> >> >> 32-bit machines. If I don't call my MPI parallel I/O routines,
> >> >> and more specifically if I comment out the calls to
> >> >> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL, the rest of the code
> >> >> works perfectly fine on a 64-bit machine (including other MPI
> >> >> I/O calls).
> >> >>
> >> >> So is this what Ashley pointed out? A bug specific to
> >> >> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?
> >> >>
> >> >> Any additional feedback would be very welcome.
> >> >>
> >> >> Many thanks in advance,
> >> >>
> >> >> Peter
> >> >>
> >> >>
> >> >> ----- Original Message ----- 
> >> >> From: "Ashley Pittman" <ashley at quadrics.com>
> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
> >> >> Sent: Wednesday, May 24, 2006 7:07 AM
> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (
> >> >> pleasehelp:-( )
> >> >>
> >> >>
> >> >> >
> >> >> > The structf failure on 64-bit machines is a bug in the spec,
> >> >> > not a bug in the compiler. In effect the spec itself isn't
> >> >> > 64-bit safe. Following down the path of the structf error will
> >> >> > lead to a dead end.
> >> >> >
> >> >> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently;
> >> >> > I'll see if I can dig up my notes about it.
> >> >> >
> >> >> > Ashley
> >> >> >
> >> >> >
> >> >> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
> >> >> >> Thanks a-many YoSung,
> >> >> >>
> >> >> >> I'll contact the Absoft people to see if there is a
> >> similar issue
> >> >> >> with their F90-95 compiler. I have to be on travel tomorrow
> >> >> >> but I'll get back to this on Thursday.
> >> >> >>
> >> >> >> The pointer is much appreciated,
> >> >> >>
> >> >> >> Peter
> >> >> >>
> >> >> >> ----- Original Message ----- 
> >> >> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> >> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> >> Cc: <mpich-discuss at mcs.anl.gov>
> >> >> >> Sent: Tuesday, May 23, 2006 5:53 PM
> >> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit
> >> machine ( please
> >> >> >> help:-( )
> >> >> >>
> >> >> >>
> >> >> >> > You might have read this in the manual. Just in case it
> >> >> >> > could help.
> >> >> >> >
> >> >> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit
> >> >> >> > platform, some of the tests fail
> >> >> >> >
> >> >> >> > A: The g95 compiler incorrectly defines the default Fortran
> >> >> >> > integer as a 64-bit integer while defining Fortran reals as
> >> >> >> > 32-bit values (the Fortran standard requires that INTEGER
> >> >> >> > and REAL be the same size). This was apparently done to
> >> >> >> > allow a Fortran INTEGER to hold the value of a pointer,
> >> >> >> > rather than requiring the programmer to select an INTEGER of
> >> >> >> > a suitable KIND. To force the g95 compiler to correctly
> >> >> >> > implement the Fortran standard, use the -i4 flag. For
> >> >> >> > example, set the environment variable F90FLAGS before
> >> >> >> > configuring MPICH2: setenv F90FLAGS "-i4"
> >> >> >> > G95 users should note that there are (at this writing) two
> >> >> >> > distributions of g95 for 64-bit Linux platforms. One uses
> >> >> >> > 32-bit integers and reals (and conforms to the Fortran
> >> >> >> > standard) and one uses 64-bit integers and 32-bit reals. We
> >> >> >> > recommend using the one that conforms to the standard (note
> >> >> >> > that the standard specifies the ratio of sizes, not the
> >> >> >> > absolute sizes, so a Fortran 95 compiler that used 64 bits
> >> >> >> > for both INTEGER and REAL would also conform to the Fortran
> >> >> >> > standard. However, such a compiler would need to use 128
> >> >> >> > bits for DOUBLE PRECISION quantities).
> >> >> >> > Yusong
> >> >> >> >
> >> >> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
> >> >> >> >> Hi again,
> >> >> >> >>
> >> >> >> >> I'm still obsessing over why MPI I/O fails on my 64-bit
> >> >> >> >> machine. I've decided to set MPICH2 aside and work with
> >> >> >> >> MPICH v1.2.6, which is the one version that has worked
> >> >> >> >> reliably for me. Here is the latest thing I observed.
> >> >> >> >>
> >> >> >> >> I guessed that some integer argument must be passed wrong
> >> >> >> >> when using a 64-bit machine. I recompiled the code (I use
> >> >> >> >> Absoft Pro Fortran 10.0) and forced the default size of
> >> >> >> >> integers to be 8 bytes. Lo and behold, my I/O routine
> >> >> >> >> crashes at an earlier point with the following interesting
> >> >> >> >> message:
> >> >> >> >>
> >> >> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in
> >> >> >> >> array_of_sizes[1]=0 .
> >> >> >> >>
> >> >> >> >> Now, all the elements of the array of sizes should be
> >> >> >> >> non-zero integers, e.g. 64, 64, 175. Is some information
> >> >> >> >> on integers being screwed up in the 64-bit layout?
> >> >> >> >>
> >> >> >> >> Note that after a few secs of hanging I also get the
> >> >> >> >> following:
> >> >> >> >>
> >> >> >> >> p0_25936: (0.089844) net_send: could not write to
> >> fd=4, errno = 32
> >> >> >> >>
> >> >> >> >> This is the exact same error I get when running '
> >> make testing '
> >> >> >> >> after
> >> >> >> >> having installed MPICH, i.e.:
> >> >> >> >>
> >> >> >> >> *** Testing Type_struct from Fortran ***
> >> >> >> >> Differences in structf.out
> >> >> >> >> 2,7c2
> >> >> >> >> < 0 - MPI_ADDRESS : Address of location given to
> >> MPI_ADDRESS does
> >> >> >> >> not
> >> >> >> >> fit
> >> >> >> >> in
> >> >> >> >> Fortran integer
> >> >> >> >> < [0]  Aborting program !
> >> >> >> >> < [0] Aborting program!
> >> >> >> >> < p0_25936:  p4_error: : 972
> >> >> >> >> < Killed by signal 2.
> >> >> >> >> < p0_25936: (0.089844) net_send: could not write to
> >> fd=4, errno =
> >> >> >> >> 32
> >> >> >> >>
> >> >> >> >> Again, any help would be hugely appreciated. I'll
> >> buy you guys
> >> >> >> >> beers !
> >> >> >> >>
> >> >> >> >> Many thanks,
> >> >> >> >>
> >> >> >> >> Peter
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> ----- Original Message ----- 
> >> >> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> >> >> To: <mpich-discuss at mcs.anl.gov>
> >> >> >> >> Sent: Monday, May 22, 2006 2:33 PM
> >> >> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit
> >> machine ( please
> >> >> >> >> help
> >> >> >> >> :-( )
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> > Hello folks,
> >> >> >> >> >
> >> >> >> >> > I'm writing this note to ask for some help with running
> >> >> >> >> > MPI on a dual-proc. 64-bit Linux box I just acquired.
> >> >> >> >> > I've written a similar note to the mpi-bugs address but
> >> >> >> >> > would appreciate any additional help from anyone else in
> >> >> >> >> > the community.
> >> >> >> >> >
> >> >> >> >> > I'm using MPICH v1.2.7p1 which, when tested, seems to
> >> >> >> >> > work wonderfully with everything except for some specific
> >> >> >> >> > parallel I/O calls.
> >> >> >> >> >
> >> >> >> >> > Specifically, whenever there is a call to
> >> >> >> >> > MPI_FILE_WRITE_ALL or MPI_FILE_READ_ALL, a SIGSEGV error
> >> >> >> >> > pops up. Note that these I/O dumps are part of a greater
> >> >> >> >> > CFD code which has worked fine on either a 32-bit
> >> >> >> >> > dual-proc. Linux workstation or the USC-HPCC Linux
> >> >> >> >> > cluster (where I was a postdoc).
> >> >> >> >> >
> >> >> >> >> > In my message to mpi-bugs, I did attach a variety of
> >> >> >> >> > files that could provide additional insight. In this case
> >> >> >> >> > I'm attaching only the Fortran source code; I can gladly
> >> >> >> >> > provide more material to anyone who may be interested.
> >> >> >> >> > The troublesome Fortran call is:
> >> >> >> >> >
> >> >> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
> >> >> >> >> >                           MPI_REAL, MPI_STATUS_IGNORE)
> >> >> >> >> >
> >> >> >> >> > Upon calling this, the program crashes with a SIGSEGV 11
> >> >> >> >> > error. Evidently, some memory is accessed out of bounds?
> >> >> >> >> >
> >> >> >> >> > Tempout is a single-precision (REAL with kind=4) 3-D
> >> >> >> >> > array, whose total local number of elements on each
> >> >> >> >> > processor equals local_array_size. If I change
> >> >> >> >> > MPI_STATUS_IGNORE to status_array, ierr (where
> >> >> >> >> > status_array is appropriately dimensioned), I find that
> >> >> >> >> > upon error, printing out the elements of status_array
> >> >> >> >> > yields huge values. This error is always localized on
> >> >> >> >> > processor (N+1)/2 (proc. numbering goes from 0 to N-1).
> >> >> >> >> >
> >> >> >> >> > I installed MPICH2, only to observe the same results.
> >> >> >> >> > Calls to MPI_FILE_READ_ALL will also produce identical
> >> >> >> >> > effects. I'll reiterate that we've never had problems
> >> >> >> >> > with this code on 32-bit machines.
> >> >> >> >> >
> >> >> >> >> > Note that uname -a returns:
> >> >> >> >> >
> >> >> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP
> >> Wed Jan 5
> >> >> >> >> > 19:29:47
> >> >> >> >> > EST
> >> >> >> >> > 2005 x86_64 x86_64 x86_64 GNU/Linux
> >> >> >> >> >
> >> >> >> >> > Am I running into problems because I've got a 64-bit
> >> >> >> >> > configured Linux on a 64-bit machine?
> >> >> >> >> >
> >> >> >> >> > Any help would be HUGELY appreciated. The ability to use
> >> >> >> >> > MPI-2 parallel I/O on our workstation would greatly help
> >> >> >> >> > us crunch through some existing large datafiles generated
> >> >> >> >> > on 32-bit machines.
> >> >> >> >> >
> >> >> >> >> > Cheers,
> >> >> >> >> >
> >> >> >> >> > Peter
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > -------------------------------------------------------------
> >> >> >> >> > Peter Diamessis
> >> >> >> >> > Assistant Professor
> >> >> >> >> > Environmental Fluid Mechanics & Hydrology
> >> >> >> >> > School of Civil and Environmental Engineering
> >> >> >> >> > Cornell University
> >> >> >> >> > Ithaca, NY 14853
> >> >> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
> >> >> >> >> > pjd38 at cornell.edu
> >> >> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> > 
> 
> 
> 



