[MPICH] Parallel I/O problems on 64-bit machine ( please help :-( )

Rajeev Thakur thakur at mcs.anl.gov
Fri May 26 16:39:25 CDT 2006


Did you reconfigure MPICH2 to use 8-byte integers (i.e., set the environment
variable, run configure, then make), or did you recompile your code with the
option for 8-byte integers?

Rajeev 
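
A note on the MPI_TYPE_CREATE_SUBARRAY error quoted below. One possible
explanation (an assumption, not something confirmed in this thread) is that
the application was built with 8-byte default integers while the MPI library
still reads integer arguments as 4-byte values. On a little-endian machine
such as x86_64, every second 4-byte word of an array of small 8-byte integers
is then zero. The sketch below uses Python's struct module only to show the
byte layout. The helper name view_as_int32 is made up for illustration; the
sizes 64, 64, 175 are the ones Peter mentions.

```python
import struct

def view_as_int32(values):
    """Pack values as little-endian 64-bit integers, then reread the
    same bytes as little-endian 32-bit integers (hypothetical helper)."""
    raw = struct.pack("<%dq" % len(values), *values)
    return list(struct.unpack("<%di" % (len(raw) // 4), raw))

gsizes = [64, 64, 175]          # array of sizes as an 8-byte-integer program stores it
seen = view_as_int32(gsizes)    # what a reader expecting 4-byte integers finds
print(seen)                     # [64, 0, 64, 0, 175, 0]
print(seen[1])                  # 0
```

If this is what happens, and if the index in the error message is zero-based,
the reported array_of_sizes[1]=0 would be the upper half of the first 8-byte
size rather than a real dimension.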

> -----Original Message-----
> From: Peter Diamessis [mailto:pjd38 at cornell.edu] 
> Sent: Friday, May 26, 2006 4:34 PM
> To: ywang25 at aps.anl.gov
> Cc: Ashley Pittman; thakur at mcs.anl.gov; mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine 
> (pleasehelp:-( )
> 
> Hello back YoSung and Rajeev,
> 
> Indeed, I did try configuring MPICH2 (and MPICH1) to accommodate
> 8-byte integers. When I do that I get errors from
> MPI_TYPE_CREATE_SUBARRAY, which state that the first element of the
> array of sizes is set to 0. Now, my global array has no zero-sized
> dimension, and that is confirmed by printing out gsizes(i)
> (i=1,...,3). Hm?
> 
> I really apologize if I'm troubling you guys with something totally
> simple. Following Rajeev's request, I've attached a sample program.
> It consists of three Fortran source codes:
> a) Main: the main driver.
> b) mpi_setup: I had originally planned to use a 2-D domain
> decomposition but I've ended up working with 1-D, so this is more or
> less superfluous. It's only needed when setting up the local
> starting indices.
> c) output: The routine which gives me problems.
> 
> I've attached the corresponding makefile to compile with Absoft
> mpif90. The file dim.h simply specifies the dimensions of some
> arrays, in particular the test array u(...,...,...), which is dumped
> out in single precision.
> 
> When I run this simple code on my 32-bit machine it works without a
> problem. When I do it on the 64-bit machine I get the same old
> SIGSEGV 11 error from MPI_FILE_WRITE_ALL.
> 
> Again, I hope I'm not being a hassle.
> 
> Any insight on this sample code would be greatly appreciated. I still
> owe you folks beers :-)
> 
> Cheers,
> 
> Peter
> 
> ----- Original Message ----- 
> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> To: "Peter Diamessis" <pjd38 at cornell.edu>
> Cc: "Ashley Pittman" <ashley at quadrics.com>; <thakur at mcs.anl.gov>; 
> <mpich-discuss at mcs.anl.gov>
> Sent: Friday, May 26, 2006 5:04 PM
> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine 
> (pleasehelp:-( )
> 
> 
> > According to the manual, the Fortran standard requires that INTEGER and
> > REAL be the same size. The compiler you used doesn't conform to the
> > standard, assuming the manual is still correct (i.e. the standard has
> > not been changed). In your case, it seems to me you may need to force
> > the integer to be 8 bytes when configuring MPICH2. Furthermore, 128 bits
> > for DOUBLE PRECISION quantities are required for such a compiler. It
> > shouldn't be hard to check the size of those variables with a small
> > test at run time.
> >
> > These are just some suggestions to try. I can't rule out the
> > possibility that MPI_FILE_WRITE_ALL has its own problem.
> >
> > Yusong
> >
> >
> > On Fri, 2006-05-26 at 15:16 -0400, Peter Diamessis wrote:
> >> Hi folks,
> >>
> >> Well, I did read the specific question pointed out by Yosung
> >> in the MPICH2 manual. It seems to me that this is specific to
> >> the GNU F95 compiler. The Absoft F90 compiler uses a default
> >> length of 4 bytes for integers and 8 bytes for reals, i.e. there is no
> >> such conflict. It seems to me that configuring MPICH2 with -i4
> >> is pretty much superfluous.
> >>
> >> Nevertheless, I tried it on both MPICH2 and MPICH (v1.2.6
> >> and v1.2.7p1) and I get the same error. I even tried -i8 for the heck
> >> of it and I ran into a whole new suite of problems. I repeat, MPICH
> >> v1.2.6 (including I/O) has worked beautifully for me on 32-bit
> >> machines. If I don't call my MPI parallel I/O routines, and more
> >> specifically if I comment out the calls to MPI_FILE_WRITE_ALL and
> >> MPI_FILE_READ_ALL, the rest of the code works perfectly fine on
> >> a 64-bit machine (including other MPI I/O calls).
> >>
> >> So is this what Ashley pointed out? A bug specific to
> >> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?
> >>
> >> Any additional feedback would be very welcome.
> >>
> >> Many thanks in advance,
> >>
> >> Peter
> >>
> >>
> >> ----- Original Message ----- 
> >> From: "Ashley Pittman" <ashley at quadrics.com>
> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
> >> Sent: Wednesday, May 24, 2006 7:07 AM
> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (
> >> pleasehelp:-( )
> >>
> >>
> >> >
> >> > The structf failure on 64-bit machines is a bug in the spec, not a bug
> >> > in the compiler. In effect the spec itself isn't 64-bit safe. Following
> >> > down the path of the structf error will lead to a dead end.
> >> >
> >> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently. I'll
> >> > see if I can dig up my notes about it.
> >> >
> >> > Ashley,
> >> >
> >> >
> >> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
> >> >> Thanks a-many YoSung,
> >> >>
> >> >> I'll contact the Absoft people to see if there is a similar issue
> >> >> with their F90-95 compiler. I have to be on travel tomorrow
> >> >> but I'll get back to this on Thursday.
> >> >>
> >> >> The pointer is much appreciated,
> >> >>
> >> >> Peter
> >> >>
> >> >> ----- Original Message ----- 
> >> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> Cc: <mpich-discuss at mcs.anl.gov>
> >> >> Sent: Tuesday, May 23, 2006 5:53 PM
> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit 
> machine ( please
> >> >> help:-( )
> >> >>
> >> >>
> >> >> > You might have read this in the manual already. Just in case it
> >> >> > could help.
> >> >> >
> >> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit platform,
> >> >> > some of the tests fail
> >> >> >
> >> >> > A: The g95 compiler incorrectly defines the default Fortran integer
> >> >> > as a 64-bit integer while defining Fortran reals as 32-bit values
> >> >> > (the Fortran standard requires that INTEGER and REAL be the same
> >> >> > size). This was apparently done to allow a Fortran INTEGER to hold
> >> >> > the value of a pointer, rather than requiring the programmer to
> >> >> > select an INTEGER of a suitable KIND. To force the g95 compiler to
> >> >> > correctly implement the Fortran standard, use the -i4 flag. For
> >> >> > example, set the environment variable F90FLAGS before configuring
> >> >> > MPICH2:
> >> >> >
> >> >> >   setenv F90FLAGS "-i4"
> >> >> >
> >> >> > G95 users should note that there (at this writing) are two
> >> >> > distributions of g95 for 64-bit Linux platforms. One uses 32-bit
> >> >> > integers and reals (and conforms to the Fortran standard) and one
> >> >> > uses 32-bit integers and 64-bit reals. We recommend using the one
> >> >> > that conforms to the standard (note that the standard specifies
> >> >> > the ratio of sizes, not the absolute sizes, so a Fortran 95
> >> >> > compiler that used 64 bits for both INTEGER and REAL would also
> >> >> > conform to the Fortran standard. However, such a compiler would
> >> >> > need to use 128 bits for DOUBLE PRECISION quantities).
> >> >> >
> >> >> > Yusong
> >> >> >
> >> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
> >> >> >> Hi again,
> >> >> >>
> >> >> >> I'm still obsessing as to why MPI I/O fails on my 64-bit machine.
> >> >> >> I've decided to set MPICH2 aside and work with MPICH v1.2.6, which
> >> >> >> is the one version that worked reliably for me. This is the latest
> >> >> >> I observed.
> >> >> >>
> >> >> >> I guessed that some integer argument must be passed wrong when
> >> >> >> using a 64-bit machine. I recompiled the code (I use Absoft Pro
> >> >> >> Fortran 10.0) and forced the default size of integers to be 8
> >> >> >> bytes. Lo and behold, my I/O routine crashes at an earlier point
> >> >> >> with the following interesting message:
> >> >> >>
> >> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0 .
> >> >> >>
> >> >> >> Now, all the elements of the array of sizes should be non-zero
> >> >> >> integers, e.g. 64, 64, 175. Is some information on integers being
> >> >> >> screwed up in the 64-bit layout?
> >> >> >>
> >> >> >> Note that after a few secs. of hanging I also get the following:
> >> >> >>
> >> >> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >> >> >>
> >> >> >> This is the exact same error I get when running 'make testing'
> >> >> >> after having installed MPICH, i.e.:
> >> >> >>
> >> >> >> *** Testing Type_struct from Fortran ***
> >> >> >> Differences in structf.out
> >> >> >> 2,7c2
> >> >> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
> >> >> >> < [0]  Aborting program !
> >> >> >> < [0] Aborting program!
> >> >> >> < p0_25936:  p4_error: : 972
> >> >> >> < Killed by signal 2.
> >> >> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
> >> >> >>
> >> >> >> Again, any help would be hugely appreciated. I'll buy you guys
> >> >> >> beers!
> >> >> >>
> >> >> >> Many thanks,
> >> >> >>
> >> >> >> Peter
> >> >> >>
> >> >> >>
> >> >> >> ----- Original Message ----- 
> >> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
> >> >> >> To: <mpich-discuss at mcs.anl.gov>
> >> >> >> Sent: Monday, May 22, 2006 2:33 PM
> >> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit 
> machine ( please 
> >> >> >> help
> >> >> >> :-( )
> >> >> >>
> >> >> >>
> >> >> >> > Hello folks,
> >> >> >> >
> >> >> >> > I'm writing this note to ask for some help with running MPI on
> >> >> >> > a dual proc. 64-bit Linux box I just acquired. I've written a
> >> >> >> > similar note to the mpi-bugs address but would appreciate any
> >> >> >> > additional help from anyone else in the community.
> >> >> >> >
> >> >> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to work
> >> >> >> > wonderfully with everything except for some specific parallel
> >> >> >> > I/O calls.
> >> >> >> >
> >> >> >> > Specifically, whenever there is a call to MPI_FILE_WRITE_ALL
> >> >> >> > or MPI_FILE_READ_ALL a SIGSEGV error pops up. Note that
> >> >> >> > these I/O dumps are part of a greater CFD code which
> >> >> >> > has worked fine on either a 32-bit dual proc. Linux workstation
> >> >> >> > or the USC-HPCC Linux cluster (where I was a postdoc).
> >> >> >> >
> >> >> >> > In my message to mpi-bugs, I did attach a variety of files that
> >> >> >> > could provide additional insight. In this case I'm attaching only
> >> >> >> > the Fortran source code. I can gladly provide more material to
> >> >> >> > anyone who may be interested. The troublesome Fortran call is:
> >> >> >> >
> >> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, MPI_REAL, MPI_STATUS_IGNORE)
> >> >> >> >
> >> >> >> > Upon calling this, the program crashes with a SIGSEGV 11 error.
> >> >> >> > Evidently, some memory is accessed out of core?
> >> >> >> >
> >> >> >> > Tempout is a single precision (Real with kind=4) 3-D array, which
> >> >> >> > has a total local number of elements on each processor equal to
> >> >> >> > local_array_size. If I change MPI_STATUS_IGNORE to
> >> >> >> > status_array,ierr (where status_array is appropriately
> >> >> >> > dimensioned) I find that upon error, printing out the elements of
> >> >> >> > status_array yields huge values. This error is always localized
> >> >> >> > on processor (N+1)/2 (proc. numbering goes from 0 to N-1).
> >> >> >> >
> >> >> >> > I installed MPICH2 only to observe the same results.
> >> >> >> > Calls to MPI_FILE_READ_ALL will also produce identical effects.
> >> >> >> > I'll reiterate that we've never had problems with this code on
> >> >> >> > 32-bit machines.
> >> >> >> >
> >> >> >> > Note that uname -a returns:
> >> >> >> >
> >> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
> >> >> >> >
> >> >> >> > Am I running into problems because I've got a 64-bit configured
> >> >> >> > Linux on a 64-bit machine?
> >> >> >> >
> >> >> >> > Any help would be HUGELY appreciated. The ability to use MPI2
> >> >> >> > parallel I/O on our workstation would greatly help us crunch
> >> >> >> > through some existing large datafiles generated on 32-bit
> >> >> >> > machines.
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> >
> >> >> >> > Peter
> >> >> >> >
> >> >> >> > -------------------------------------------------------------
> >> >> >> > Peter Diamessis
> >> >> >> > Assistant Professor
> >> >> >> > Environmental Fluid Mechanics & Hydrology
> >> >> >> > School of Civil and Environmental Engineering
> >> >> >> > Cornell University
> >> >> >> > Ithaca, NY 14853
> >> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
> >> >> >> > pjd38 at cornell.edu
> >> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >
> >>
> >>
> > 
> 
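
For reference, the size rule from FAQ D.4 quoted above can be written as a
small check: default INTEGER and REAL must have the same size, and DOUBLE
PRECISION must be twice that size, whatever the absolute sizes are. This is
only a sketch of the rule as the FAQ states it. The function name and the
byte triples passed to it are illustrative and do not describe any particular
compiler.

```python
def conforms_to_storage_rule(integer_bytes, real_bytes, double_bytes):
    """True when default INTEGER and REAL have the same size and
    DOUBLE PRECISION is exactly twice that size."""
    return integer_bytes == real_bytes and double_bytes == 2 * real_bytes

# (INTEGER, REAL, DOUBLE PRECISION) sizes in bytes; illustrative combinations
print(conforms_to_storage_rule(4, 4, 8))    # True
print(conforms_to_storage_rule(8, 4, 8))    # False
print(conforms_to_storage_rule(8, 8, 16))   # True
```

The second call corresponds to the 64-bit INTEGER with 32-bit REAL case the
FAQ describes, and the third to the 64/64/128-bit case it mentions as still
conforming.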



