[MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems on 64-bit machine (pleasehelp:-( )

Peter Diamessis pjd38 at cornell.edu
Fri Jun 2 14:44:02 CDT 2006


Hi back Rajeev,

Indeed, it does work on IA-64. The crash is a segmentation fault
associated with data transposition across processors when performing
2-D FFTs, and it only happens when the resolution per processor exceeds
a certain limit. Apparently my messages are too large for the default
buffer, so I should be able to work around this :-)
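
(A rough sketch of one such workaround, with heavy caveats: buf, nbig,
dest, and the chunk size below are illustrative names and values, not
from the actual code. The idea is simply to push a large transfer
through in pieces small enough to stay under the buffer limit:)

  integer, parameter :: nbig  = 1048576   ! total elements (illustrative)
  integer, parameter :: chunk = 65536     ! elements per piece (illustrative)
  real    :: buf(nbig)                    ! flattened send buffer (illustrative)
  integer :: off, n, dest, ierr           ! dest = receiving rank (illustrative)
  off = 1
  do while (off <= nbig)
     n = min(chunk, nbig - off + 1)
     call MPI_SEND(buf(off), n, MPI_REAL, dest, 0, MPI_COMM_WORLD, ierr)
     off = off + n
  end do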

Thanks again for your advice and patience,

Peter

----- Original Message ----- 
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
To: "'Peter Diamessis'" <pjd38 at cornell.edu>; <ywang25 at aps.anl.gov>
Cc: "'Ashley Pittman'" <ashley at quadrics.com>; <mpich-discuss at mcs.anl.gov>; 
<mpich2-maint at mcs.anl.gov>
Sent: Friday, June 02, 2006 3:33 PM
Subject: RE: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems on 64-bit 
machine (pleasehelp:-( )


> Yes, integers are usually 4 bytes even on 64-bit machines. 64-bit means
> the address space is 64 bits: sizeof(void *) in C would be 8 bytes.
>
> Does your program work now on IA-64? I tested on a 64-bit Sun.
>
> Rajeev
>
>
>> -----Original Message-----
>> From: Peter Diamessis [mailto:pjd38 at cornell.edu]
>> Sent: Friday, June 02, 2006 2:20 PM
>> To: Rajeev Thakur; ywang25 at aps.anl.gov
>> Cc: 'Ashley Pittman'; mpich-discuss at mcs.anl.gov;
>> mpich2-maint at mcs.anl.gov
>> Subject: Re: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O
>> problems on 64-bit machine (pleasehelp:-( )
>>
>> By the way Rajeev,
>>
>> I find that using the -i8 option with Absoft Pro Fortran only makes
>> things worse. Apparently, 4-byte integers are the default even in this
>> compiler's 64-bit mode.
>>
>> Cheers,
>>
>> Peter
>>
>> ----- Original Message ----- 
>> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
>> To: "'Peter Diamessis'" <pjd38 at cornell.edu>; <ywang25 at aps.anl.gov>
>> Cc: "'Ashley Pittman'" <ashley at quadrics.com>;
>> <mpich-discuss at mcs.anl.gov>;
>> <mpich2-maint at mcs.anl.gov>
>> Sent: Wednesday, May 31, 2006 8:40 PM
>> Subject: RE: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O
>> problems on 64-bit
>> machine (pleasehelp:-( )
>>
>>
>> > Peter,
>> >      The problem is a simple one, but it took me a very long time to
>> > figure out. You forgot to add the "ierr" parameter to
>> > MPI_File_write_all :-). It's a common mistake people make in Fortran
>> > programs, and I should have been more vigilant, but I did all sorts
>> > of debugging and stripped your code down until there was nothing left
>> > in it before I noticed the missing parameter :-).
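>> >
>> > (For reference, the corrected call from the attached output routine,
>> > with the missing "ierr" argument added at the end; ierr is a default
>> > INTEGER:)
>> >
>> >   integer :: ierr
>> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, MPI_REAL, &
>> >                           MPI_STATUS_IGNORE, ierr)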
>> >
>> > Rajeev
>> >
>> >
>> >> -----Original Message-----
>> >> From: Peter Diamessis [mailto:pjd38 at cornell.edu]
>> >> Sent: Friday, May 26, 2006 4:34 PM
>> >> To: ywang25 at aps.anl.gov
>> >> Cc: Ashley Pittman; thakur at mcs.anl.gov;
>> >> mpich-discuss at mcs.anl.gov; mpich2-maint at mcs.anl.gov
>> >> Subject: [MPICH2 Req #2556] Re: [MPICH] Parallel I/O problems
>> >> on 64-bit machine (pleasehelp:-( )
>> >>
>> >> Hello back, Yusong and Rajeev,
>> >>
>> >> Indeed, I did try configuring MPICH2 (and MPICH1) to accommodate
>> >> 8-byte integers. When I do that I get errors with
>> >> MPI_TYPE_CREATE_SUBARRAY, which state that the first element of the
>> >> array of sizes is set to 0. Now, my global array has no zero
>> >> dimension, which is confirmed by printing out gsizes(i) (i=1,...,3).
>> >> Hm?
>> >>
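>> >> (If the MPI library underneath was in fact built for 4-byte default
>> >> INTEGERs, an 8-byte gsizes(1)=64 would be read as two 4-byte values,
>> >> 64 and 0, on a little-endian x86_64 box, which would explain the
>> >> spurious array_of_sizes[1]=0. For reference, a minimal sketch of the
>> >> subarray setup with default integers; lsizes, starts, and filetype
>> >> are illustrative names:)
>> >>
>> >>   integer :: gsizes(3), lsizes(3), starts(3), filetype, ierr
>> >>   call MPI_TYPE_CREATE_SUBARRAY(3, gsizes, lsizes, starts, &
>> >>                                 MPI_ORDER_FORTRAN, MPI_REAL, &
>> >>                                 filetype, ierr)
>> >>   call MPI_TYPE_COMMIT(filetype, ierr)
>> >>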
>> >> I really apologize if I'm troubling you guys with something totally
>> >> simple. Following Rajeev's request, I've attached a sample program.
>> >> It consists of three Fortran source files:
>> >> a) Main: the main driver.
>> >> b) mpi_setup: I had originally planned to use a 2-D domain
>> >> decomposition but I've ended up working with 1-D, so this is more or
>> >> less superfluous. It's only needed when setting up the local
>> >> starting indices.
>> >> c) output: the routine which gives me problems.
>> >>
>> >> I've attached the corresponding makefile to compile with Absoft
>> >> mpif90. The file dim.h simply specifies the dimensions of some
>> >> arrays, in particular the test array u(...,...,...), which is dumped
>> >> out in single precision.
>> >>
>> >> When I run this simple code on my 32-bit machine it works without a
>> >> problem. When I run it on the 64-bit machine I get the same old
>> >> SIGSEGV 11 error from MPI_FILE_WRITE_ALL.
>> >>
>> >> Again, I hope I'm not being a hassle.
>> >>
>> >> Any insight on this sample code would be greatly appreciated.
>> >> I still owe
>> >> you folks beers :-)
>> >>
>> >> Cheers,
>> >>
>> >> Peter
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> Cc: "Ashley Pittman" <ashley at quadrics.com>; <thakur at mcs.anl.gov>;
>> >> <mpich-discuss at mcs.anl.gov>
>> >> Sent: Friday, May 26, 2006 5:04 PM
>> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
>> >> (pleasehelp:-( )
>> >>
>> >>
>> >> > According to the manual, the Fortran standard requires that
>> >> > INTEGER and REAL be the same size. The compiler you used doesn't
>> >> > conform to the standard, assuming the manual is still correct
>> >> > (i.e., the standard has not changed). For your case, it seems to
>> >> > me you may need to force the integer to be 8 bytes when
>> >> > configuring MPICH2. Furthermore, 128 bits for DOUBLE PRECISION
>> >> > quantities would be required for such a compiler. It shouldn't be
>> >> > hard to check the size of those variables with a small test
>> >> > program.
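>> >> >
>> >> > (A minimal, self-contained sketch of such a check; bit_size is
>> >> > standard Fortran 90 for integers, and the kind numbers, while
>> >> > compiler-dependent, usually equal the byte counts:)
>> >> >
>> >> >   program sizecheck
>> >> >     integer :: i
>> >> >     real :: r
>> >> >     double precision :: d
>> >> >     print *, 'INTEGER bits: ', bit_size(i)
>> >> >     print *, 'kind(REAL) = ', kind(r), '  kind(DOUBLE) = ', kind(d)
>> >> >   end program sizecheck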
>> >> >
>> >> > These are just some suggestions to try. I can't eliminate the
>> >> > possibility that MPI_FILE_WRITE_ALL has its own problem.
>> >> >
>> >> > Yusong
>> >> >
>> >> >
>> >> > On Fri, 2006-05-26 at 15:16 -0400, Peter Diamessis wrote:
>> >> >> Hi folks,
>> >> >>
>> >> >> Well, I did read the specific question Yusong pointed out in the
>> >> >> MPICH2 manual. It seems that this is specific to the GNU F95
>> >> >> compiler. The Absoft F90 compiler uses a default 4-byte length
>> >> >> for integers and 8 bytes for reals, i.e. there is no such
>> >> >> conflict. It seems to me that configuring MPICH2 with -i4 is
>> >> >> pretty much superfluous.
>> >> >>
>> >> >> Nevertheless, I tried it on both MPICH2 and MPICH (v1.2.6 and
>> >> >> v1.2.7p1) and I get the same error. I even tried -i8 for the heck
>> >> >> of it and ran into a whole new suite of problems. I repeat, MPICH
>> >> >> v1.2.6 (including I/O) has worked beautifully for me on 32-bit
>> >> >> machines. If I don't call my MPI parallel I/O routines, and more
>> >> >> specifically if I comment out the calls to MPI_FILE_WRITE_ALL and
>> >> >> MPI_FILE_READ_ALL, the rest of the code works perfectly fine on a
>> >> >> 64-bit machine (including other MPI I/O calls).
>> >> >>
>> >> >> So is this what Ashley pointed out? A bug specific to
>> >> >> MPI_FILE_WRITE_ALL and MPI_FILE_READ_ALL?
>> >> >>
>> >> >> Any additional feedback would be very welcome.
>> >> >>
>> >> >> Many thanks in advance,
>> >> >>
>> >> >> Peter
>> >> >>
>> >> >>
>> >> >> ----- Original Message ----- 
>> >> >> From: "Ashley Pittman" <ashley at quadrics.com>
>> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
>> >> >> Sent: Wednesday, May 24, 2006 7:07 AM
>> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (
>> >> >> pleasehelp:-( )
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > The structf failure on 64-bit machines is a bug in the spec,
>> >> >> > not a bug in the compiler. In effect, the spec itself isn't
>> >> >> > 64-bit safe. Following the path of the structf error will lead
>> >> >> > to a dead end.
>> >> >> >
>> >> >> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently;
>> >> >> > I'll see if I can dig up my notes about it.
>> >> >> >
>> >> >> > Ashley,
>> >> >> >
>> >> >> >
>> >> >> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
>> >> >> >> Thanks a-many, Yusong,
>> >> >> >>
>> >> >> >> I'll contact the Absoft people to see if there is a similar
>> >> >> >> issue with their F90/F95 compiler. I have to travel tomorrow
>> >> >> >> but I'll get back to this on Thursday.
>> >> >> >>
>> >> >> >> The pointer is much appreciated,
>> >> >> >>
>> >> >> >> Peter
>> >> >> >>
>> >> >> >> ----- Original Message ----- 
>> >> >> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> >> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> >> Cc: <mpich-discuss at mcs.anl.gov>
>> >> >> >> Sent: Tuesday, May 23, 2006 5:53 PM
>> >> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit
>> >> machine ( please
>> >> >> >> help:-( )
>> >> >> >>
>> >> >> >>
>> >> >> >> > You might have read this in the manual already. Just in
>> >> >> >> > case it could help:
>> >> >> >> >
>> >> >> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit
>> >> >> >> > platform, some of the tests fail
>> >> >> >> >
>> >> >> >> > A: The g95 compiler incorrectly defines the default Fortran
>> >> >> >> > integer as a 64-bit integer while defining Fortran reals as
>> >> >> >> > 32-bit values (the Fortran standard requires that INTEGER
>> >> >> >> > and REAL be the same size). This was apparently done to
>> >> >> >> > allow a Fortran INTEGER to hold the value of a pointer,
>> >> >> >> > rather than requiring the programmer to select an INTEGER of
>> >> >> >> > a suitable KIND. To force the g95 compiler to correctly
>> >> >> >> > implement the Fortran standard, use the -i4 flag. For
>> >> >> >> > example, set the environment variable F90FLAGS before
>> >> >> >> > configuring MPICH2: setenv F90FLAGS "-i4"
>> >> >> >> > G95 users should note that there are (at this writing) two
>> >> >> >> > distributions of g95 for 64-bit Linux platforms. One uses
>> >> >> >> > 32-bit integers and reals (and conforms to the Fortran
>> >> >> >> > standard) and one uses 64-bit integers and 32-bit reals. We
>> >> >> >> > recommend using the one that conforms to the standard (note
>> >> >> >> > that the standard specifies the ratio of sizes, not the
>> >> >> >> > absolute sizes, so a Fortran 95 compiler that used 64 bits
>> >> >> >> > for both INTEGER and REAL would also conform to the Fortran
>> >> >> >> > standard. However, such a compiler would need to use 128
>> >> >> >> > bits for DOUBLE PRECISION quantities).
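>> >> >> >> >
>> >> >> >> > (A one-line illustration of selecting "an INTEGER of a
>> >> >> >> > suitable KIND" for an address-sized value; addr_k and p are
>> >> >> >> > assumed names:)
>> >> >> >> >
>> >> >> >> >   integer, parameter :: addr_k = selected_int_kind(18)
>> >> >> >> >   integer(kind=addr_k) :: p   ! at least 64 bits on any conforming compiler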
>> >> >> >> >
>> >> >> >> > Yusong
>> >> >> >> >
>> >> >> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
>> >> >> >> >> Hi again,
>> >> >> >> >>
>> >> >> >> >> I'm still obsessing over why MPI I/O fails on my 64-bit
>> >> >> >> >> machine. I've decided to set MPICH2 aside and work with
>> >> >> >> >> MPICH v1.2.6, the one version that has worked reliably for
>> >> >> >> >> me. Here is the latest thing I observed.
>> >> >> >> >>
>> >> >> >> >> I guessed that some integer argument must be getting
>> >> >> >> >> passed incorrectly on a 64-bit machine. I recompiled the
>> >> >> >> >> code (I use Absoft Pro Fortran 10.0) and forced the default
>> >> >> >> >> size of integers to be 8 bytes. Lo and behold, my I/O
>> >> >> >> >> routine crashes at an earlier point with the following
>> >> >> >> >> interesting message:
>> >> >> >> >>
>> >> >> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in array_of_sizes[1]=0
>> >> >> >> >>
>> >> >> >> >> Now, all the elements of the array of sizes should be
>> >> >> >> >> non-zero integers, e.g. 64, 64, 175. Is some information
>> >> >> >> >> on integers being screwed up in the 64-bit layout?
>> >> >> >> >>
>> >> >> >> >> Note that after a few seconds of hanging I also get the
>> >> >> >> >> following:
>> >> >> >> >>
>> >> >> >> >> p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
>> >> >> >> >>
>> >> >> >> >> This is the exact same error I get when running 'make
>> >> >> >> >> testing' after having installed MPICH, i.e.:
>> >> >> >> >>
>> >> >> >> >> *** Testing Type_struct from Fortran ***
>> >> >> >> >> Differences in structf.out
>> >> >> >> >> 2,7c2
>> >> >> >> >> < 0 - MPI_ADDRESS : Address of location given to MPI_ADDRESS does not fit in Fortran integer
>> >> >> >> >> < [0]  Aborting program !
>> >> >> >> >> < [0] Aborting program!
>> >> >> >> >> < p0_25936:  p4_error: : 972
>> >> >> >> >> < Killed by signal 2.
>> >> >> >> >> < p0_25936: (0.089844) net_send: could not write to fd=4, errno = 32
>> >> >> >> >>
>> >> >> >> >> Again, any help would be hugely appreciated. I'll buy you
>> >> >> >> >> guys beers!
>> >> >> >> >>
>> >> >> >> >> Many thanks,
>> >> >> >> >>
>> >> >> >> >> Peter
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> ----- Original Message ----- 
>> >> >> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> >> >> To: <mpich-discuss at mcs.anl.gov>
>> >> >> >> >> Sent: Monday, May 22, 2006 2:33 PM
>> >> >> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit
>> >> machine ( please
>> >> >> >> >> help
>> >> >> >> >> :-( )
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > Hello folks,
>> >> >> >> >> >
>> >> >> >> >> > I'm writing this note to ask for some help with running
>> >> >> >> >> > MPI on a dual-processor 64-bit Linux box I just acquired.
>> >> >> >> >> > I've written a similar note to the mpi-bugs address but
>> >> >> >> >> > would appreciate any additional help from anyone else in
>> >> >> >> >> > the community.
>> >> >> >> >> >
>> >> >> >> >> > I'm using MPICH v1.2.7p1, which, when tested, seems to
>> >> >> >> >> > work wonderfully with everything except for some specific
>> >> >> >> >> > parallel I/O calls.
>> >> >> >> >> >
>> >> >> >> >> > Specifically, whenever there is a call to
>> >> >> >> >> > MPI_FILE_WRITE_ALL or MPI_FILE_READ_ALL, a SIGSEGV error
>> >> >> >> >> > pops up. Note that these I/O dumps are part of a larger
>> >> >> >> >> > CFD code which has worked fine on both a 32-bit
>> >> >> >> >> > dual-processor Linux workstation and the USC-HPCC Linux
>> >> >> >> >> > cluster (where I was a postdoc).
>> >> >> >> >> >
>> >> >> >> >> > In my message to mpi-bugs, I did attach a variety of
>> >> >> >> >> > files that could provide additional insight. In this case
>> >> >> >> >> > I'm attaching only the Fortran source code; I can gladly
>> >> >> >> >> > provide more material to anyone who may be interested.
>> >> >> >> >> > The troublesome Fortran call is:
>> >> >> >> >> >
>> >> >> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, &
>> >> >> >> >> >                           MPI_REAL, MPI_STATUS_IGNORE)
>> >> >> >> >> >
>> >> >> >> >> > Upon calling this, the program crashes with a SIGSEGV 11
>> >> >> >> >> > error. Evidently, some memory is accessed out of bounds?
>> >> >> >> >> >
>> >> >> >> >> > Tempout is a single-precision (REAL with kind=4) 3-D
>> >> >> >> >> > array whose total local number of elements on each
>> >> >> >> >> > processor equals local_array_size. If I change
>> >> >> >> >> > MPI_STATUS_IGNORE to status_array, ierr (where
>> >> >> >> >> > status_array is appropriately dimensioned), I find that
>> >> >> >> >> > upon error, printing out the elements of status_array
>> >> >> >> >> > yields huge values. The error is always localized on
>> >> >> >> >> > processor (N+1)/2 (processor numbering goes from 0 to
>> >> >> >> >> > N-1).
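>> >> >> >> >> >
>> >> >> >> >> > (A sketch of that status-returning variant, assuming the
>> >> >> >> >> > surrounding declarations; note that the trailing ierr
>> >> >> >> >> > argument is required in either form:)
>> >> >> >> >> >
>> >> >> >> >> >   integer :: status_array(MPI_STATUS_SIZE), ierr
>> >> >> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size, &
>> >> >> >> >> >                           MPI_REAL, status_array, ierr)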
>> >> >> >> >> >
>> >> >> >> >> > I installed MPICH2 only to observe the same results.
>> >> >> >> >> > Calls to MPI_FILE_READ_ALL also produce identical
>> >> >> >> >> > effects. I'll reiterate that we've never had problems
>> >> >> >> >> > with this code on 32-bit machines.
>> >> >> >> >> >
>> >> >> >> >> > Note that uname -a returns:
>> >> >> >> >> >
>> >> >> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> >> >> >
>> >> >> >> >> > Am I running into problems because I've got a
>> >> >> >> >> > 64-bit-configured Linux on a 64-bit machine?
>> >> >> >> >> >
>> >> >> >> >> > Any help would be HUGELY appreciated. The ability to use
>> >> >> >> >> > MPI-2 parallel I/O on our workstation would greatly help
>> >> >> >> >> > us crunch through some existing large data files
>> >> >> >> >> > generated on 32-bit machines.
>> >> >> >> >> >
>> >> >> >> >> > Cheers,
>> >> >> >> >> >
>> >> >> >> >> > Peter
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > -------------------------------------------------------------
>> >> >> >> >> > Peter Diamessis
>> >> >> >> >> > Assistant Professor
>> >> >> >> >> > Environmental Fluid Mechanics & Hydrology
>> >> >> >> >> > School of Civil and Environmental Engineering
>> >> >> >> >> > Cornell University
>> >> >> >> >> > Ithaca, NY 14853
>> >> >> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
>> >> >> >> >> > pjd38 at cornell.edu
>> >> >> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >>
>> >
>>
>>
>>
> 




