[MPICH] Parallel I/O problems on 64-bit machine (pleasehelp:-( )

Peter Diamessis pjd38 at cornell.edu
Fri May 26 17:19:37 CDT 2006


Hi back Rajeev,

Yes, indeed. I tried that reconfiguration, building both MPICH2 and MPICH1
(v1.2.6) to use 8-byte integers, as I wrote in my last message. In that
case, I get errors from MPI_TYPE_CREATE_SUBARRAY.

I'm going in circles here, I think. If you have any thoughts on the sample
program I sent, I'd greatly appreciate it. Again, I hope I'm not being a pain.

Sincerely,

Peter

----- Original Message ----- 
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
To: "'Peter Diamessis'" <pjd38 at cornell.edu>; <ywang25 at aps.anl.gov>
Cc: "'Ashley Pittman'" <ashley at quadrics.com>; <mpich-discuss at mcs.anl.gov>
Sent: Friday, May 26, 2006 5:39 PM
Subject: RE: [MPICH] Parallel I/O problems on 64-bit machine 
(pleasehelp:-( )


> Did you reconfigure MPICH2 to use 8-byte integers (i.e., set the environment
> variable, run configure, make) or just recompile your code with the option
> for 8-byte integers?
>
> Rajeev
>
>> -----Original Message-----
>> From: Peter Diamessis [mailto:pjd38 at cornell.edu]
>> Sent: Friday, May 26, 2006 4:34 PM
>> To: ywang25 at aps.anl.gov
>> Cc: Ashley Pittman; thakur at mcs.anl.gov; mpich-discuss at mcs.anl.gov
>> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
>> (pleasehelp:-( )
>>
>> Hello back Yusong and Rajeev,
>>
>> Indeed, I did try configuring MPICH2 (and MPICH1) to accommodate
>> 8-byte integers. When I do that, I get errors from
>> MPI_TYPE_CREATE_SUBARRAY
>> which state that the first element of the array of sizes is set to 0.
>> Now, my global array has no zero dimension, and that is confirmed by
>> printing out gsizes(i) (i=1,...,3). Hm?
>>
>> I really apologize if I'm troubling you guys with something totally
>> simple. Following Rajeev's request, I've attached a sample program.
>> It consists of three Fortran source files:
>> a) Main: the main driver.
>> b) mpi_setup: I had originally planned to use a 2-D domain
>> decomposition but I've ended up working with 1-D, so this is more or
>> less superfluous. It's only needed when setting up the local starting
>> indices.
>> c) output: the routine which gives me problems.
>>
>> I've attached the corresponding makefile to compile with Absoft mpif90.
>> The file dim.h simply specifies the dimensions of some arrays, in
>> particular the test array u(...,...,...), which is dumped out in single
>> precision.
>>
>> When I run this simple code on my 32-bit machine, it works without a
>> problem. When I run it on the 64-bit machine, I get the same old SIGSEGV
>> (signal 11) error from MPI_FILE_WRITE_ALL.
>>
>> Again, I hope I'm not being a hassle.
>>
>> Any insight on this sample code would be greatly appreciated.
>> I still owe
>> you folks beers :-)
>>
>> Cheers,
>>
>> Peter
>>
>> ----- Original Message ----- 
>> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> Cc: "Ashley Pittman" <ashley at quadrics.com>; <thakur at mcs.anl.gov>;
>> <mpich-discuss at mcs.anl.gov>
>> Sent: Friday, May 26, 2006 5:04 PM
>> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine
>> (pleasehelp:-( )
>>
>>
>> > According to the manual, the Fortran standard requires that INTEGER
>> > and REAL be the same size. The compiler you used doesn't conform to
>> > the standard, assuming the manual is still current (i.e. the standard
>> > has not changed). For your case, it seems to me you may need to force
>> > the integer to be 8 bytes when configuring MPICH2. Furthermore, 128
>> > bits for DOUBLE PRECISION quantities would then be required for such
>> > a compiler. It shouldn't be hard to check the sizes of those
>> > variables with a small test at runtime.
>> >
>> > These are just some suggestions to try. I can't eliminate the
>> > possibility that MPI_FILE_WRITE_ALL has its own problem.
>> >
>> > Yusong
>> >
>> >
>> > On Fri, 2006-05-26 at 15:16 -0400, Peter Diamessis wrote:
>> >> Hi folks,
>> >>
>> >> Well, I did read the specific question pointed out by Yusong
>> >> in the MPICH2 manual. It seems that this is specific to
>> >> the GNU F95 compiler. The Absoft F90 compiler uses a default
>> >> 4-byte length for integers and 8 bytes for reals, i.e. there is no
>> >> such conflict. It seems to me that configuring MPICH2 with -i4
>> >> is pretty much superfluous.
>> >>
>> >> Nevertheless, I tried it on both MPICH2 as well as MPICH (v1.2.6
>> >> and v1.2.7p1) and I get the same error. I even tried -i8 for the
>> >> heck of it and ran into a whole new suite of problems. I repeat,
>> >> MPICH v1.2.6 (including I/O) has worked beautifully for me on
>> >> 32-bit machines. If I don't call my MPI parallel I/O routines, and
>> >> more specifically if I comment out the calls to MPI_FILE_WRITE_ALL
>> >> and MPI_FILE_READ_ALL, the rest of the code works perfectly fine on
>> >> a 64-bit machine (including other MPI I/O calls).
>> >>
>> >> So is this what Ashley pointed out ? A bug specific to
>> MPI_FILE_WRITE_ALL
>> >> and MPI_FILE_READ_ALL ?
>> >>
>> >> Any additional feedback would be very welcome.
>> >>
>> >> Many thanks in advance,
>> >>
>> >> Peter
>> >>
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Ashley Pittman" <ashley at quadrics.com>
>> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> Cc: <ywang25 at aps.anl.gov>; <mpich-discuss at mcs.anl.gov>
>> >> Sent: Wednesday, May 24, 2006 7:07 AM
>> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit machine (
>> >> pleasehelp:-( )
>> >>
>> >>
>> >> >
>> >> > The structf failure on 64-bit machines is a bug in the spec, not
>> >> > a bug in the compiler.  In effect, the spec itself isn't 64-bit
>> >> > safe.  Following down the path of the structf error will lead to
>> >> > a dead end.
>> >> >
>> >> > I'm fairly sure I've seen a bug in MPI_FILE_WRITE_ALL recently;
>> >> > I'll see if I can dig up my notes about it.
>> >> >
>> >> > Ashley,
>> >> >
>> >> >
>> >> > On Tue, 2006-05-23 at 18:52 -0400, Peter Diamessis wrote:
>> >> >> Thanks a-many Yusong,
>> >> >>
>> >> >> I'll contact the Absoft people to see if there is a
>> similar issue
>> >> >> with their F90-95 compiler. I have to be on travel tomorrow
>> >> >> but I'll get back to this on Thursday.
>> >> >>
>> >> >> The pointer is much appreciated,
>> >> >>
>> >> >> Peter
>> >> >>
>> >> >> ----- Original Message ----- 
>> >> >> From: "Yusong Wang" <ywang25 at aps.anl.gov>
>> >> >> To: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> Cc: <mpich-discuss at mcs.anl.gov>
>> >> >> Sent: Tuesday, May 23, 2006 5:53 PM
>> >> >> Subject: Re: [MPICH] Parallel I/O problems on 64-bit
>> machine ( please
>> >> >> help:-( )
>> >> >>
>> >> >>
>> >> >> > You might have read this in the manual already. Just in case
>> >> >> > it could help:
>> >> >> >
>> >> >> > D.4 Q: When I use the g95 Fortran compiler on a 64-bit
>> >> >> > platform, some of the tests fail
>> >> >> >
>> >> >> > A: The g95 compiler incorrectly defines the default Fortran
>> >> >> > integer as a 64-bit integer while defining Fortran reals as
>> >> >> > 32-bit values (the Fortran standard requires that INTEGER and
>> >> >> > REAL be the same size). This was apparently done to allow a
>> >> >> > Fortran INTEGER to hold the value of a pointer, rather than
>> >> >> > requiring the programmer to select an INTEGER of a suitable
>> >> >> > KIND. To force the g95 compiler to correctly implement the
>> >> >> > Fortran standard, use the -i4 flag. For example, set the
>> >> >> > environment variable F90FLAGS before configuring MPICH2:
>> >> >> > setenv F90FLAGS "-i4"
>> >> >> > G95 users should note that there are (at this writing) two
>> >> >> > distributions of g95 for 64-bit Linux platforms. One uses
>> >> >> > 32-bit integers and reals (and conforms to the Fortran
>> >> >> > standard) and one uses 32-bit integers and 64-bit reals. We
>> >> >> > recommend using the one that conforms to the standard (note
>> >> >> > that the standard specifies the ratio of sizes, not the
>> >> >> > absolute sizes, so a Fortran 95 compiler that used 64 bits
>> >> >> > for both INTEGER and REAL would also conform to the Fortran
>> >> >> > standard. However, such a compiler would need to use 128 bits
>> >> >> > for DOUBLE PRECISION quantities).
>> >> >> >
>> >> >> > Yusong
>> >> >> >
>> >> >> > On Tue, 2006-05-23 at 14:48 -0400, Peter Diamessis wrote:
>> >> >> >> Hi again,
>> >> >> >>
>> >> >> >> I'm still obsessing over why MPI I/O fails on my 64-bit
>> >> >> >> machine. I've decided to set MPICH2 aside and work with
>> >> >> >> MPICH v1.2.6, which is the one version that has worked
>> >> >> >> reliably for me. Here is the latest thing I observed.
>> >> >> >>
>> >> >> >> I guessed that some integer argument must be passed wrong
>> >> >> >> when using a 64-bit machine. I recompiled the code (I use
>> >> >> >> Absoft Pro Fortran 10.0) and forced the default size of
>> >> >> >> integers to be 8 bytes. Lo and behold, my I/O routine
>> >> >> >> crashes at an earlier point with the following interesting
>> >> >> >> message:
>> >> >> >>
>> >> >> >> 0 - MPI_TYPE_CREATE_SUBARRAY: Invalid value in
>> >> >> >> array_of_sizes[1]=0
>> >> >> >>
>> >> >> >> Now, all the elements of the array of sizes should be
>> >> >> >> non-zero integers, e.g. 64, 64, 175. Is some information on
>> >> >> >> integers being screwed up in the 64-bit layout?
>> >> >> >>
>> >> >> >> Note that after a few seconds of hanging, I also get
>> >> >> >> the following:
>> >> >> >>
>> >> >> >> p0_25936: (0.089844) net_send: could not write to
>> fd=4, errno = 32
>> >> >> >>
>> >> >> >> This is the exact same error I get when running
>> >> >> >> 'make testing' after having installed MPICH, i.e.:
>> >> >> >>
>> >> >> >> *** Testing Type_struct from Fortran ***
>> >> >> >> Differences in structf.out
>> >> >> >> 2,7c2
>> >> >> >> < 0 - MPI_ADDRESS : Address of location given to
>> >> >> >> MPI_ADDRESS does not fit in Fortran integer
>> >> >> >> < [0]  Aborting program !
>> >> >> >> < [0] Aborting program!
>> >> >> >> < p0_25936:  p4_error: : 972
>> >> >> >> < Killed by signal 2.
>> >> >> >> < p0_25936: (0.089844) net_send: could not write to fd=4,
>> >> >> >> errno = 32
>> >> >> >>
>> >> >> >> Again, any help would be hugely appreciated. I'll
>> buy you guys
>> >> >> >> beers !
>> >> >> >>
>> >> >> >> Many thanks,
>> >> >> >>
>> >> >> >> Peter
>> >> >> >>
>> >> >> >>
>> >> >> >> ----- Original Message ----- 
>> >> >> >> From: "Peter Diamessis" <pjd38 at cornell.edu>
>> >> >> >> To: <mpich-discuss at mcs.anl.gov>
>> >> >> >> Sent: Monday, May 22, 2006 2:33 PM
>> >> >> >> Subject: [MPICH] Parallel I/O problems on 64-bit
>> machine ( please
>> >> >> >> help
>> >> >> >> :-( )
>> >> >> >>
>> >> >> >>
>> >> >> >> > Hello folks,
>> >> >> >> >
>> >> >> >> > I'm writing this note to ask for some help with running
>> >> >> >> > MPI on a dual-processor 64-bit Linux box I just acquired.
>> >> >> >> > I've written a similar note to the mpi-bugs address but
>> >> >> >> > would appreciate any additional help from anyone else in
>> >> >> >> > the community.
>> >> >> >> >
>> >> >> >> > I'm using MPICH v1.2.7p1,
>> >> >> >> > which, when tested,  seems to work wonderfully
>> with everything
>> >> >> >> > except
>> >> >> >> > for
>> >> >> >> > some specific parallel I/O calls.
>> >> >> >> >
>> >> >> >> > Specifically, whenever there is a call to
>> >> >> >> > MPI_FILE_WRITE_ALL or MPI_FILE_READ_ALL, a SIGSEGV error
>> >> >> >> > pops up. Note that these I/O dumps are part of a larger
>> >> >> >> > CFD code which has worked fine on either a 32-bit
>> >> >> >> > dual-processor Linux workstation or the USC-HPCC Linux
>> >> >> >> > cluster (where I was a postdoc).
>> >> >> >> >
>> >> >> >> > In my message to mpi-bugs, I did attach a variety of
>> >> >> >> > files that could provide additional insight. In this case
>> >> >> >> > I'm attaching only the Fortran source code; I can gladly
>> >> >> >> > provide more material to anyone who may be interested.
>> >> >> >> > The troublesome Fortran call is:
>> >> >> >> >
>> >> >> >> >   call MPI_FILE_WRITE_ALL(fh, tempout, local_array_size,
>> >> >> >> >        MPI_REAL, MPI_STATUS_IGNORE, ierr)
>> >> >> >> >
>> >> >> >> > Upon calling this, the program crashes with a SIGSEGV
>> >> >> >> > (signal 11) error. Evidently, some memory is accessed out
>> >> >> >> > of bounds?
>> >> >> >> >
>> >> >> >> > Tempout is a single-precision (REAL with kind=4) 3-D
>> >> >> >> > array, which has a total local number of elements on each
>> >> >> >> > processor equal to local_array_size. If I change
>> >> >> >> > MPI_STATUS_IGNORE to status_array, ierr (where
>> >> >> >> > status_array is appropriately dimensioned), I find that
>> >> >> >> > upon error, printing out the elements of status_array
>> >> >> >> > yields huge values. This error is always localized on
>> >> >> >> > processor (N+1)/2 (processor numbering goes from 0 to
>> >> >> >> > N-1).
>> >> >> >> >
>> >> >> >> > I installed MPICH2 only to observe the same results.
>> >> >> >> > Calls to MPI_FILE_READ_ALL will also produce
>> identical effects.
>> >> >> >> > I'll reiterate that we've never had problems with
>> this code on
>> >> >> >> > 32-bit
>> >> >> >> > machines.
>> >> >> >> >
>> >> >> >> > Note that uname -a returns:
>> >> >> >> >
>> >> >> >> > Linux pacific.cee.cornell.edu 2.6.9-5.ELsmp #1 SMP
>> Wed Jan 5
>> >> >> >> > 19:29:47
>> >> >> >> > EST
>> >> >> >> > 2005 x86_64 x86_64 x86_64 GNU/Linux
>> >> >> >> >
>> >> >> >> > Am I running into problems because I've got a 64-bit
>> >> >> >> > configured Linux on a 64-bit machine?
>> >> >> >> >
>> >> >> >> > Any help would be HUGELY appreciated. The ability to use
>> >> >> >> > MPI2 parallel I/O on our workstation would greatly help
>> >> >> >> > us crunch through some existing large datafiles generated
>> >> >> >> > on 32-bit machines.
>> >> >> >> >
>> >> >> >> > Cheers,
>> >> >> >> >
>> >> >> >> > Peter
>> >> >> >> >
>> >> >> >> >
>> -------------------------------------------------------------
>> >> >> >> > Peter Diamessis
>> >> >> >> > Assistant Professor
>> >> >> >> > Environmental Fluid Mechanics & Hydrology
>> >> >> >> > School of Civil and Environmental Engineering
>> >> >> >> > Cornell University
>> >> >> >> > Ithaca, NY 14853
>> >> >> >> > Phone: (607)-255-1719 --- Fax: (607)-255-9004
>> >> >> >> > pjd38 at cornell.edu
>> >> >> >> > http://www.cee.cornell.edu/fbxk/fcbo.cfm?pid=494
>> >> >> >> >
>> >> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >
>>
> 





More information about the mpich-discuss mailing list