[mpich-discuss] DataType Problem

Anthony Chan chan at mcs.anl.gov
Mon Jan 31 16:22:36 CST 2011


Given that you are using alltoall to do the transposition, did you look into
using an existing library, e.g. P3DFFT (it is a Fortran library)?

A.Chan

----- Original Message -----
> Hi Jim,
> 
> First of all, thanks a lot for your help.
> Over the weekend I completely rewrote the subroutine. Now it works,
> but I still have some small problems.
> 
> What I am trying to accomplish is a matrix transposition.
> I need to perform a 3D FFT on an N^3 matrix, where N is always a power of 2.
> I am using a 2D domain decomposition. The problem is that the first FFT
> (from real to complex) results in N/2+1 points along coordinate 1.
> When I transpose from direction 1 to direction 2 (along columns in my
> setup), not all the processors have the same amount of data.
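> 
> Schematically, the uneven split along that direction looks like this
> (a sketch only; Nh, P, and myrank_dir are placeholders for the quantities
> just described, not my actual variable names):
> 
> integer :: Nh, P, myrank_dir, nloc
> ! distribute the Nh = N/2+1 points over the P ranks along one direction
> ! of the 2D decomposition: the first mod(Nh,P) ranks hold one extra point
> nloc = Nh / P
> if (myrank_dir < mod(Nh, P)) nloc = nloc + 1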
> 
> I attached the current subroutine I am using (do not consider sgt23_pmu,
> since it will serve for the 2->3 transposition and is not complete yet).
> 
> If I perform the direct transposition (1-->2) with the new subroutine,
> it works perfectly. The only problem is that it crashes if I try to
> free the datatype at the end.
> 
> If I perform the inverse transposition (2-->1), it works as expected
> only if I use a number of processors which is a perfect square (say 16).
> If I try with 8 processors, for example (so the domain is decomposed
> into beams with a rectangular base), the inverse transposition no longer
> works and I receive the following message:
> 
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1eb9390, scnts=0x1e9c780,
> sdispls=0x1e9ee80, stypes=0x1e9eea0, rbuf=0x1e9dc70, rcnts=0x1e9eec0,
> rdispls=0x1e9f100, rtypes=0x1e9ca90, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1cf5370, scnts=0x1cf1780,
> sdispls=0x1cf3e80, stypes=0x1cf3ea0, rbuf=0x1cf2c70, rcnts=0x1cf3ec0,
> rdispls=0x1cf4100, rtypes=0x1cf1a70, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
> rank 5 in job 26 enterprise_57863 caused collective abort of all ranks
> exit status of rank 5: killed by signal 9
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x218e5a0, scnts=0x2170780,
> sdispls=0x2170a40, stypes=0x2170a60, rbuf=0x218c180, rcnts=0x2170a80,
> rdispls=0x2170d10, rtypes=0x2170d30, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
> 
> So my problems are essentially two:
> 
> 1) I am unable to free the datatypes (a rough sketch of the freeing I
> mean follows below).
> 
> 2) I am unable to perform the backward transposition when N_proc is not
> a perfect square.
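> 
> For reference, by "freeing" I mean roughly the following (a sketch with
> placeholder handle names, not my actual code):
> 
> ! row_type/column_type stand for the derived types I created and
> ! committed; each is freed exactly once, after mpi_alltoallw has
> ! returned (predefined types such as mpi_double_complex must never
> ! be passed to mpi_type_free)
> call mpi_type_free(row_type,    errorMPI)
> call mpi_type_free(column_type, errorMPI)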
> 
> Again, thanks a lot for your help,
> 
> Michele
> 
> On Mon, 2011-01-31 at 13:44 -0600, James Dinan wrote:
> > Hi Michele,
> >
> > I've attached a small test case derived from what you sent. This runs
> > fine for me with the integer change suggested below.
> >
> > I'm still a little confused about the need for mpi_type_create_resized().
> > You're setting the lower bound to 1 and the extent to the size of a
> > double complex. These adjustments are in bytes, so if I'm interpreting
> > this correctly you are effectively shifting the beginning of the datatype
> > 1 byte into the first value in the array and then accessing a full double
> > complex from that location. This seems like it's probably not what you
> > would want to do.
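> >
> > For comparison, if the intent is just to make the extent one double
> > complex without moving the start of the type, something along these
> > lines (an untested sketch; oldtype/newtype stand for your actual
> > handles, dcsize is the mpi_type_size result from my earlier mail)
> > should do it:
> >
> > integer (kind=MPI_ADDRESS_KIND) :: lb, ext
> >
> > lb  = 0        ! keep the lower bound at the start of the type
> > ext = dcsize   ! extent of exactly one double complex, in bytes
> > call mpi_type_create_resized(oldtype, lb, ext, newtype, errorMPI)
> > call mpi_type_commit(newtype, errorMPI)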
> >
> > Could you explain the subset of the data you're trying to cover with
> > the datatype?
> >
> > Thanks,
> >   ~Jim.
> >
> > On 01/31/2011 11:13 AM, James Dinan wrote:
> > > Hi Michele,
> > >
> > > Another quick comment:
> > >
> > > Don't forget to free your MPI datatypes when you're finished with them.
> > > This shouldn't cause the error you're seeing, but it can be a resource
> > > leak that builds up over time if you call this routine frequently.
> > >
> > > call mpi_type_free(temp, errorMPI)
> > > call mpi_type_free(temp2, errorMPI)
> > > call mpi_type_free(temp3, errorMPI)
> > >
> > > Best,
> > > ~Jim.
> > >
> > > On 01/31/2011 11:07 AM, James Dinan wrote:
> > >> Hi Michele,
> > >>
> > >> I'm looking this over and trying to put together a test case from the
> > >> code you sent. One thing that looks questionable is the type for 'ext'.
> > >> The call to mpi_type_size wants an integer, however the
> > >> mpi_type_create_resized calls want an integer of kind=MPI_ADDRESS_KIND.
> > >> Could you try adding something like this:
> > >>
> > >> integer :: dcsize
> > >> integer (kind=MPI_ADDRESS_KIND) :: ext
> > >>
> > >> call mpi_type_size( mpi_double_complex , dcsize , errorMPI)
> > >> ext = dcsize
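> > >>
> > >> Then ext can be passed straight into the resize call; note that the
> > >> lower bound argument must be of kind MPI_ADDRESS_KIND as well. A
> > >> sketch (oldtype/newtype stand for your actual handles):
> > >>
> > >> integer (kind=MPI_ADDRESS_KIND) :: lb
> > >>
> > >> lb = 0
> > >> call mpi_type_create_resized(oldtype, lb, ext, newtype, errorMPI)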
> > >>
> > >> Thanks,
> > >> ~Jim.
> > >>
> > >> On 01/30/2011 02:15 AM, Michele Rosso wrote:
> > >>> Hi,
> > >>>
> > >>>
> > >>> I am developing a subroutine to handle the communication inside a
> > >>> group of processors. The source code is attached.
> > >>>
> > >>> The subroutine is contained in a module and accesses much of the data
> > >>> it needs, as well as the header "mpi.h", from another module (pmu_var).
> > >>>
> > >>> As input I have a 3D array (work1) which is allocated in the main
> > >>> program. As output I have another 3D array (work2), also allocated in
> > >>> the main program. Both of them are of type complex and have intent
> > >>> INOUT (I want to use the subroutine in a reversible way).
> > >>>
> > >>> Since the data I want to send are not contiguous, I defined several
> > >>> datatypes. I then tested all of them with a simple send-receive
> > >>> communication in the group "mpi_comm_world" (roughly the sketch
> > >>> further below).
> > >>> The problem arises when I test the datatype "temp3": the execution of
> > >>> the program stops and I receive the error:
> > >>>
> > >>> rank 0 in job 8 enterprise_45569 caused collective abort of all ranks
> > >>> exit status of rank 0: killed by signal 9
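> > >>>
> > >>> To be concrete, the send-receive self-test I mean is roughly the
> > >>> following sketch (sendtype/recvtype and myrank are placeholders for
> > >>> my actual names, not the real code):
> > >>>
> > >>> ! each rank sends one instance of the send-side type out of work1 and
> > >>> ! receives one instance of the receive-side type into work2; the two
> > >>> ! types must describe the same number of double complex elements
> > >>> call mpi_type_commit(sendtype, errorMPI)
> > >>> call mpi_type_commit(recvtype, errorMPI)
> > >>> call mpi_sendrecv(work1, 1, sendtype, myrank, 0, &
> > >>>                   work2, 1, recvtype, myrank, 0, &
> > >>>                   mpi_comm_world, mpi_status_ignore, errorMPI)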
> > >>>
> > >>> Notice that work1 and work2 have different sizes but the same shape,
> > >>> and the datatypes should be consistent with them.
> > >>>
> > >>> Does anyone have an idea of what the problem could be?
> > >>>
> > >>>
> > >>> Thanks in advance,
> > >>>
> > >>> Michele
> > >>>


More information about the mpich-discuss mailing list