[mpich-discuss] DataType Problem

Michele Rosso michele.rosso84 at gmail.com
Mon Jan 31 16:40:05 CST 2011


Yes, I did, but I prefer to handle it myself so I have more control over
what I am doing. Besides, I cannot be sure such a library is available
on every supercomputer I will use during my PhD.
Finally, I have some intermediate steps to perform between one 1D
transposition and the next.

Michele

On Mon, 2011-01-31 at 16:22 -0600, Anthony Chan wrote:
> Given that you are using alltoall to do the transposition, did you look into
> using an existing library, e.g. P3DFFT (it is a Fortran library)?
> 
> A.Chan
> 
> ----- Original Message -----
> > Hi Jim,
> > 
> > first of all thanks a lot for your help.
> > Over the weekend I completely rewrote the subroutine. It works now,
> > but I still have a few small problems.
> > 
> > What I am trying to accomplish is a matrix transposition.
> > I need to perform a 3D FFT on an N^3 matrix, where N is always a power of 2.
> > I am using a 2D domain decomposition. The problem is that the first FFT
> > (real to complex) results in N/2+1 points along coordinate 1.
> > When I transpose from direction 1 to direction 2 (along columns in my
> > setup), not all of the processors have the same amount of data.
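> > 
> > To give a concrete idea of the imbalance, the N/2+1 points along direction 1
> > are split over the ranks of a row of the process grid roughly as in the
> > sketch below (a simplified stand-alone example, not my actual code; n and
> > nproc_row are placeholder values):
> > 
> > program split_counts
> >    implicit none
> >    integer, parameter :: n = 16, nproc_row = 3   ! placeholder sizes
> >    integer :: nx1, base, rem, irank
> >    integer :: counts(0:nproc_row-1)
> > 
> >    nx1  = n/2 + 1                   ! points along direction 1 after the r2c FFT
> >    base = nx1 / nproc_row
> >    rem  = mod(nx1, nproc_row)
> >    do irank = 0, nproc_row-1
> >       if (irank < rem) then
> >          counts(irank) = base + 1   ! the first 'rem' ranks hold one extra point
> >       else
> >          counts(irank) = base
> >       end if
> >    end do
> >    print *, 'points per rank along direction 1:', counts
> > end program split_counts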
> > 
> > I have attached the current subroutine I am using (ignore sgt23_pmu, since
> > it will handle the 2->3 transposition and is not complete yet).
> > 
> > If I perform the direct transposition (1-->2) with the new subroutine,
> > it works perfectly. The only problem is that it crashes if I try to free
> > the datatype at the end.
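> > 
> > For reference, the freeing I do at the end follows roughly the pattern
> > below (a self-contained sketch with a placeholder contiguous type, not my
> > actual subroutine; I use the mpi module here instead of the mpi.h include
> > only to keep it short):
> > 
> > program free_type_demo
> >    use mpi
> >    implicit none
> >    integer :: ierr, newtype
> > 
> >    call mpi_init(ierr)
> > 
> >    ! build and commit a simple derived type (stand-in for my real ones)
> >    call mpi_type_contiguous(4, mpi_double_complex, newtype, ierr)
> >    call mpi_type_commit(newtype, ierr)
> > 
> >    ! ... the communication using newtype would go here ...
> > 
> >    ! free only committed, user-defined handles; mpi_type_free resets the
> >    ! handle to MPI_DATATYPE_NULL, and predefined types such as
> >    ! mpi_double_complex must never be freed
> >    if (newtype /= MPI_DATATYPE_NULL) then
> >       call mpi_type_free(newtype, ierr)
> >    end if
> > 
> >    call mpi_finalize(ierr)
> > end program free_type_demo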
> > 
> > If I perform the inverse transposition (2-->1), it works as expected
> > only if I use a number of processors which is a perfect square (say 16).
> > If I try with 8 processors, for example (so the domain is decomposed into
> > beams with a rectangular base), the inverse transposition no longer works
> > and I receive the following message:
> > 
> > Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> > MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1eb9390, scnts=0x1e9c780,
> > sdispls=0x1e9ee80, stypes=0x1e9eea0, rbuf=0x1e9dc70, rcnts=0x1e9eec0,
> > rdispls=0x1e9f100, rtypes=0x1e9ca90, comm=0x84000004) failed
> > (unknown)(): Pending request (no error)
> > Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> > MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1cf5370, scnts=0x1cf1780,
> > sdispls=0x1cf3e80, stypes=0x1cf3ea0, rbuf=0x1cf2c70, rcnts=0x1cf3ec0,
> > rdispls=0x1cf4100, rtypes=0x1cf1a70, comm=0x84000004) failed
> > (unknown)(): Pending request (no error)
> > rank 5 in job 26 enterprise_57863 caused collective abort of all
> > ranks
> > exit status of rank 5: killed by signal 9
> > Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> > MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x218e5a0, scnts=0x2170780,
> > sdispls=0x2170a40, stypes=0x2170a60, rbuf=0x218c180, rcnts=0x2170a80,
> > rdispls=0x2170d10, rtypes=0x2170d30, comm=0x84000004) failed
> > (unknown)(): Pending request (no error)
> > 
> > So my problems are essentially two:
> > 
> > 1) Unable to free the datatypes
> > 
> > 2) Unable to perform the backward transposition when N_proc is not a
> > perfect square.
> > 
> > Again, thanks a lot for your help,
> > 
> > Michele
> > 
> > On Mon, 2011-01-31 at 13:44 -0600, James Dinan wrote:
> > > Hi Michele,
> > >
> > > I've attached a small test case derived from what you sent. This runs
> > > fine for me with the integer change suggested below.
> > >
> > > I'm still a little confused about the need for
> > > mpi_type_create_resized(). You're setting the lower bound to 1 and the
> > > extent to the size of a double complex. These adjustments are in bytes,
> > > so if I'm interpreting this correctly you are effectively shifting the
> > > beginning of the data type 1 byte into the first value in the array and
> > > then accessing a full double complex from that location. This seems
> > > like it's probably not what you would want to do.
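> > >
> > > For comparison, the pattern I would normally expect is something like the
> > > sketch below (illustrative only, with a placeholder strided type; the key
> > > point is that the lower bound stays 0 and only the extent, given in bytes,
> > > is changed):
> > >
> > > program resize_demo
> > >    use mpi
> > >    implicit none
> > >    integer :: ierr, dcsize, vectype, resized
> > >    integer (kind=MPI_ADDRESS_KIND) :: lb, extent
> > >
> > >    call mpi_init(ierr)
> > >
> > >    ! placeholder strided type: 4 blocks of 1 double complex, stride 8
> > >    call mpi_type_vector(4, 1, 8, mpi_double_complex, vectype, ierr)
> > >
> > >    call mpi_type_size(mpi_double_complex, dcsize, ierr)
> > >    lb     = 0          ! lower bound stays at the start of the data
> > >    extent = dcsize     ! extent shrunk to one double complex, in bytes
> > >    call mpi_type_create_resized(vectype, lb, extent, resized, ierr)
> > >    call mpi_type_commit(resized, ierr)
> > >
> > >    ! ... resized would be used in the alltoallw here ...
> > >
> > >    call mpi_type_free(resized, ierr)
> > >    call mpi_type_free(vectype, ierr)
> > >    call mpi_finalize(ierr)
> > > end program resize_demo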
> > >
> > > Could you explain the subset of the data you're trying to cover with the
> > > datatype?
> > >
> > > Thanks,
> > >   ~Jim.
> > >
> > > On 01/31/2011 11:13 AM, James Dinan wrote:
> > > > Hi Michele,
> > > >
> > > > Another quick comment:
> > > >
> > > > Don't forget to free your MPI datatypes when you're finished with them.
> > > > This shouldn't cause the error you're seeing, but it can be a resource
> > > > leak that builds up over time if you call this routine frequently.
> > > >
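> > > > ! mpi_type_free only marks the type for deallocation, so it is safe to
> > > > ! call even if communication that uses these types is still pending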
> > > > call mpi_type_free(temp, errorMPI)
> > > > call mpi_type_free(temp2, errorMPI)
> > > > call mpi_type_free(temp3, errorMPI)
> > > >
> > > > Best,
> > > > ~Jim.
> > > >
> > > > On 01/31/2011 11:07 AM, James Dinan wrote:
> > > >> Hi Michele,
> > > >>
> > > >> I'm looking this over and trying to put together a test case from the
> > > >> code you sent. One thing that looks questionable is the type for 'ext'.
> > > >> The call to mpi_type_size wants an integer, however the
> > > >> mpi_type_create_resized calls want an integer of kind=MPI_ADDRESS_KIND.
> > > >> Could you try adding something like this:
> > > >>
> > > >> integer :: dcsize
> > > >> integer (kind=MPI_ADDRESS_KIND) :: ext
> > > >>
> > > >> call mpi_type_size( mpi_double_complex , dcsize , errorMPI)
> > > >> ext = dcsize
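> > > >>
> > > >> An alternative (just a sketch, not required) is to query the extent
> > > >> directly into a kind=MPI_ADDRESS_KIND integer with mpi_type_get_extent,
> > > >> which avoids the intermediate default integer entirely:
> > > >>
> > > >> integer (kind=MPI_ADDRESS_KIND) :: lb, ext
> > > >>
> > > >> ! lb and ext are returned already as kind=MPI_ADDRESS_KIND
> > > >> call mpi_type_get_extent( mpi_double_complex , lb , ext , errorMPI)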
> > > >>
> > > >> Thanks,
> > > >> ~Jim.
> > > >>
> > > >> On 01/30/2011 02:15 AM, Michele Rosso wrote:
> > > >>> Hi,
> > > >>>
> > > >>>
> > > >>> I am developing a subroutine to handle the communication inside a group
> > > >>> of processors.
> > > >>> The source code is attached.
> > > >>>
> > > >>> The subroutine is contained in a module and accesses much of the data it
> > > >>> needs, as well as the header "mpi.h", from another module (pmu_var).
> > > >>>
> > > >>> As input I have a 3D array (work1) which is allocated in the main
> > > >>> program. As output I have another 3D array (work2) which is also
> > > >>> allocated in the main program. Both of them are of type complex and have
> > > >>> intent INOUT (I want to use the subroutine in a reversible way).
> > > >>>
> > > >>> Since the data I want to send are not contiguous, I defined several
> > > >>> datatypes. Then I tested all of them with a simple send-receive
> > > >>> communication in the group "mpi_comm_world".
> > > >>> The problem arises when I test the datatype "temp3": the execution of the
> > > >>> program stops and I receive the error:
> > > >>>
> > > >>> rank 0 in job 8 enterprise_45569 caused collective abort of all
> > > >>> ranks
> > > >>> exit status of rank 0: killed by signal 9
> > > >>>
> > > >>> Notice that work1 and work2 have different sizes but the same shape, and
> > > >>> the datatypes should be consistent with them.
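> > > >>>
> > > >>> The kind of send-receive test I am doing looks roughly like the sketch
> > > >>> below (a stripped-down stand-alone example with a placeholder contiguous
> > > >>> type standing in for "temp3", not my actual code):
> > > >>>
> > > >>> program sendrecv_test
> > > >>>    use mpi
> > > >>>    implicit none
> > > >>>    integer :: ierr, rank, newtype
> > > >>>    complex(kind(0.d0)) :: buf(8)
> > > >>>
> > > >>>    call mpi_init(ierr)
> > > >>>    call mpi_comm_rank(mpi_comm_world, rank, ierr)
> > > >>>
> > > >>>    ! placeholder derived type standing in for temp3
> > > >>>    call mpi_type_contiguous(8, mpi_double_complex, newtype, ierr)
> > > >>>    call mpi_type_commit(newtype, ierr)
> > > >>>
> > > >>>    ! rank 0 sends one instance of the type to rank 1
> > > >>>    if (rank == 0) then
> > > >>>       buf = (1.d0, 0.d0)
> > > >>>       call mpi_send(buf, 1, newtype, 1, 0, mpi_comm_world, ierr)
> > > >>>    else if (rank == 1) then
> > > >>>       call mpi_recv(buf, 1, newtype, 0, 0, mpi_comm_world, &
> > > >>>                     mpi_status_ignore, ierr)
> > > >>>    end if
> > > >>>
> > > >>>    call mpi_type_free(newtype, ierr)
> > > >>>    call mpi_finalize(ierr)
> > > >>> end program sendrecv_test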
> > > >>>
> > > >>> Does anyone have an idea of what the problem could be?
> > > >>>
> > > >>>
> > > >>> Thanks in advance,
> > > >>>
> > > >>> Michele
> > > >>>
> > > >>>
> > > >>>