[mpich-discuss] DataType Problem

James Dinan dinan at mcs.anl.gov
Thu Feb 3 11:13:51 CST 2011


Hi Michele,

Why are you unable to free the MPI types?

In general, it is safe to free datatypes as soon as you are finished
passing them into MPI calls.  The MPI implementation will continue to
hold on to the datatypes internally if they are needed for communication
you have already issued, and will free them only when all of those
operations complete.
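
For example, a pattern like the following is fine.  This is only a
minimal, self-contained sketch with made-up sizes and names, not your
code:

program free_after_use
  use mpi
  implicit none
  integer :: rowtype, ierr
  double complex :: buf(8,8)

  call mpi_init(ierr)
  buf = (0.0d0, 0.0d0)
  ! a strided "row": the first element of each of the 8 columns of buf
  call mpi_type_vector(8, 1, 8, mpi_double_complex, rowtype, ierr)
  call mpi_type_commit(rowtype, ierr)
  call mpi_bcast(buf, 1, rowtype, 0, mpi_comm_world, ierr)
  ! freeing here is safe: the implementation keeps the type alive
  ! internally until any operation still using it has completed
  call mpi_type_free(rowtype, ierr)
  call mpi_finalize(ierr)
end program free_after_use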

Best,
  ~Jim.

On 01/31/2011 02:03 PM, Michele Rosso wrote:
> Hi Jim,
>
> first of all thanks a lot for your help.
> Over the weekend I completely rewrote the subroutine. It works now,
> but I still have a few small problems.
>
> What I am trying to accomplish is a matrix transposition.
> I need to perform a 3D FFT on an N^3 matrix, where N is always a power
> of 2, and I am using a 2D domain decomposition. The problem is that the
> first FFT (real to complex) produces N/2+1 points along direction 1, so
> when I transpose from direction 1 to direction 2 (along columns in my
> setup) not all the processors hold the same amount of data.
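>
> Just to make the uneven split concrete, this is the arithmetic I mean
> (a small stand-alone sketch, not my actual code; the sizes are made up):
>
> program split_counts
>   implicit none
>   integer, parameter :: n = 16, nproc = 8   ! hypothetical sizes
>   integer :: npts, base, rest, p, cnt
>   npts = n/2 + 1            ! points along direction 1 after the r2c FFT
>   base = npts / nproc       ! minimum number of points per processor
>   rest = mod(npts, nproc)   ! leftover points go to the first ranks
>   do p = 0, nproc-1
>      cnt = base
>      if (p < rest) cnt = cnt + 1
>      print *, 'rank', p, 'holds', cnt, 'points along direction 1'
>   end do
> end program split_counts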
>
> I attached the current subroutine I am using (please ignore sgt23_pmu:
> it will serve for the 2->3 transposition and is not complete yet).
>
> If I perform the direct transposition (1-->2) with the new subroutine,
> it works perfectly. The only problem is that it crashes if I try to free
> the datatype at the end.
>
> If I perform the inverse transposition (2-->1), it works as expected
> only if I use a number of processors which is a perfect square (say 16).
> If I try with 8 processors, for example (so the domain is decomposed
> into beams with a rectangular base), the inverse transposition no
> longer works and I receive the following message:
>
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1eb9390, scnts=0x1e9c780,
> sdispls=0x1e9ee80, stypes=0x1e9eea0, rbuf=0x1e9dc70, rcnts=0x1e9eec0,
> rdispls=0x1e9f100, rtypes=0x1e9ca90, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1cf5370, scnts=0x1cf1780,
> sdispls=0x1cf3e80, stypes=0x1cf3ea0, rbuf=0x1cf2c70, rcnts=0x1cf3ec0,
> rdispls=0x1cf4100, rtypes=0x1cf1a70, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
> rank 5 in job 26  enterprise_57863   caused collective abort of all
> ranks
>    exit status of rank 5: killed by signal 9
> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x218e5a0, scnts=0x2170780,
> sdispls=0x2170a40, stypes=0x2170a60, rbuf=0x218c180, rcnts=0x2170a80,
> rdispls=0x2170d10, rtypes=0x2170d30, comm=0x84000004) failed
> (unknown)(): Pending request (no error)
>
> So my problems are essentially two:
>
> 1) Unable to free the datatypes
>
> 2) Unable to perform the backward transposition when N_proc is not a
> perfect square.
>
> Again, thanks a lot for your help,
>
> Michele
>
> On Mon, 2011-01-31 at 13:44 -0600, James Dinan wrote:
>> Hi Michele,
>>
>> I've attached a small test case derived from what you sent.  This runs
>> fine for me with the integer change suggested below.
>>
>> I'm still a little confused about the need for
>> mpi_type_create_resized().  You're setting the lower bound to 1 and the
>> extent to the size of a double complex.  These adjustments are in bytes,
>> so if I'm interpreting this correctly you are effectively shifting the
>> beginning of the data type 1 byte into the first value in the array and
>> then accessing a full double complex from that location.  This seems
>> like it's probably not what you would want to do.
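>>
>> For comparison, the pattern I would have expected (only a sketch with
>> made-up names and sizes, not your code) keeps the lower bound at 0 and
>> shrinks just the extent to one double complex:
>>
>> program resize_demo
>>   use mpi
>>   implicit none
>>   integer, parameter :: nx = 4, ny = 4        ! hypothetical local sizes
>>   integer :: slice, resized, dcsize, ierr
>>   integer (kind=MPI_ADDRESS_KIND) :: lb, extent
>>
>>   call mpi_init(ierr)
>>   ! ny elements spaced nx apart, e.g. one row of an nx-by-ny complex array
>>   call mpi_type_vector(ny, 1, nx, mpi_double_complex, slice, ierr)
>>   call mpi_type_size(mpi_double_complex, dcsize, ierr)
>>   lb = 0            ! keep the lower bound at the start of the data
>>   extent = dcsize   ! successive slices begin one double complex apart
>>   call mpi_type_create_resized(slice, lb, extent, resized, ierr)
>>   call mpi_type_commit(resized, ierr)
>>   call mpi_type_free(resized, ierr)
>>   call mpi_type_free(slice, ierr)
>>   call mpi_finalize(ierr)
>> end program resize_demo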
>>
>> Could you explain the subset of the data you're trying to cover with the
>> datatype?
>>
>> Thanks,
>>    ~Jim.
>>
>> On 01/31/2011 11:13 AM, James Dinan wrote:
>>> Hi Michele,
>>>
>>> Another quick comment:
>>>
>>> Don't forget to free your MPI datatypes when you're finished with them.
>>> This shouldn't cause the error you're seeing, but it can be a resource
>>> leak that builds up over time if you call this routine frequently.
>>>
>>> call mpi_type_free(temp, errorMPI)
>>> call mpi_type_free(temp2, errorMPI)
>>> call mpi_type_free(temp3, errorMPI)
>>>
>>> Best,
>>> ~Jim.
>>>
>>> On 01/31/2011 11:07 AM, James Dinan wrote:
>>>> Hi Michele,
>>>>
>>>> I'm looking this over and trying to put together a test case from the
>>>> code you sent. One thing that looks questionable is the type for 'ext'.
>>>> The call to mpi_type_size wants a default integer, whereas the
>>>> mpi_type_create_resized call wants an integer of kind=MPI_ADDRESS_KIND.
>>>> Could you try adding something like this:
>>>>
>>>> integer :: dcsize
>>>> integer (kind=MPI_ADDRESS_KIND) :: ext
>>>>
>>>> call mpi_type_size( mpi_double_complex , dcsize , errorMPI)
>>>> ext = dcsize
>>>>
>>>> Thanks,
>>>> ~Jim.
>>>>
>>>> On 01/30/2011 02:15 AM, Michele Rosso wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> I am developing a subroutine to handle the communication inside a group
>>>>> of processors.
>>>>> The source code is attached.
>>>>>
>>>>> The subroutine is contained in a module and gets much of the data
>>>>> it needs, as well as the header "mpi.h", from another module (pmu_var).
>>>>>
>>>>> As input I have a 3D array (work1) which is allocated in the main
>>>>> program. As output I have another 3D array (work2), also allocated in
>>>>> the main program. Both of them are of type complex and have intent
>>>>> INOUT (I want to use the subroutine in a reversible way).
>>>>>
>>>>> Since the data I want to send are not contiguous, I defined several
>>>>> datatypes. Then I tested all of them with a simple send-receive
>>>>> communication in the communicator "mpi_comm_world".
>>>>> The problem arises when I test the datatype "temp3": the execution of
>>>>> the program stops and I receive the error:
>>>>>
>>>>> rank 0 in job 8 enterprise_45569 caused collective abort of all ranks
>>>>> exit status of rank 0: killed by signal 9
>>>>>
>>>>> Notice that work1 and work2 have different sizes but the same shape,
>>>>> and the datatypes should be consistent with them.
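>>>>>
>>>>> The kind of self-test I mean is roughly the following (only a minimal
>>>>> sketch with made-up sizes and a subarray type as an example, not the
>>>>> attached code):
>>>>>
>>>>> program type_selftest
>>>>>   use mpi
>>>>>   implicit none
>>>>>   integer :: temp, ierr, rank
>>>>>   integer :: sizes(3), subsizes(3), starts(3)
>>>>>   double complex :: work1(4,4,4), work2(4,4,4)
>>>>>
>>>>>   call mpi_init(ierr)
>>>>>   call mpi_comm_rank(mpi_comm_world, rank, ierr)
>>>>>   work1 = cmplx(rank, 0, kind(1.0d0))
>>>>>   work2 = (0.0d0, 0.0d0)
>>>>>
>>>>>   ! a non-contiguous piece: a 4x2x4 slab of the 4x4x4 arrays
>>>>>   sizes    = (/ 4, 4, 4 /)
>>>>>   subsizes = (/ 4, 2, 4 /)
>>>>>   starts   = (/ 0, 0, 0 /)
>>>>>   call mpi_type_create_subarray(3, sizes, subsizes, starts, &
>>>>>        mpi_order_fortran, mpi_double_complex, temp, ierr)
>>>>>   call mpi_type_commit(temp, ierr)
>>>>>
>>>>>   ! each rank sends the slab to itself and receives it into work2
>>>>>   call mpi_sendrecv(work1, 1, temp, rank, 0, &
>>>>>                     work2, 1, temp, rank, 0, &
>>>>>                     mpi_comm_world, mpi_status_ignore, ierr)
>>>>>
>>>>>   call mpi_type_free(temp, ierr)
>>>>>   call mpi_finalize(ierr)
>>>>> end program type_selftest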
>>>>>
>>>>> Does anyone have an idea of what the problem could be?
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Michele
>>>>>
>>>>>
>>>>>