[mpich-discuss] DataType Problem

Pavan Balaji balaji at mcs.anl.gov
Thu Feb 3 12:39:57 CST 2011


Apart from what Jim suggested below, there are also a few parallel 
debuggers available. Totalview and DDT are commercial debuggers, while 
padb is a free one.

  -- Pavan

On 02/03/2011 12:35 PM, James Dinan wrote:
> Hi Michele,
>
> I don't know of a good tutorial on parallel debugging.  Can others
> suggest something?
>
> gdb doesn't know how to debug MPI parallel executions on its own, so you
> have a couple of ways of using it:
>
> 1. Run each process in a separate instance of gdb.  If your client
> supports X forwarding, you can launch each one in an xterm:
>
> $ mpiexec -n 4 xterm -e gdb ./mpi_program
>
> 2. Attach gdb to running processes.  To do this, ssh to the compute node
> where the process of interest is running and find its PID using the Unix
> ps command.  Then you can:
>
> $ gdb --pid PID ./mpi_program
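>
> A variation on 2. that sometimes helps: have each rank print its pid and
> spin until you attach and clear a flag from inside the debugger.  A
> minimal sketch (getpid/sleep are GNU extensions; build with -g -O0 so the
> flag stays visible):
>
> integer :: hold, rank, ierr
> call mpi_comm_rank(mpi_comm_world, rank, ierr)
> hold = 1
> print *, 'rank', rank, 'pid', getpid()
> do while (hold /= 0)
>    call sleep(1)       ! attach with gdb --pid <pid>, then: set var hold = 0
> end do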
>
> Good luck,
>    ~Jim.
>
> On 02/03/2011 12:24 PM, Michele Rosso wrote:
>> Hi Jim,
>>
>> thanks for your reply.
>> Well, I discovered that my program crashed when I attempted to free the
>> datatypes because I used the following invocation:
>>
>> call mpi_type_free(datatype)
>>
>> instead of the correct one:
>>
>> call mpi_type_free(datatype,ierr)
>>
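>> Incidentally, using the MPI Fortran 90 module instead of "include
>> 'mpif.h'" should catch the missing argument at compile time, since the
>> module provides explicit interfaces.  A minimal sketch of the idea:
>>
>> program free_example
>>   use mpi                              ! instead of: include 'mpif.h'
>>   integer :: datatype, ierr
>>   call mpi_init(ierr)
>>   call mpi_type_contiguous(4, mpi_double_complex, datatype, ierr)
>>   call mpi_type_commit(datatype, ierr)
>>   call mpi_type_free(datatype, ierr)   ! dropping ierr here fails to compile
>>   call mpi_finalize(ierr)
>> end program free_example
>>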
>> Also, I solved the problem behind the "Pending request" error message:
>> I was using incorrect sizes for the send and receive buffers. Still, I
>> wonder why I did not receive an error message saying that.
>>
>> Now I have no more problems. Thanks for your help.
>> I have a final question: can you suggest a good tutorial for parallel
>> debugging? I know that a lot of people use gdb (or ddd) for this purpose,
>> but I am not able to get it to work. I do not need something powerful,
>> just an easy tool for simple checking of variables.
>>
>>
>> Best,
>>
>> Michele
>>
>> -----Original Message-----
>> From: James Dinan<dinan at mcs.anl.gov>
>> Reply-to: mpich-discuss at mcs.anl.gov
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] DataType Problem
>> Date: Thu, 03 Feb 2011 11:13:51 -0600
>>
>> Hi Michele,
>>
>> Why are you unable to free the MPI types?
>>
>> In general, it is safe to free datatypes when you are finished passing
>> them into MPI calls.  The MPI implementation will continue to hold on to
>> the datatypes internally if they are needed for communication you have
>> already issued and free them only when all of those operations complete.
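>>
>> For instance, this ordering is legal (just a sketch; the vector type and
>> the buffer/request names are illustrative):
>>
>> call mpi_type_vector(n, 1, stride, mpi_double_complex, newtype, ierr)
>> call mpi_type_commit(newtype, ierr)
>> call mpi_isend(buf, 1, newtype, dest, tag, mpi_comm_world, req, ierr)
>> call mpi_type_free(newtype, ierr)    ! safe: MPI keeps its own reference
>> call mpi_wait(req, mpi_status_ignore, ierr)   ! type actually released after this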
>>
>> Best,
>>     ~Jim.
>>
>> On 01/31/2011 02:03 PM, Michele Rosso wrote:
>>> Hi Jim,
>>>
>>> first of all thanks a lot for your help.
>>> Over the weekend I completely rewrote the subroutine. Now it works,
>>> but I still have small problems.
>>>
>>> What I am trying to accomplish is a matrix transposition.
>>> I need to perform a 3D FFT on an N^3 matrix, where N is always a power
>>> of 2.
>>> I am using a 2D domain decomposition. The problem is that the first FFT
>>> (from real to complex) results in N/2+1 points along coordinate 1.
>>> When I transpose from direction 1 to direction 2 (along columns in my
>>> setup), not all the processors have the same amount of data.
>>>
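>>> (To make the imbalance concrete: with a plain block split of the N/2+1
>>> points over the process rows, the local count looks something like the
>>> sketch below, where 'prow' and 'me' are placeholders for the number of
>>> process rows and this process's row index.)
>>>
>>> nloc = (N/2 + 1) / prow                        ! base count per process row
>>> if (me < mod(N/2 + 1, prow)) nloc = nloc + 1   ! first rows take one extra point
>>>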
>>> I attached the current subroutine I am using (do not consider sgt23_pmu
>>> since it will serve for the transposition 2->3 and it is not complete
>>> yet).
>>>
>>> If I perform the direct transposition (1-->2) with the new subroutine,
>>> it works perfectly. The only problem is that it crashes if I try to free
>>> the datatype at the end.
>>>
>>> If I perform the inverse transposition (2-->1), it works as expected
>>> only if I use a number of processors which is a perfect square (say
>>> 16).
>>> If I try with 8 processors, for example (so the domain is decomposed
>>> into beams with a rectangular base), the inverse transposition does not
>>> work anymore and I receive the following message:
>>>
>>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1eb9390, scnts=0x1e9c780,
>>> sdispls=0x1e9ee80, stypes=0x1e9eea0, rbuf=0x1e9dc70, rcnts=0x1e9eec0,
>>> rdispls=0x1e9f100, rtypes=0x1e9ca90, comm=0x84000004) failed
>>> (unknown)(): Pending request (no error)
>>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1cf5370, scnts=0x1cf1780,
>>> sdispls=0x1cf3e80, stypes=0x1cf3ea0, rbuf=0x1cf2c70, rcnts=0x1cf3ec0,
>>> rdispls=0x1cf4100, rtypes=0x1cf1a70, comm=0x84000004) failed
>>> (unknown)(): Pending request (no error)
>>> rank 5 in job 26  enterprise_57863   caused collective abort of all
>>> ranks
>>>      exit status of rank 5: killed by signal 9
>>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x218e5a0, scnts=0x2170780,
>>> sdispls=0x2170a40, stypes=0x2170a60, rbuf=0x218c180, rcnts=0x2170a80,
>>> rdispls=0x2170d10, rtypes=0x2170d30, comm=0x84000004) failed
>>> (unknown)(): Pending request (no error)
>>>
>>> So my problems are essentially two:
>>>
>>> 1) Unable to free the datatypes
>>>
>>> 2) Unable to perform the backward transposition when N_proc is not a
>>> perfect square.
>>>
>>> Again, thanks a lot for your help,
>>>
>>> Michele
>>>
>>> On Mon, 2011-01-31 at 13:44 -0600, James Dinan wrote:
>>>> Hi Michele,
>>>>
>>>> I've attached a small test case derived from what you sent.  This runs
>>>> fine for me with the integer change suggested below.
>>>>
>>>> I'm still a little confused about the need for
>>>> mpi_type_create_resized().  You're setting the lower bound to 1 and the
>>>> extent to the size of a double complex.  These adjustments are in bytes,
>>>> so if I'm interpreting this correctly you are effectively shifting the
>>>> beginning of the data type 1 byte into the first value in the array and
>>>> then accessing a full double complex from that location.  This seems
>>>> like it's probably not what you would want to do.
>>>>
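>>>> Usually a resize only shrinks or stretches the extent and leaves the
>>>> lower bound at 0, e.g. something along these lines ('coltype' is just a
>>>> stand-in for whatever type you are resizing):
>>>>
>>>> integer :: coltype, newtype, dcsize, ierr
>>>> integer (kind=MPI_ADDRESS_KIND) :: lb, extent
>>>>
>>>> call mpi_type_size(mpi_double_complex, dcsize, ierr)
>>>> lb = 0                  ! keep the start of the type where it is
>>>> extent = dcsize         ! consecutive elements begin one double complex apart
>>>> call mpi_type_create_resized(coltype, lb, extent, newtype, ierr)
>>>> call mpi_type_commit(newtype, ierr)
>>>>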
>>>> Could you explain the subset of the data you're trying to cover with the
>>>> datatype?
>>>>
>>>> Thanks,
>>>>      ~Jim.
>>>>
>>>> On 01/31/2011 11:13 AM, James Dinan wrote:
>>>>> Hi Michele,
>>>>>
>>>>> Another quick comment:
>>>>>
>>>>> Don't forget to free your MPI datatypes when you're finished with them.
>>>>> This shouldn't cause the error you're seeing, but it can be a resource
>>>>> leak that builds up over time if you call this routine frequently.
>>>>>
>>>>> call mpi_type_free(temp, errorMPI)
>>>>> call mpi_type_free(temp2, errorMPI)
>>>>> call mpi_type_free(temp3, errorMPI)
>>>>>
>>>>> Best,
>>>>> ~Jim.
>>>>>
>>>>> On 01/31/2011 11:07 AM, James Dinan wrote:
>>>>>> Hi Michele,
>>>>>>
>>>>>> I'm looking this over and trying to put together a test case from the
>>>>>> code you sent. One thing that looks questionable is the type for 'ext'.
>>>>>> The call to mpi_type_size wants a default integer, whereas the
>>>>>> mpi_type_create_resized calls want an integer of kind=MPI_ADDRESS_KIND.
>>>>>> Could you try adding something like this:
>>>>>>
>>>>>> integer :: dcsize
>>>>>> integer (kind=MPI_ADDRESS_KIND) :: ext
>>>>>>
>>>>>> call mpi_type_size( mpi_double_complex , dcsize , errorMPI)
>>>>>> ext = dcsize
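>>>>>>
>>>>>> and then pass 'ext' (together with an MPI_ADDRESS_KIND lower bound) into
>>>>>> the resize call, e.g. something like this, where oldtype/newtype are just
>>>>>> placeholders for your types:
>>>>>>
>>>>>> integer (kind=MPI_ADDRESS_KIND) :: lb
>>>>>>
>>>>>> lb = 0   ! or whatever lower bound you intend, also as MPI_ADDRESS_KIND
>>>>>> call mpi_type_create_resized(oldtype, lb, ext, newtype, errorMPI)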
>>>>>>
>>>>>> Thanks,
>>>>>> ~Jim.
>>>>>>
>>>>>> On 01/30/2011 02:15 AM, Michele Rosso wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I am developing a subroutine to handle the communication inside a group
>>>>>>> of processors.
>>>>>>> The source code is attached.
>>>>>>>
>>>>>>> This subroutine is contained in a module and accesses much of the data
>>>>>>> it needs, as well as the header "mpi.h", from another module (pmu_var).
>>>>>>>
>>>>>>> As an input I have a 3D array (work1) which is allocated in the main
>>>>>>> program. As an output I have another 3D array (work2) which is also
>>>>>>> allocated in the main program. Both of them are of type complex and
>>>>>>> have intent INOUT (I want to use the subroutine in a reversible way).
>>>>>>>
>>>>>>> Since the data I want to send are not contiguous, I defined several data
>>>>>>> types. Then I tested all of them with a simple send-receive
>>>>>>> communication in the group "mpi_comm_world".
>>>>>>> The problem arises when I test the data type "temp3": the execution of
>>>>>>> the program stops and I receive the error:
>>>>>>>
>>>>>>> rank 0 in job 8 enterprise_45569 caused collective abort of all ranks
>>>>>>> exit status of rank 0: killed by signal 9
>>>>>>>
>>>>>>> Notice that work1 and work2 have different sizes but the same shape,
>>>>>>> and the data types should be consistent with them.
>>>>>>>
>>>>>>> Does anyone have an idea of what the problem could be?
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Michele
>>>>>>>
>>>>>>>
>>>>>>>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

