[mpich-discuss] DataType Problem

Michele Rosso michele.rosso84 at gmail.com
Thu Feb 3 12:39:56 CST 2011


Thank you Jim.
No problem: I am aware that parallel debugging is a tough
topic!

Michele



-----Original Message-----
From: James Dinan <dinan at mcs.anl.gov>
Reply-to: mpich-discuss at mcs.anl.gov
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] DataType Problem
Date: Thu, 03 Feb 2011 12:35:49 -0600

Hi Michele,

I don't know of a good tutorial on parallel debugging.  Can others 
suggest something?

Gdb doesn't know how to debug MPI parallel executions, so you have a 
couple ways of using it:

1. Run each process in a separate instance of gdb.  If your client 
supports X forwarding, you can launch each one in an xterm:

$ mpiexec -n 4 xterm -e gdb ./mpi_program

2. Attach gdb to running processes.  To do this, ssh to the compute node 
where the process of interest is running and find its pid using the unix 
ps command.  Then you can:

$ gdb --pid PID ./mpi_program
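
Once you are in gdb and the process is running, the usual commands apply; 
a minimal session might look something like this, where my_variable is 
just a placeholder for whatever you want to inspect:

(gdb) bt                 # see where this process is stuck
(gdb) frame 2            # select a frame in your own code
(gdb) print my_variable  # inspect a variable
(gdb) continue           # let the process keep running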

Good luck,
  ~Jim.

On 02/03/2011 12:24 PM, Michele Rosso wrote:
> Hi Jim,
>
> thanks for your reply.
> Well, I discovered that my program crashed when I attempted to free the
> datatypes because I used the following invocation:
>
> call mpi_type_free(datatype)
>
> instead of the correct one:
>
> call mpi_type_free(datatype,ierr)
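>
> Just as a sketch, the same call with the return code checked explicitly
> (ierr here is an ordinary integer):
>
> integer :: ierr
> call mpi_type_free(datatype, ierr)
> if (ierr /= MPI_SUCCESS) print *, 'mpi_type_free failed: ', ierr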
>
> Also, I solved the problem of the error message "pending request".
> I was using incorrect sizes for the send and receive buffers. Still, I
> wonder why I did not receive an error message saying so.
>
> So now I have no more problems. Thanks for your help.
> I have one final question: can you suggest a good tutorial for parallel
> debugging? I know that a lot of people use gdb (or ddd) for this purpose,
> but I am not able to get it to work. I do not need something powerful,
> just an easy tool for simple checking of variables.
>
>
> Best,
>
> Michele
>
> -----Original Message-----
> From: James Dinan<dinan at mcs.anl.gov>
> Reply-to: mpich-discuss at mcs.anl.gov
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] DataType Problem
> Date: Thu, 03 Feb 2011 11:13:51 -0600
>
> Hi Michele,
>
> Why are you unable to free the MPI types?
>
> In general, it is safe to free datatypes when you are finished passing
> them into MPI calls.  The MPI implementation will continue to hold on to
> the datatypes internally if they are needed for communication you have
> already issued and free them only when all of those operations complete.
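>
> For example, a sketch of the typical lifecycle (all names here are made up):
>
> call mpi_type_vector(count, blocklen, stride, MPI_DOUBLE_COMPLEX, newtype, ierr)
> call mpi_type_commit(newtype, ierr)
> call mpi_isend(buf, 1, newtype, dest, tag, comm, request, ierr)
> call mpi_type_free(newtype, ierr)    ! safe: the type is only marked for deletion
> call mpi_wait(request, status, ierr) ! and is released once the isend completes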
>
> Best,
>    ~Jim.
>
> On 01/31/2011 02:03 PM, Michele Rosso wrote:
>> Hi Jim,
>>
>> first of all, thanks a lot for your help.
>> Over the weekend I completely rewrote the subroutine. Now it works,
>> but I still have some small problems.
>>
>> What I am trying to accomplish is a matrix transposition.
>> I need to perform a 3D FFT on an N^3 matrix, where N is always a power
>> of 2.
>> I am using a 2D domain decomposition. The problem is that the first FFT
>> (from real to complex) results in N/2+1 points along coordinate 1.
>> When I transpose from direction 1 to direction 2 (along columns in my
>> setup), not all the processors have the same amount of data.
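>>
>> Roughly, with P processes along that direction and M = N/2+1 points to
>> split, the local counts come out something like the sketch below (P,
>> myrank and nlocal are just placeholder names), which is why the
>> Alltoallw counts and types differ from rank to rank:
>>
>> M    = N/2 + 1
>> base = M / P           ! integer division
>> rem  = mod(M, P)
>> if (myrank < rem) then ! the first 'rem' ranks hold one extra point
>>    nlocal = base + 1
>> else
>>    nlocal = base
>> end if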
>>
>> I attached the current subroutine I am using (do not consider sgt23_pmu
>> since it will serve for the transposition 2->3 and it is not complete
>> yet).
>>
>> If I perform the direct transposition (1-->2) with the new subroutine,
>> it works perfectly. The only problem is that it crashes if I try to free
>> the datatype at the end.
>>
>> If I perform the inverse transposition (2-->1), it works as expected
>> only if I use a number of processors which is a perfect square (say 16).
>> If I try with 8 processors, for example (so the domain is decomposed
>> into beams with a rectangular base), the inverse transposition does not
>> work anymore and I receive the following message:
>>
>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1eb9390, scnts=0x1e9c780,
>> sdispls=0x1e9ee80, stypes=0x1e9eea0, rbuf=0x1e9dc70, rcnts=0x1e9eec0,
>> rdispls=0x1e9f100, rtypes=0x1e9ca90, comm=0x84000004) failed
>> (unknown)(): Pending request (no error)
>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x1cf5370, scnts=0x1cf1780,
>> sdispls=0x1cf3e80, stypes=0x1cf3ea0, rbuf=0x1cf2c70, rcnts=0x1cf3ec0,
>> rdispls=0x1cf4100, rtypes=0x1cf1a70, comm=0x84000004) failed
>> (unknown)(): Pending request (no error)
>> rank 5 in job 26  enterprise_57863   caused collective abort of all
>> ranks
>>     exit status of rank 5: killed by signal 9
>> Fatal error in MPI_Alltoallw: Pending request (no error), error stack:
>> MPI_Alltoallw(485): MPI_Alltoallw(sbuf=0x218e5a0, scnts=0x2170780,
>> sdispls=0x2170a40, stypes=0x2170a60, rbuf=0x218c180, rcnts=0x2170a80,
>> rdispls=0x2170d10, rtypes=0x2170d30, comm=0x84000004) failed
>> (unknown)(): Pending request (no error)
>>
>> So my problems are essentially two:
>>
>> 1) Unable to free the datatypes
>>
>> 2) Unable to perform the backward transposition when N_proc is not a
>> perfect square.
>>
>> Again, thanks a lot for your help,
>>
>> Michele
>>
>> On Mon, 2011-01-31 at 13:44 -0600, James Dinan wrote:
>>> Hi Michele,
>>>
>>> I've attached a small test case derived from what you sent.  This runs
>>> fine for me with the integer change suggested below.
>>>
>>> I'm still a little confused about the need for
>>> mpi_type_create_resized().  You're setting the lower bound to 1 and the
>>> extent to the size of a double complex.  These adjustments are in bytes,
>>> so if I'm interpreting this correctly you are effectively shifting the
>>> beginning of the data type 1 byte into the first value in the array and
>>> then accessing a full double complex from that location.  This seems
>>> like it's probably not what you would want to do.
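>>>
>>> If the goal was just to force the extent to one double complex (so that
>>> consecutive elements of the resized type line up with consecutive array
>>> elements), I would have expected something more like this sketch, with a
>>> zero lower bound (oldtype and newtype are placeholders, dcsize is the
>>> value returned by mpi_type_size for a double complex):
>>>
>>> integer (kind=MPI_ADDRESS_KIND) :: lb, ext
>>> lb  = 0
>>> ext = dcsize
>>> call mpi_type_create_resized(oldtype, lb, ext, newtype, ierr)
>>> call mpi_type_commit(newtype, ierr)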
>>>
>>> Could you explain the subset of the data you're trying to cover with the
>>> datatype?
>>>
>>> Thanks,
>>>     ~Jim.
>>>
>>> On 01/31/2011 11:13 AM, James Dinan wrote:
>>>> Hi Michele,
>>>>
>>>> Another quick comment:
>>>>
>>>> Don't forget to free your MPI datatypes when you're finished with them.
>>>> This shouldn't cause the error you're seeing, but it can be a resource
>>>> leak that builds up over time if you call this routine frequently.
>>>>
>>>> call mpi_type_free(temp, errorMPI)
>>>> call mpi_type_free(temp2, errorMPI)
>>>> call mpi_type_free(temp3, errorMPI)
>>>>
>>>> Best,
>>>> ~Jim.
>>>>
>>>> On 01/31/2011 11:07 AM, James Dinan wrote:
>>>>> Hi Michele,
>>>>>
>>>>> I'm looking this over and trying to put together a test case from the
>>>>> code you sent. One thing that looks questionable is the type for 'ext'.
>>>>> The call to mpi_type_size wants a default integer, whereas the
>>>>> mpi_type_create_resized calls want an integer of kind=MPI_ADDRESS_KIND.
>>>>> Could you try adding something like this:
>>>>>
>>>>> integer :: dcsize
>>>>> integer (kind=MPI_ADDRESS_KIND) :: ext
>>>>>
>>>>> call mpi_type_size( mpi_double_complex , dcsize , errorMPI)
>>>>> ext = dcsize
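>>>>>
>>>>> Or, if it is simpler, mpi_type_get_extent returns the extent directly
>>>>> with the right kind; something like:
>>>>>
>>>>> integer (kind=MPI_ADDRESS_KIND) :: lb, ext
>>>>> call mpi_type_get_extent( mpi_double_complex , lb , ext , errorMPI)
>>>>>
>>>>> Either way should work.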
>>>>>
>>>>> Thanks,
>>>>> ~Jim.
>>>>>
>>>>> On 01/30/2011 02:15 AM, Michele Rosso wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I am developing a subroutine to handle the communication inside a group
>>>>>> of processors.
>>>>>> The source code is attached.
>>>>>>
>>>>>> The subroutine is contained in a module and accesses much of the data
>>>>>> it needs, as well as the header "mpi.h", from another module (pmu_var).
>>>>>>
>>>>>> As an input I have a 3D array (work1) which is allocated in the main
>>>>>> program. As an output I have another 3D array (work2), which is
>>>>>> allocated in the main program too. Both of them are of type complex and
>>>>>> have intent INOUT (I want to use the subroutine in a reversible way).
>>>>>>
>>>>>> Since the data I want to send are not contiguous, I defined several
>>>>>> derived datatypes. Then I tested all of them with a simple send-receive
>>>>>> communication in the group "mpi_comm_world".
>>>>>> The problem arises when I test the data type "temp3": the execution of
>>>>>> the program stops and I receive the error:
>>>>>>
>>>>>> rank 0 in job 8 enterprise_45569 caused collective abort of all ranks
>>>>>> exit status of rank 0: killed by signal 9
>>>>>>
>>>>>> Notice that work1 and work2 have different sizes but the same shape,
>>>>>> and the datatypes should be consistent with them.
>>>>>>
>>>>>> Does anyone have an idea of what the problem could be?
>>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Michele
>>>>>>
>>>>>>
>>>>>>

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss



