[MPICH] debug flag

Wei-keng Liao wkliao at ece.northwestern.edu
Tue May 29 01:32:55 CDT 2007


Howard,

Thanks for this information. It is very helpful. I was able to find more 
details by using the debug build of MPICH. Below is what I found from the 
coredump that may help in debugging the ROMIO source.

1. The coredump is from MPI rank 2919. I allocated 4000 MPI processes
    (2000 nodes, each node has 2 CPUs). I am checking the mpich2-1.0.2 source.

2. MPI_Abort() is called at line 97 by function ADIOI_Calc_aggregator(),
    in file ad_aggregate.c, where
    rank_index = 5335, fd->hints->cb_nodes = 2000, off = 2802007600,
    min_off = 0, fd_size = 525164 (fd_size should be 11000000)

3. It is function ADIOI_Calc_my_req() that called ADIOI_Calc_aggregator() at
    line 240 of file ad_aggregate.c, where
    i = 0 in the loop    for (i=0; i < contig_access_count; i++)
    off = 2802007600, min_st_offset = 0, fd_len = 400, fd_size = 525164
    (fd_size should be 11000000)

4. It is function ADIOI_GEN_WriteStridedColl() that called ADIOI_Calc_my_req()
    at line 170 of file ad_write_coll.c.
    I wanted to see what went wrong with fd_size in function
    ADIOI_Calc_file_domains(), where fd_size is set, and saw that fd_size is
    determined by st_offsets[] and end_offsets[], which depend on the
    variables start_offset and end_offset.

    So, I went a few lines up and checked the values of the variables
    start_offset and end_offset. They were set by ADIOI_Calc_my_off_len()
    at line 101, and I found that the value of end_offset must be wrong!
    end_offset should always be >= start_offset, but the core shows that
        start_offset = 2802007600, end_offset = 244727039

    So, I looked into ADIOI_Calc_my_off_len() in ad_read_coll.c and checked
    the variable end_offset_ptr, which was set from the variable end_offset
    at line 453, since filetype_size > 0 and filetype_is_contig == 0.
    Hence, the only place end_offset is set is at line 420:
        end_offset = off + frd_size - 1;
    end_offset is determined by off and frd_size. However, frd_size is
    declared as an integer, while end_offset is an ADIO_Offset. Maybe it
    is a type overflow! At line 351, I can see a type cast
        frd_size = (int) (disp + flat_file->indices[i] + ...

    Something is fishy here. Unfortunately, the coredump does not cover this
    part. It looks like interactive debugging with a breakpoint cannot be
    avoided. (A small sketch of the suspected truncation follows below.)
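
Here is a minimal, self-contained C sketch (not the ROMIO code itself) that
replays the numbers from the coredump. It assumes ADIO_Offset is a 64-bit
integer and that the aggregator index is computed roughly as
(off - min_off) / fd_size; both are simplifications for illustration only,
and the value cast to int at the end is hypothetical.

    #include <stdio.h>

    /* Stand-in for ADIO_Offset; assumed to be a 64-bit integer type. */
    typedef long long ADIO_Offset;

    int main(void)
    {
        /* Values taken from the coredump above. */
        ADIO_Offset off      = 2802007600LL;  /* exceeds INT_MAX (2147483647) */
        ADIO_Offset min_off  = 0;
        ADIO_Offset fd_size  = 525164;        /* observed (wrong) value       */
        ADIO_Offset expected = 11000000;      /* what fd_size should be       */
        int cb_nodes         = 2000;

        /* Approximate aggregator-index arithmetic: a too-small fd_size
         * pushes rank_index far beyond cb_nodes, triggering MPI_Abort(). */
        printf("rank_index with bad  fd_size = %lld (cb_nodes = %d)\n",
               (off - min_off) / fd_size, cb_nodes);       /* 5335 */
        printf("rank_index with good fd_size = %lld\n",
               (off - min_off) / expected);                /* 254  */

        /* Suspected truncation: a 64-bit quantity cast to int, as in
         * "frd_size = (int) (disp + flat_file->indices[i] + ...)", wraps,
         * so "end_offset = off + frd_size - 1" can come out smaller than
         * start_offset. The value below is hypothetical.               */
        ADIO_Offset big = 2802007600LL;
        int frd_size    = (int) big;          /* wraps to a negative value */
        printf("%lld cast to int becomes %d\n", big, frd_size);

        return 0;
    }

Compiled with any C99 compiler, it shows that the too-small fd_size yields
rank_index = 5335 > cb_nodes = 2000, and that a value above INT_MAX wraps
when cast to int, which would make end_offset smaller than start_offset.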

Wei-keng


On Mon, 28 May 2007, Howard Pritchard wrote:

> Hello Wei-keng,
>
> Here is a way on xt/qk systems to compile with the debug mpich2 library:
>
> 1) do
>   module show xt-mpt
>
>   to see which mpich2 the system manager has made the default.
>
> For instance, on an internal system here at Cray, this command shows:
>
> -------------------------------------------------------------------
> /opt/modulefiles/xt-mpt/1.5.49:
>
> setenv           MPT_DIR /opt/xt-mpt/1.5.49
> setenv           MPICHBASEDIR /opt/xt-mpt/1.5.49/mpich2-64
> setenv           MPICH_DIR /opt/xt-mpt/1.5.49/mpich2-64/P2
> setenv           MPICH_DIR_FTN_DEFAULT64 /opt/xt-mpt/1.5.49/mpich2-64/P2W
> prepend-path     LD_LIBRARY_PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/lib
> prepend-path     PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/bin
> prepend-path     MANPATH /opt/xt-mpt/1.5.49/mpich2-64/man
> prepend-path     MANPATH /opt/xt-mpt/1.5.49/romio/man
> prepend-path     PE_PRODUCT_LIST MPT
> -------------------------------------------------------------------
>
> The debug library you want to use is thus going to be picked up by the
> mpicc installed at:
>
> /opt/xt-mpt/1.5.49/mpich2-64/P2DB
>
> 2) Now, with the Cray compiler scripts like cc, ftn, etc., you specify the
> alternate location to use for compiling/linking by
>
> cc -driverpath=/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin -o a.out.debug ......
>
> or whichever path is appropriate for the xt-mpt installed on your system.
>
> 3) When you rerun the binary, you may want to set the MPICH_DBMASK
> environment variable to 0x200.
>
> I am pretty sure you are running out of memory, based on the area in
> ADIOI_Calc_my_req where the error arises.  Clearly this is not a very
> good way to report an OOM condition.  I'll investigate.
>
> You may be able to save some memory by tweaking the environment
> variables controlling MPI buffer space.  Refer to the intro_mpi man page
> on your xt/qk system.
>
> Hope this helps,
>
> Howard
>
> Wei-keng Liao wrote:
>
>> 
>> Well, I am aware of mpich2version, but unfortunately that command is not 
>> available to users on that machine. The only commands available to me are
>> mpicc, mpif77, mpif90, and mpicxx.
>> 
>> Wei-keng
>> 
>> 
>> On Fri, 25 May 2007, Anthony Chan wrote:
>> 
>>> 
>>> <mpich2-install-dir>/bin/mpich2version may show if --enable-g is set.
>>> 
>>> A.Chan
>>> 
>>> On Fri, 25 May 2007, Wei-keng Liao wrote:
>>> 
>>>> 
>>>> The problem is that I cannot run my own MPICH on the machine. I can see
>>>> that the MPICH I am using is version mpich2-1.0.2 from peeking at the
>>>> mpif90 script. Is there a way to know if it was built with the
>>>> --enable-g=dbg option from the mpif90 script?
>>>> 
>>>> I don't know if this helps, but below is the whole error message:
>>>> 
>>>> aborting job:
>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process <id>
>>>> (there are 4000 lines, each with a distinct id number)
>>>> 
>>>> ----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
>>>> 
>>>> PROCESSOR [ 0]
>>>> log_nid  =    15  phys_nid  = 0x98  host_id =   7691  host_pid  = 18545
>>>> group_id = 12003  num_procs = 4000  rank    =     15  local_pid =    3
>>>> base_node_index =    0   last_node_index = 1999
>>>> 
>>>> text_base  = 0x00000000200000   text_len  = 0x00000000400000
>>>> data_base  = 0x00000000600000   data_len  = 0x00000000a00000
>>>> stack_base = 0x000000fec00000   stack_len = 0x00000001000000
>>>> heap_base  = 0x00000001200000   heap_len  = 0x0000007b000000
>>>> 
>>>> ss  = 0x000000000000001f  fs  = 000000000000000000  gs  = 0x0000000000000017
>>>> rip = 0x00000000002d46fe
>>>> rdi = 0x0000000006133a90  rsi = 0xffffffffdc0003c2  rbp = 0x00000000ffbf9d40
>>>> rsp = 0x00000000ffbf9cc0  rbx = 0x0000000000000190  rdx = 0x000000003eb08c39
>>>> rcx = 0x0000000008ea18b0  rax = 0x0000000008ecff30  cs  = 0x000000000000001f
>>>> R8  = 0x0000000007ad2ab0  R9  = 0xfffffffffffffe0c  R10 = 0x0000000008e6bd30
>>>> R11 = 0x0000000000000262  R12 = 0x0000000000000a8c  R13 = 0xfffffffff0538770
>>>> R14 = 0x00000000fffffe0c  R15 = 0x0000000008ed3dc0
>>>> rflg = 0x0000000000010206   prev_sp = 0x00000000ffbf9cc0
>>>> error_code = 6
>>>> 
>>>> SIGNAL #[11][Segmentation fault]  fault_address = 0xffffffff78ed4cc8
>>>>   0xffbf9cc0  0x        ffbf9cf0 0x             fa0 0x       a00006b6c 0x     a8c3e9ab7ff
>>>>   0xffbf9ce0  0x         8ed7c50 0x             7d0 0x               0 0x    6b6c002d455b
>>>>   0xffbf9d00  0x         8ea18b0 0x         8e6bd30 0x         61338a0 0x             fa0
>>>>   0xffbf9d20  0x               0 0x         61338a0 0x           8036c 0x         8ec4390
>>>>   0xffbf9d40  0x        ffbf9e80 0x          2d2280 0x         8ecff30 0x           8036c
>>>>   0xffbf9d60  0x             fa0 0x        ffbf9de4 0x        ffbf9de8 0x        ffbf9df0
>>>>   0xffbf9d80  0x        ffbf9df8 0x               0 0x               0 0x         8ebc680
>>>>   0xffbf9da0  0x            1770 0x     7d000a39f88 0x               0 0x      650048174f
>>>>   0xffbf9dc0  0x  14fb184c000829 0x         6a93500 0x        ffbf9e30 0x          292e54
>>>>   0xffbf9de0  0x               0 0x         8ed3dc0 0x             7af 0x               0
>>>>   0xffbf9e00  0x               0 0x         8ecc0a0 0x         8ecff30 0x           8036c
>>>>   0xffbf9e20  0x       100000014 0x         8e6bd30 0x         8ea18b0 0x            1770
>>>>   0xffbf9e40  0xffffffff6793163f 0x    6b6c00a39fa8 0x     fa00000000f 0x         61338a0
>>>>   0xffbf9e60  0x        4c000829 0x          14fb18 0x              65 0x               0
>>>>   0xffbf9e80  0x        ffbf9ee0 0x          2a397c 0x          866b60 0x        ffbf9eb0
>>>> 
>>>> 
>>>> Stack Trace:  ------------------------------
>>>> #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>>> #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>> #2  0x00000000002a397c in MPIOI_File_write_all()
>>>> #3  0x00000000002a3a4a in PMPI_File_write_all()
>>>> #4  0x00000000002913a8 in pmpi_file_write_all_()
>>>> could not find symbol for addr 0x73696e6966204f49
>>>> --------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, 25 May 2007, Robert Latham wrote:
>>>> 
>>>>> On Fri, May 25, 2007 at 03:56:16PM -0500, Wei-keng Liao wrote:
>>>>> 
>>>>>> 
>>>>>> I have an MPI I/O application that runs fine up to 1000 processes, but
>>>>>> fails when using 4000 processes. Parts of the error message are
>>>>>>     ...
>>>>>>     Stack Trace:  ------------------------------
>>>>>>     #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>>>>>     #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>>>>     #2  0x00000000002a397c in MPIOI_File_write_all()
>>>>>>     #3  0x00000000002a3a4a in PMPI_File_write_all()
>>>>>>     #4  0x00000000002913a8 in pmpi_file_write_all_()
>>>>>>     could not find symbol for addr 0x73696e6966204f49
>>>>>>     aborting job:
>>>>>>     application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
>>>>>>     ...
>>>>>> 
>>>>>> My question is: what debug flags should I use for compiling and running
>>>>>> in order to help find the exact location in function ADIOI_Calc_my_req()
>>>>>> that causes this error?
>>>>> 
>>>>> 
>>>>> Hi Wei-keng
>>>>> 
>>>>> If you build MPICH2 with --enable-g=dbg, then all of MPI will be built
>>>>> with debugging symbols.   Be sure to 'make clean' first: the ROMIO
>>>>> objects might not rebuild otherwise.
>>>>> 
>>>>> I wonder what caused the abort.  Maybe ADIOI_Malloc failed to allocate
>>>>> memory?  Well, a stack trace with debugging symbols should be
>>>>> interesting.
>>>>> 
>>>>> ==rob
>>>>> 
>>>>> -- 
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
>>>>> Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>



