[MPICH] debug flag

Wei-keng Liao wkliao at ece.northwestern.edu
Wed May 30 11:48:32 CDT 2007


I just got the results with aggregation disabled. The core dump was
generated by rank 784 (out of 4000) and shows the following:

ad_aggregate.c:242
     proc = -603978814     <-- !?
     off = -166212992      <-- !?
     min_st_offset = 0
     fd_len = 400
     fd_size = 262582  <-- should be 11000000

Going up one level to ad_write_coll.c:170,
     below are some of the variables set by ADIOI_Calc_my_off_len() at line 101:
     count 1375000
     offset = 0
     start_offset = 407601600
     end_offset = -2149678961    <-- should be 839993999
     contig_access_count = 27500

I suspect the file type is not flattened correctly.
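
As a rough check of the overflow hypothesis, the sketch below tests whether
the two bad fd_size values reported in this thread could come from a 32-bit
wraparound. It rests on assumptions that the core dumps do not confirm: that
the full access region is 4000 x 11000000 bytes, that this extent passed
through a 32-bit variable somewhere, and that fd_size is a ceiling division
of the extent by the number of aggregators (4000 in this run, 2000 in the
earlier run with cb_nodes = 2000). The helper names wrap32, ceil_div, and
wrapped_fd_size are made up for illustration and are not ROMIO functions.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit byte count,
 * mimicking a value that was squeezed through a 32-bit variable. */
static uint32_t wrap32(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Hypothetical helper: ceiling division. This is only the ASSUMED shape
 * of the fd_size computation, not code copied from ROMIO. */
static uint64_t ceil_div(uint64_t total, uint64_t parts)
{
    return (total + parts - 1) / parts;
}

/* fd_size that would result if the total extent (nprocs * bytes_per_proc)
 * were truncated to 32 bits before being split among naggr aggregators. */
static uint64_t wrapped_fd_size(uint64_t nprocs, uint64_t bytes_per_proc,
                                uint64_t naggr)
{
    uint64_t extent = nprocs * bytes_per_proc;   /* 64-bit, no overflow here */
    return ceil_div(wrap32(extent), naggr);
}
```

Under those assumptions, wrapped_fd_size(4000, 11000000, 4000) evaluates to
262582 and wrapped_fd_size(4000, 11000000, 2000) evaluates to 525164, which
are the two fd_size values seen in the core dumps. That is at least
consistent with an offset being truncated to 32 bits somewhere on the way
to the file domain calculation.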

I have written a short C program that reproduces this I/O pattern. I ran
it on 4000 processes and it produced the same error. On 2000 processes, it
ran fine, just like my program. Let me know if you would like a copy of it.

Wei-keng


On Tue, 29 May 2007, Rajeev Thakur wrote:

> Can you try disabling aggregation and see if the error still remains? You
> can disable it by creating an info object as follows and passing it to
> File_set_view:
>     MPI_Info_set(info, "cb_config_list", "*:*");
>
> Rajeev
>
>> -----Original Message-----
>> From: owner-mpich-discuss at mcs.anl.gov
>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
>> Sent: Tuesday, May 29, 2007 1:33 AM
>> To: Howard Pritchard
>> Cc: mpich-discuss at mcs.anl.gov
>> Subject: Re: [MPICH] debug flag
>>
>> Howard,
>>
>> Thanks for this information. It is very helpful. I was able to find more
>> details by using the debug build of MPICH. Below is what I found from the
>> core dump that may help in debugging the ROMIO source.
>>
>> 1. The core dump is from MPI rank 2919. I allocated 4000 MPI processes
>>     (2000 nodes, each node has 2 CPUs). I am checking the mpich2-1.0.2
>>     source.
>>
>> 2. MPI_Abort() is called at line 97 by function ADIOI_Calc_aggregator()
>>     in file ad_aggregate.c, where
>>     rank_index = 5335, fd->hints->cb_nodes = 2000, off = 2802007600,
>>     min_off = 0, fd_size = 525164 (fd_size should be 11000000)
>>
>> 3. Function ADIOI_Calc_my_req() called ADIOI_Calc_aggregator() at
>>     line 240 of file ad_aggregate.c, where
>>     i = 0 in the loop    for (i=0; i < contig_access_count; i++)
>>     off = 2802007600, min_st_offset = 0, fd_len = 400, fd_size = 525164
>>     (fd_size should be 11000000)
>>
>> 4. Function ADIOI_GEN_WriteStridedColl() called ADIOI_Calc_my_req()
>>     at line 170 of file ad_write_coll.c.
>>     I wanted to see what went wrong with fd_size in function
>>     ADIOI_Calc_file_domains(), where fd_size is set, and saw that fd_size
>>     is determined by st_offsets[] and end_offsets[], which depend on
>>     variables start_offset and end_offset.
>>
>>     So, I went a few lines up and checked the values of variables
>>     start_offset and end_offset. They were set by ADIOI_Calc_my_off_len()
>>     at line 101, and I found that the value of end_offset must be wrong!
>>     end_offset should always be >= start_offset, but the core shows that
>>         start_offset = 2802007600, end_offset = 244727039
>>
>>     So, I looked into ADIOI_Calc_my_off_len() in ad_read_coll.c and
>>     checked variable end_offset_ptr, which was set from variable
>>     end_offset at line 453, since filetype_size > 0 and
>>     filetype_is_contig == 0.
>>     Hence, the only place end_offset is set is at line 420:
>>         end_offset = off + frd_size - 1;
>>     end_offset is determined by off and frd_size. However, frd_size
>>     is declared as an integer, while end_offset is ADIO_Offset. Maybe
>>     it is a type overflow! At line 351, I can see a type cast
>>         frd_size = (int) (disp + flat_file->indices[i] + ...
>>
>>     Something is fishy here. Unfortunately, the core dump does not cover
>>     this part. It looks like interactive debugging with a breakpoint
>>     cannot be avoided.
>>
>> Wei-keng
>>
>>
>> On Mon, 28 May 2007, Howard Pritchard wrote:
>>
>>> Hello Wei-keng,
>>>
>>> Here is a way on xt/qk systems to compile with the debug mpich2 library:
>>>
>>> 1) do
>>>   module show xt-mpt
>>>
>>>   to see which mpich2 the system manager has made the default.
>>>
>>>   For instance, on an internal system here at Cray this command shows:
>>>
>>> -------------------------------------------------------------------
>>> /opt/modulefiles/xt-mpt/1.5.49:
>>>
>>> setenv           MPT_DIR /opt/xt-mpt/1.5.49
>>> setenv           MPICHBASEDIR /opt/xt-mpt/1.5.49/mpich2-64
>>> setenv           MPICH_DIR /opt/xt-mpt/1.5.49/mpich2-64/P2
>>> setenv           MPICH_DIR_FTN_DEFAULT64 /opt/xt-mpt/1.5.49/mpich2-64/P2W
>>> prepend-path     LD_LIBRARY_PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/lib
>>> prepend-path     PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/bin
>>> prepend-path     MANPATH /opt/xt-mpt/1.5.49/mpich2-64/man
>>> prepend-path     MANPATH /opt/xt-mpt/1.5.49/romio/man
>>> prepend-path     PE_PRODUCT_LIST MPT
>>> -------------------------------------------------------------------
>>>
>>> The debug library you want to use is thus going to be picked up by the
>>> mpicc installed at:
>>>
>>> /opt/xt-mpt/1.5.49/mpich2-64/P2DB
>>>
>>> 2) Now with the Cray compiler scripts like cc, ftn, etc. you specify the
>>> alternate location to use for compiling/linking by
>>>
>>> cc -driverpath=/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin -o a.out.debug ......
>>>
>>> or whichever path is appropriate for the xt-mpt installed on your system.
>>>
>>> 3) When you rerun the binary, you may want to set the MPICH_DBMASK
>>> environment variable to 0x200.
>>>
>>> I am pretty sure you are running out of memory, based on the area in
>>> ADIOI_Calc_my_req where the error arises.  Clearly this is not a very
>>> good way to report an OOM condition.  I'll investigate.
>>>
>>> You may be able to save some memory by tweaking the environment
>>> variables controlling MPI buffer space.  Refer to the intro_mpi man page
>>> on your xt/qk system.
>>>
>>> Hope this helps,
>>>
>>> Howard
>>>
>>> Wei-keng Liao wrote:
>>>
>>>>
>>>> Well, I am aware of mpich2version, but unfortunately that command is not
>>>> available to users on that machine. The only commands available to me are
>>>> mpicc, mpif77, mpif90, and mpicxx.
>>>>
>>>> Wei-keng
>>>>
>>>>
>>>> On Fri, 25 May 2007, Anthony Chan wrote:
>>>>
>>>>>
>>>>> <mpich2-install-dir>/bin/mpich2version may show if --enable-g is set.
>>>>>
>>>>> A.Chan
>>>>>
>>>>> On Fri, 25 May 2007, Wei-keng Liao wrote:
>>>>>
>>>>>>
>>>>>> The problem is that I cannot run my own MPICH on the machine. I can see
>>>>>> that the MPICH I am using is version 2-1.0.2 from peeking at the mpif90
>>>>>> script. Is there a way to tell from the mpif90 script whether it was
>>>>>> built with the --enable-g=dbg option?
>>>>>>
>>>>>> I don't know if this helps, but below is the whole error message:
>>>>>>
>>>>>> aborting job:
>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process <id>
>>>>>> (there are 4000 lines, each with a distinct id number)
>>>>>>
>>>>>> ----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
>>>>>>
>>>>>> PROCESSOR [ 0]
>>>>>> log_nid  =    15  phys_nid  = 0x98  host_id =   7691
>> host_pid  = 18545
>>>>>> group_id = 12003  num_procs = 4000  rank    =     15
>> local_pid =    3
>>>>>> base_node_index =    0   last_node_index = 1999
>>>>>>
>>>>>> text_base  = 0x00000000200000   text_len  = 0x00000000400000
>>>>>> data_base  = 0x00000000600000   data_len  = 0x00000000a00000
>>>>>> stack_base = 0x000000fec00000   stack_len = 0x00000001000000
>>>>>> heap_base  = 0x00000001200000   heap_len  = 0x0000007b000000
>>>>>>
>>>>>> ss  = 0x000000000000001f  fs  = 000000000000000000  gs  =
>>>>>> 0x0000000000000017
>>>>>> rip = 0x00000000002d46fe
>>>>>> rdi = 0x0000000006133a90  rsi = 0xffffffffdc0003c2  rbp =
>>>>>> 0x00000000ffbf9d40
>>>>>> rsp = 0x00000000ffbf9cc0  rbx = 0x0000000000000190  rdx =
>>>>>> 0x000000003eb08c39
>>>>>> rcx = 0x0000000008ea18b0  rax = 0x0000000008ecff30  cs  =
>>>>>> 0x000000000000001f
>>>>>> R8  = 0x0000000007ad2ab0  R9  = 0xfffffffffffffe0c  R10 =
>>>>>> 0x0000000008e6bd30
>>>>>> R11 = 0x0000000000000262  R12 = 0x0000000000000a8c  R13 =
>>>>>> 0xfffffffff0538770
>>>>>> R14 = 0x00000000fffffe0c  R15 = 0x0000000008ed3dc0
>>>>>> rflg = 0x0000000000010206   prev_sp = 0x00000000ffbf9cc0
>>>>>> error_code = 6
>>>>>>
>>>>>> SIGNAL #[11][Segmentation fault]  fault_address =
>> 0xffffffff78ed4cc8
>>>>>>   0xffbf9cc0  0x        ffbf9cf0 0x             fa0 0x
>>     a00006b6c 0x
>>>>>> a8c3e9ab7ff
>>>>>>   0xffbf9ce0  0x         8ed7c50 0x             7d0 0x
>>             0 0x
>>>>>> 6b6c002d455b
>>>>>>   0xffbf9d00  0x         8ea18b0 0x         8e6bd30 0x
>>       61338a0 0x
>>>>>> fa0
>>>>>>   0xffbf9d20  0x               0 0x         61338a0 0x
>>         8036c 0x
>>>>>> 8ec4390
>>>>>>   0xffbf9d40  0x        ffbf9e80 0x          2d2280 0x
>>       8ecff30 0x
>>>>>> 8036c
>>>>>>   0xffbf9d60  0x             fa0 0x        ffbf9de4 0x
>>      ffbf9de8 0x
>>>>>> ffbf9df0
>>>>>>   0xffbf9d80  0x        ffbf9df8 0x               0 0x
>>             0 0x
>>>>>> 8ebc680
>>>>>>   0xffbf9da0  0x            1770 0x     7d000a39f88 0x
>>             0 0x
>>>>>> 650048174f
>>>>>>   0xffbf9dc0  0x  14fb184c000829 0x         6a93500 0x
>>      ffbf9e30 0x
>>>>>> 292e54
>>>>>>   0xffbf9de0  0x               0 0x         8ed3dc0 0x
>>           7af 0x
>>>>>> 0
>>>>>>   0xffbf9e00  0x               0 0x         8ecc0a0 0x
>>       8ecff30 0x
>>>>>> 8036c
>>>>>>   0xffbf9e20  0x       100000014 0x         8e6bd30 0x
>>       8ea18b0 0x
>>>>>> 1770
>>>>>>   0xffbf9e40  0xffffffff6793163f 0x    6b6c00a39fa8 0x
>>   fa00000000f 0x
>>>>>> 61338a0
>>>>>>   0xffbf9e60  0x        4c000829 0x          14fb18 0x
>>            65 0x
>>>>>> 0
>>>>>>   0xffbf9e80  0x        ffbf9ee0 0x          2a397c 0x
>>        866b60 0x
>>>>>> ffbf9eb0
>>>>>>
>>>>>>
>>>>>> Stack Trace:  ------------------------------
>>>>>> #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>>>>> #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>>>> #2  0x00000000002a397c in MPIOI_File_write_all()
>>>>>> #3  0x00000000002a3a4a in PMPI_File_write_all()
>>>>>> #4  0x00000000002913a8 in pmpi_file_write_all_()
>>>>>> could not find symbol for addr 0x73696e6966204f49
>>>>>> --------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 25 May 2007, Robert Latham wrote:
>>>>>>
>>>>>>> On Fri, May 25, 2007 at 03:56:16PM -0500, Wei-keng Liao wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I have an MPI I/O application that runs fine on up to 1000 processes,
>>>>>>>> but fails when using 4000 processes. Parts of the error message are
>>>>>>>>     ...
>>>>>>>>     Stack Trace:  ------------------------------
>>>>>>>>     #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>>>>>>>     #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>>>>>>     #2  0x00000000002a397c in MPIOI_File_write_all()
>>>>>>>>     #3  0x00000000002a3a4a in PMPI_File_write_all()
>>>>>>>>     #4  0x00000000002913a8 in pmpi_file_write_all_()
>>>>>>>>     could not find symbol for addr 0x73696e6966204f49
>>>>>>>>     aborting job:
>>>>>>>>     application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
>>>>>>>>     ...
>>>>>>>>
>>>>>>>> My question is: what debug flags should I use for compiling and
>>>>>>>> running in order to help find the exact location in function
>>>>>>>> ADIOI_Calc_my_req() that causes this error?
>>>>>>>
>>>>>>>
>>>>>>> Hi Wei-keng
>>>>>>>
>>>>>>> If you build MPICH2 with --enable-g=dbg, then all of MPI will be built
>>>>>>> with debugging symbols.   Be sure to 'make clean' first: the ROMIO
>>>>>>> objects might not rebuild otherwise.
>>>>>>>
>>>>>>> I wonder what caused the abort?  Maybe ADIOI_Malloc failed to allocate
>>>>>>> memory?  Well, a stack trace with debugging symbols should be
>>>>>>> interesting.
>>>>>>>
>>>>>>> ==rob
>>>>>>>
>>>>>>> --
>>>>>>> Rob Latham
>>>>>>> Mathematics and Computer Science Division    A215 0178
>> EA2D B059 8CDF
>>>>>>> Argonne National Lab, IL USA                 B29D F333
>> 664A 4280 315B
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>



