[MPICH] debug flag

Howard Pritchard howardp at cray.com
Mon May 28 17:53:47 CDT 2007


Hello Wei-keng,

Here is a way to compile against the debug MPICH2 library on XT/qk systems:

1) Run

    module show xt-mpt

    to see which MPICH2 the system manager has made the default.

    For instance, on an internal system here at Cray this command shows:

-------------------------------------------------------------------
/opt/modulefiles/xt-mpt/1.5.49:

setenv           MPT_DIR /opt/xt-mpt/1.5.49
setenv           MPICHBASEDIR /opt/xt-mpt/1.5.49/mpich2-64
setenv           MPICH_DIR /opt/xt-mpt/1.5.49/mpich2-64/P2
setenv           MPICH_DIR_FTN_DEFAULT64 /opt/xt-mpt/1.5.49/mpich2-64/P2W
prepend-path     LD_LIBRARY_PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/lib
prepend-path     PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/bin
prepend-path     MANPATH /opt/xt-mpt/1.5.49/mpich2-64/man
prepend-path     MANPATH /opt/xt-mpt/1.5.49/romio/man
prepend-path     PE_PRODUCT_LIST MPT
-------------------------------------------------------------------

The debug library you want is therefore the one picked up by the mpicc
installed under:

/opt/xt-mpt/1.5.49/mpich2-64/P2DB

2) With the Cray compiler scripts (cc, ftn, etc.), specify the alternate
location to use for compiling/linking with -driverpath:

cc -driverpath=/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin -o a.out.debug ......

or whichever path is appropriate for the xt-mpt installed on your system.

3) When you rerun the binary, you may want to set the MPICH_DBMASK
environment variable to 0x200.
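
Steps 1) to 3) can be strung together in a small shell sketch. It assumes
the debug build always sits in a P2DB directory directly under MPICHBASEDIR,
as on the system shown above; check the output of "module show xt-mpt" on
your own machine first. The helper name debug_driverpath and the source file
name app.c are made up for illustration and are not part of xt-mpt.

```shell
# Sketch only: assumes the debug build lives in $MPICHBASEDIR/P2DB,
# as in the xt-mpt/1.5.49 layout shown above.
# debug_driverpath is a hypothetical helper, not something xt-mpt provides.
debug_driverpath() {
    # Use the first argument if given, otherwise fall back to MPICHBASEDIR.
    base="${1:-${MPICHBASEDIR:-}}"
    if [ -z "$base" ]; then
        echo "MPICHBASEDIR is not set; load the xt-mpt module first" >&2
        return 1
    fi
    printf '%s/P2DB/bin\n' "$base"
}

# Typical use (app.c is a placeholder source file):
#   cc -driverpath="$(debug_driverpath)" -o a.out.debug app.c
#   export MPICH_DBMASK=0x200    # step 3, set before rerunning the binary
```

With the module output above, debug_driverpath would print
/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin, which is the path used in step 2.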

I am pretty sure you are running out of memory, based on where in
ADIOI_Calc_my_req the error arises.  Clearly this is not a very
good way to report an out-of-memory condition.  I'll investigate.
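
For a sense of scale, the heap_len field in the register dump quoted below
can be converted to MiB with shell arithmetic. This is only a sketch: it
assumes heap_len is the size of the heap region in bytes, and it says
nothing about how much of that heap was actually in use when the fault
occurred.

```shell
# heap_len as printed in the dump quoted below (assumed to be in bytes)
heap_len=0x0000007b000000
heap_mib=$(( $heap_len / 1024 / 1024 ))
echo "heap_len = ${heap_mib} MiB"    # prints: heap_len = 1968 MiB
```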

You may be able to save some memory by tweaking the environment
variables controlling MPI buffer space.  Refer to the intro_mpi man page
on your XT/qk system.

Hope this helps,

Howard

Wei-keng Liao wrote:

>
> Well, I am aware of mpich2version, but unfortunately that command is
> not available to users on that machine. The only commands available to
> me are mpicc, mpif77, mpif90, and mpicxx.
>
> Wei-keng
>
>
> On Fri, 25 May 2007, Anthony Chan wrote:
>
>>
>> <mpich2-install-dir>/bin/mpich2version may show if --enable-g is set.
>>
>> A.Chan
>>
>> On Fri, 25 May 2007, Wei-keng Liao wrote:
>>
>>>
>>> The problem is that I cannot run my own MPICH on the machine. From
>>> peeking at the mpif90 script, I can see that the MPICH I am using is
>>> version 2-1.0.2. Is there a way to tell from the mpif90 script whether
>>> it was built with the --enable-g=dbg option?
>>>
>>> I don't know if this helps, but below is the whole error message:
>>>
>>> aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process <id>
>>> (there are 4000 lines, each with a distinct id number)
>>>
>>> ----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
>>>
>>> PROCESSOR [ 0]
>>> log_nid  =    15  phys_nid  = 0x98  host_id =   7691  host_pid  = 18545
>>> group_id = 12003  num_procs = 4000  rank    =     15  local_pid =    3
>>> base_node_index =    0   last_node_index = 1999
>>>
>>> text_base  = 0x00000000200000   text_len  = 0x00000000400000
>>> data_base  = 0x00000000600000   data_len  = 0x00000000a00000
>>> stack_base = 0x000000fec00000   stack_len = 0x00000001000000
>>> heap_base  = 0x00000001200000   heap_len  = 0x0000007b000000
>>>
>>> ss  = 0x000000000000001f  fs  = 000000000000000000  gs  = 0x0000000000000017
>>> rip = 0x00000000002d46fe
>>> rdi = 0x0000000006133a90  rsi = 0xffffffffdc0003c2  rbp = 0x00000000ffbf9d40
>>> rsp = 0x00000000ffbf9cc0  rbx = 0x0000000000000190  rdx = 0x000000003eb08c39
>>> rcx = 0x0000000008ea18b0  rax = 0x0000000008ecff30  cs  = 0x000000000000001f
>>> R8  = 0x0000000007ad2ab0  R9  = 0xfffffffffffffe0c  R10 = 0x0000000008e6bd30
>>> R11 = 0x0000000000000262  R12 = 0x0000000000000a8c  R13 = 0xfffffffff0538770
>>> R14 = 0x00000000fffffe0c  R15 = 0x0000000008ed3dc0
>>> rflg = 0x0000000000010206   prev_sp = 0x00000000ffbf9cc0
>>> error_code = 6
>>>
>>> SIGNAL #[11][Segmentation fault]  fault_address = 0xffffffff78ed4cc8
>>>   0xffbf9cc0  0x        ffbf9cf0 0x             fa0 0x       a00006b6c 0x     a8c3e9ab7ff
>>>   0xffbf9ce0  0x         8ed7c50 0x             7d0 0x               0 0x    6b6c002d455b
>>>   0xffbf9d00  0x         8ea18b0 0x         8e6bd30 0x         61338a0 0x             fa0
>>>   0xffbf9d20  0x               0 0x         61338a0 0x           8036c 0x         8ec4390
>>>   0xffbf9d40  0x        ffbf9e80 0x          2d2280 0x         8ecff30 0x           8036c
>>>   0xffbf9d60  0x             fa0 0x        ffbf9de4 0x        ffbf9de8 0x        ffbf9df0
>>>   0xffbf9d80  0x        ffbf9df8 0x               0 0x               0 0x         8ebc680
>>>   0xffbf9da0  0x            1770 0x     7d000a39f88 0x               0 0x      650048174f
>>>   0xffbf9dc0  0x  14fb184c000829 0x         6a93500 0x        ffbf9e30 0x          292e54
>>>   0xffbf9de0  0x               0 0x         8ed3dc0 0x             7af 0x               0
>>>   0xffbf9e00  0x               0 0x         8ecc0a0 0x         8ecff30 0x           8036c
>>>   0xffbf9e20  0x       100000014 0x         8e6bd30 0x         8ea18b0 0x            1770
>>>   0xffbf9e40  0xffffffff6793163f 0x    6b6c00a39fa8 0x     fa00000000f 0x         61338a0
>>>   0xffbf9e60  0x        4c000829 0x          14fb18 0x              65 0x               0
>>>   0xffbf9e80  0x        ffbf9ee0 0x          2a397c 0x          866b60 0x        ffbf9eb0
>>>
>>>
>>> Stack Trace:  ------------------------------
>>> #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>> #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>> #2  0x00000000002a397c in MPIOI_File_write_all()
>>> #3  0x00000000002a3a4a in PMPI_File_write_all()
>>> #4  0x00000000002913a8 in pmpi_file_write_all_()
>>> could not find symbol for addr 0x73696e6966204f49
>>> --------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 25 May 2007, Robert Latham wrote:
>>>
>>>> On Fri, May 25, 2007 at 03:56:16PM -0500, Wei-keng Liao wrote:
>>>>
>>>>>
>>>>> I have an MPI I/O application that runs fine with up to 1000
>>>>> processes but fails when using 4000 processes. Parts of the error
>>>>> message are
>>>>>     ...
>>>>>     Stack Trace:  ------------------------------
>>>>>     #0  0x00000000002d46fe in ADIOI_Calc_my_req()
>>>>>     #1  0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>>>     #2  0x00000000002a397c in MPIOI_File_write_all()
>>>>>     #3  0x00000000002a3a4a in PMPI_File_write_all()
>>>>>     #4  0x00000000002913a8 in pmpi_file_write_all_()
>>>>>     could not find symbol for addr 0x73696e6966204f49
>>>>>     aborting job:
>>>>>     application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
>>>>>     ...
>>>>>
>>>>> My question is: what debug flags should I use for compiling and
>>>>> running in order to help find the exact location in function
>>>>> ADIOI_Calc_my_req() that causes this error?
>>>>
>>>>
>>>> Hi Wei-keng
>>>>
>>>> If you build MPICH2 with --enable-g=dbg, then all of MPI will be built
>>>> with debugging symbols.   Be sure to 'make clean' first: the ROMIO
>>>> objects might not rebuild otherwise.
>>>>
>>>> I wonder what caused the abort?  maybe ADIOI_Malloc failed to allocate
>>>> memory?  Well, a stack trace with debugging symbols should be
>>>> interesting.
>>>>
>>>> ==rob
>>>>
>>>> -- 
>>>> Rob Latham
>>>> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
>>>> Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
>>>>
>>>
>>>
>>
>



