[MPICH] debug flag
Howard Pritchard
howardp at cray.com
Mon May 28 17:53:47 CDT 2007
Hello Wei-keng,
Here is a way on xt/qk systems to compile with the debug mpich2 library:
1) do
module show xt-mpt
to see which mpich2 the system manager has made the default.
For instance, on an internal system here at Cray, this command shows:
-------------------------------------------------------------------
/opt/modulefiles/xt-mpt/1.5.49:
setenv MPT_DIR /opt/xt-mpt/1.5.49
setenv MPICHBASEDIR /opt/xt-mpt/1.5.49/mpich2-64
setenv MPICH_DIR /opt/xt-mpt/1.5.49/mpich2-64/P2
setenv MPICH_DIR_FTN_DEFAULT64 /opt/xt-mpt/1.5.49/mpich2-64/P2W
prepend-path LD_LIBRARY_PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/lib
prepend-path PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/bin
prepend-path MANPATH /opt/xt-mpt/1.5.49/mpich2-64/man
prepend-path MANPATH /opt/xt-mpt/1.5.49/romio/man
prepend-path PE_PRODUCT_LIST MPT
-------------------------------------------------------------------
The debug library you want to use is thus going to be picked up by the
mpicc installed at:
/opt/xt-mpt/1.5.49/mpich2-64/P2DB
2) Now, with the Cray compiler scripts (cc, ftn, etc.), you specify the
alternate location to use for compiling/linking with
cc -driverpath=/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin -o a.out.debug ......
or whichever path is appropriate for the xt-mpt installed on your system.
3) When you rerun the binary, you may want to set the MPICH_DBMASK
environment variable to 0x200.
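
Steps 1) through 3) can be sketched as a small shell helper. This is only
a sketch: the function name debug_driverpath and the source file prog.c
are made up for illustration, and it assumes the debug build always sits
next to the default one with "DB" appended (P2 -> P2DB), as in the 1.5.49
listing above. Check the actual directory layout on your system.

```shell
#!/bin/sh
# Sketch only. Assumption: the debug library directory is the default
# MPICH_DIR with "DB" appended (P2 -> P2DB), as in the xt-mpt 1.5.49 example.
debug_driverpath() {
    # $1: value of MPICH_DIR, e.g. /opt/xt-mpt/1.5.49/mpich2-64/P2
    # Strip one trailing slash, if any, then append DB/bin.
    printf '%s\n' "${1%/}DB/bin"
}

# Step 3: debug mask suggested above, exported for the next run.
MPICH_DBMASK=0x200
export MPICH_DBMASK

# Hypothetical usage on a login node (left commented out on purpose):
#   module show xt-mpt      # step 1: read MPICH_DIR from the output
#   cc -driverpath="$(debug_driverpath "$MPICH_DIR")" -o a.out.debug prog.c
```

Deriving the path from MPICH_DIR rather than typing it in keeps the
compile line valid if the system manager changes the default xt-mpt.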
I am pretty sure you are running out of memory, based on the area in
ADIOI_Calc_my_req where the error arises. Clearly this is not a very
good way to report an out-of-memory condition. I'll investigate.
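
If it helps to narrow down the location yourself once a.out.debug is
built, the frame addresses in the stack trace can be pulled out and handed
to a symbolizer. A minimal sketch follows; the function name frame_addrs
and the file name trace.txt are hypothetical, and it assumes the trace has
the "#N 0x... in func()" layout shown in your message.

```shell
#!/bin/sh
# Sketch only: extract frame addresses from a saved stack trace.
frame_addrs() {
    # Reads trace text on stdin and prints one address per frame line of
    # the form "#N 0x<hex> in func()". Leading "> " quote marks and spaces
    # are tolerated; every other line is dropped.
    sed -n 's/^[> ]*#[0-9][0-9]* *\(0x[0-9a-fA-F][0-9a-fA-F]*\) in .*/\1/p'
}

# Hypothetical usage, assuming GNU binutils addr2line is installed and the
# binary was linked against a library built with debugging symbols:
#   frame_addrs < trace.txt | addr2line -f -e a.out.debug
```

addr2line reads addresses from standard input when none are given on the
command line, so the two commands can be piped together directly. The
addresses are only meaningful for the binary that produced the trace, so
rerun the debug binary and use the trace from that run.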
You may be able to save some memory by tweaking the environment
variables controlling mpi buffer space. Refer to the intro_mpi man page
on your xt/qk system.
Hope this helps,
Howard
Wei-keng Liao wrote:
>
> Well, I am aware of mpich2version, but unfortunately that command is
> not available to users on that machine. The only commands available to
> me are mpicc, mpif77, mpif90, and mpicxx.
>
> Wei-keng
>
>
> On Fri, 25 May 2007, Anthony Chan wrote:
>
>>
>> <mpich2-install-dir>/bin/mpich2version may show if --enable-g is set.
>>
>> A.Chan
>>
>> On Fri, 25 May 2007, Wei-keng Liao wrote:
>>
>>>
>>> The problem is that I cannot run my own MPICH on the machine. I can see
>>> from peeking at the mpif90 script that the MPICH I am using is version
>>> 2-1.0.2. Is there a way to tell from the mpif90 script whether it was
>>> built with the --enable-g=dbg option?
>>>
>>> I don't know if this helps, but below is the whole error message:
>>>
>>> aborting job:
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process <id>
>>> (there are 4000 lines, each with a distinct id number)
>>>
>>> ----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
>>>
>>> PROCESSOR [ 0]
>>> log_nid = 15 phys_nid = 0x98 host_id = 7691 host_pid = 18545
>>> group_id = 12003 num_procs = 4000 rank = 15 local_pid = 3
>>> base_node_index = 0 last_node_index = 1999
>>>
>>> text_base = 0x00000000200000 text_len = 0x00000000400000
>>> data_base = 0x00000000600000 data_len = 0x00000000a00000
>>> stack_base = 0x000000fec00000 stack_len = 0x00000001000000
>>> heap_base = 0x00000001200000 heap_len = 0x0000007b000000
>>>
>>> ss = 0x000000000000001f fs = 000000000000000000 gs = 0x0000000000000017
>>> rip = 0x00000000002d46fe
>>> rdi = 0x0000000006133a90 rsi = 0xffffffffdc0003c2 rbp = 0x00000000ffbf9d40
>>> rsp = 0x00000000ffbf9cc0 rbx = 0x0000000000000190 rdx = 0x000000003eb08c39
>>> rcx = 0x0000000008ea18b0 rax = 0x0000000008ecff30 cs = 0x000000000000001f
>>> R8 = 0x0000000007ad2ab0 R9 = 0xfffffffffffffe0c R10 = 0x0000000008e6bd30
>>> R11 = 0x0000000000000262 R12 = 0x0000000000000a8c R13 = 0xfffffffff0538770
>>> R14 = 0x00000000fffffe0c R15 = 0x0000000008ed3dc0
>>> rflg = 0x0000000000010206 prev_sp = 0x00000000ffbf9cc0
>>> error_code = 6
>>>
>>> SIGNAL #[11][Segmentation fault] fault_address = 0xffffffff78ed4cc8
>>> 0xffbf9cc0 0x ffbf9cf0 0x fa0 0x a00006b6c 0x a8c3e9ab7ff
>>> 0xffbf9ce0 0x 8ed7c50 0x 7d0 0x 0 0x 6b6c002d455b
>>> 0xffbf9d00 0x 8ea18b0 0x 8e6bd30 0x 61338a0 0x fa0
>>> 0xffbf9d20 0x 0 0x 61338a0 0x 8036c 0x 8ec4390
>>> 0xffbf9d40 0x ffbf9e80 0x 2d2280 0x 8ecff30 0x 8036c
>>> 0xffbf9d60 0x fa0 0x ffbf9de4 0x ffbf9de8 0x ffbf9df0
>>> 0xffbf9d80 0x ffbf9df8 0x 0 0x 0 0x 8ebc680
>>> 0xffbf9da0 0x 1770 0x 7d000a39f88 0x 0 0x 650048174f
>>> 0xffbf9dc0 0x 14fb184c000829 0x 6a93500 0x ffbf9e30 0x 292e54
>>> 0xffbf9de0 0x 0 0x 8ed3dc0 0x 7af 0x 0
>>> 0xffbf9e00 0x 0 0x 8ecc0a0 0x 8ecff30 0x 8036c
>>> 0xffbf9e20 0x 100000014 0x 8e6bd30 0x 8ea18b0 0x 1770
>>> 0xffbf9e40 0xffffffff6793163f 0x 6b6c00a39fa8 0x fa00000000f 0x 61338a0
>>> 0xffbf9e60 0x 4c000829 0x 14fb18 0x 65 0x 0
>>> 0xffbf9e80 0x ffbf9ee0 0x 2a397c 0x 866b60 0x ffbf9eb0
>>>
>>>
>>> Stack Trace: ------------------------------
>>> #0 0x00000000002d46fe in ADIOI_Calc_my_req()
>>> #1 0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>> #2 0x00000000002a397c in MPIOI_File_write_all()
>>> #3 0x00000000002a3a4a in PMPI_File_write_all()
>>> #4 0x00000000002913a8 in pmpi_file_write_all_()
>>> could not find symbol for addr 0x73696e6966204f49
>>> --------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 25 May 2007, Robert Latham wrote:
>>>
>>>> On Fri, May 25, 2007 at 03:56:16PM -0500, Wei-keng Liao wrote:
>>>>
>>>>>
>>>>> I have an MPI I/O application that runs fine with up to 1000 processes
>>>>> but fails when using 4000 processes. Parts of the error message are
>>>>> ...
>>>>> Stack Trace: ------------------------------
>>>>> #0 0x00000000002d46fe in ADIOI_Calc_my_req()
>>>>> #1 0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
>>>>> #2 0x00000000002a397c in MPIOI_File_write_all()
>>>>> #3 0x00000000002a3a4a in PMPI_File_write_all()
>>>>> #4 0x00000000002913a8 in pmpi_file_write_all_()
>>>>> could not find symbol for addr 0x73696e6966204f49
>>>>> aborting job:
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
>>>>> ...
>>>>>
>>>>> My question is: what debug flags should I use for compiling and
>>>>> running in order to help find the exact location in function
>>>>> ADIOI_Calc_my_req() that causes this error?
>>>>
>>>>
>>>> Hi Wei-keng
>>>>
>>>> If you build MPICH2 with --enable-g=dbg, then all of MPI will be built
>>>> with debugging symbols. Be sure to 'make clean' first: the ROMIO
>>>> objects might not rebuild otherwise.
>>>>
>>>> I wonder what caused the abort. Maybe ADIOI_Malloc failed to allocate
>>>> memory? Well, a stack trace with debugging symbols should be
>>>> interesting.
>>>>
>>>> ==rob
>>>>
>>>> --
>>>> Rob Latham
>>>> Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
>>>> Argonne National Lab, IL USA B29D F333 664A 4280 315B
>>>>
>>>
>>>
>>
>