[MPICH] debug flag
Rajeev Thakur
thakur at mcs.anl.gov
Wed May 30 13:05:03 CDT 2007
> I have written a short C code for this I/O pattern. ...
> Let me know if you would like a copy of it.
Of course!
> -----Original Message-----
> From: Wei-keng Liao [mailto:wkliao at ece.northwestern.edu]
> Sent: Wednesday, May 30, 2007 11:49 AM
> To: Rajeev Thakur
> Cc: mpich-discuss at mcs.anl.gov
> Subject: RE: [MPICH] debug flag
>
>
> I just got the results by disabling aggregation. The coredump was
> generated by rank 784 (out of 4000) and indicates the following info.
>
> ad_aggregate.c:242
> proc = -603978814 <-- !?
> off = -166212992 <-- !?
> min_st_offset = 0
> fd_len = 400
> fd_size = 262582 <-- should be 11000000
>
> going up one level to ad_write_coll.c:170,
> below are some of the variables set by
> ADIOI_Calc_my_off_len() at line 101:
> count = 1375000
> offset = 0
> start_offset = 407601600
> end_offset = -2149678961 <-- should be 839993999
> contig_access_count = 27500
>
> I suspect the file type is not flattened correctly.
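>
> If I assume ADIOI_Calc_file_domains() sizes each file domain as roughly
> (max end_offset - min st_offset + 1) / number of aggregators (my reading
> of the code, not verified), then the bad fd_size values from both runs
> imply the same corrupted aggregate extent:
>
>     #include <stdio.h>
>     int main(void)
>     {
>         /* extent implied by the two bad fd_size values */
>         long long bad_extent = 1050328000LL;
>         printf("%lld\n", bad_extent / 2000);  /* 525164 (cb_nodes = 2000 run) */
>         printf("%lld\n", bad_extent / 4000);  /* 262582 (this run, 4000)      */
>         return 0;
>     }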
>
> I have written a short C code for this I/O pattern. I ran it on 4000
> processes and it produced the same error. On 2000 processes, it ran
> fine, just like my program. Let me know if you would like a
> copy of it.
>
> Wei-keng
>
>
> On Tue, 29 May 2007, Rajeev Thakur wrote:
>
> > Can you try disabling aggregation and see if the error still remains?
> > You can disable it by creating an info object as follows and passing
> > it to MPI_File_set_view:
> >     MPI_Info_set(info, "cb_config_list", "*:*");
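> >
> > A minimal sketch of the call sequence (fh, disp, etype, and filetype
> > are whatever your program already uses; "native" is just an example
> > data representation):
> >
> >     #include <mpi.h>
> >     /* Re-set the file view with a hint that makes every process an
> >        aggregator, effectively disabling two-phase aggregation. */
> >     static int set_view_no_aggregation(MPI_File fh, MPI_Offset disp,
> >                                        MPI_Datatype etype,
> >                                        MPI_Datatype filetype)
> >     {
> >         MPI_Info info;
> >         int err;
> >         MPI_Info_create(&info);
> >         MPI_Info_set(info, "cb_config_list", "*:*");
> >         err = MPI_File_set_view(fh, disp, etype, filetype, "native", info);
> >         MPI_Info_free(&info);
> >         return err;
> >     }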
> >
> > Rajeev
> >
> >> -----Original Message-----
> >> From: owner-mpich-discuss at mcs.anl.gov
> >> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
> >> Sent: Tuesday, May 29, 2007 1:33 AM
> >> To: Howard Pritchard
> >> Cc: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [MPICH] debug flag
> >>
> >> Howard,
> >>
> >> Thanks for this information. It is very helpful. I was able to find
> >> more details by using the debug build of MPICH. Below is what I found
> >> in the coredump that may help in debugging the ROMIO source.
> >>
> >> 1. The coredump is from MPI rank 2919. I allocated 4000 MPI processes
> >>    (2000 nodes, each node has 2 CPUs). I am checking the mpich2-1.0.2
> >>    source.
> >>
> >> 2. MPI_Abort() is called at line 97 by function ADIOI_Calc_aggregator()
> >>    in file ad_aggregate.c, where
> >>       rank_index = 5335, fd->hints->cb_nodes = 2000, off = 2802007600,
> >>       min_off = 0, fd_size = 525164 (fd_size should be 11000000).
> >>    Note that 2802007600 / 525164 = 5335, so the bogus fd_size directly
> >>    produces the out-of-range rank_index (see the sketch after this list).
> >>
> >> 3. It is function ADIOI_Calc_my_req() that called ADIOI_Calc_aggregator()
> >>    at line 240 of file ad_aggregate.c, where
> >>       i = 0 in the loop for (i=0; i < contig_access_count; i++)
> >>       off = 2802007600, min_st_offset = 0, fd_len = 400, fd_size = 525164
> >>       (fd_size should be 11000000)
> >>
> >> 4. It is function ADIOI_GEN_WriteStridedColl() that called
> >>    ADIOI_Calc_my_req() at line 170 of file ad_write_coll.c.
> >>    I wanted to see what went wrong with fd_size, so I looked at function
> >>    ADIOI_Calc_file_domains(), where fd_size is set, and saw that fd_size
> >>    is determined by st_offsets[] and end_offsets[], which depend on the
> >>    variables start_offset and end_offset.
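> >>
> >> To double-check the abort, here is a small sketch with the coredump
> >> values plugged in; this is my understanding of the check near
> >> ad_aggregate.c line 97, not the literal source:
> >>
> >>     #include <stdio.h>
> >>     int main(void)
> >>     {
> >>         long long off = 2802007600LL, min_off = 0, fd_size = 525164;
> >>         int cb_nodes = 2000;
> >>         /* which file domain (and hence which aggregator) holds off? */
> >>         int rank_index = (int)((off - min_off) / fd_size);
> >>         /* prints 5335, which is >= cb_nodes = 2000, hence the MPI_Abort */
> >>         printf("rank_index = %d, cb_nodes = %d\n", rank_index, cb_nodes);
> >>         return 0;
> >>     }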
> >>
> >> So, I went a few lines up and checked the values of the variables
> >> start_offset and end_offset. They were set by ADIOI_Calc_my_off_len()
> >> at line 101, and I found that the value of end_offset must be wrong!
> >> end_offset should always be >= start_offset, but the core shows that
> >>     start_offset = 2802007600, end_offset = 244727039
> >>
> >> So, I looked into ADIOI_Calc_my_off_len() in ad_read_coll.c and checked
> >> the variable end_offset_ptr, which was set from the variable end_offset
> >> at line 453, since filetype_size > 0 and filetype_is_contig == 0.
> >> Hence, the only place end_offset is set is at line 420:
> >>     end_offset = off + frd_size - 1;
> >> end_offset is determined by off and frd_size. However, frd_size is
> >> declared as an int, while end_offset is an ADIO_Offset. Maybe it is a
> >> type overflow! At line 351, I can see a type cast:
> >>     frd_size = (int) (disp + flat_file->indices[i] + ...
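> >>
> >> A minimal demonstration of that truncation (the expression is
> >> simplified, not the actual line 351; the wraparound is what these
> >> two's-complement systems do):
> >>
> >>     #include <stdio.h>
> >>     typedef long long ADIO_Offset;       /* 64-bit file offset */
> >>     int main(void)
> >>     {
> >>         ADIO_Offset big = 2802007600LL;  /* stands in for disp + indices[i] + ... */
> >>         int frd_size = (int) big;        /* exceeds 2^31 - 1, wraps negative      */
> >>         ADIO_Offset off = 2802007600LL;
> >>         ADIO_Offset end_offset = off + frd_size - 1;
> >>         /* prints frd_size = -1492959696, end_offset = 1309047903 */
> >>         printf("frd_size = %d, end_offset = %lld\n", frd_size, end_offset);
> >>         return 0;
> >>     }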
> >>
> >> Something is fishy here. Unfortunately, the coredump does not cover
> >> this code. It looks like interactive debugging with a breakpoint
> >> cannot be avoided.
> >>
> >> Wei-keng
> >>
> >>
> >> On Mon, 28 May 2007, Howard Pritchard wrote:
> >>
> >>> Hello Wei-keng,
> >>>
> >>> Here is a way on xt/qk systems to compile with the debug mpich2
> >>> library:
> >>>
> >>> 1) do
> >>> module show xt-mpt
> >>>
> >>> to see which mpich2 the system manager has made the default.
> >>>
> >>> For instance, on an internal system here at cray this command shows:
> >>>
> >>>
> >>> -------------------------------------------------------------------
> >>> /opt/modulefiles/xt-mpt/1.5.49:
> >>>
> >>> setenv MPT_DIR /opt/xt-mpt/1.5.49
> >>> setenv MPICHBASEDIR /opt/xt-mpt/1.5.49/mpich2-64
> >>> setenv MPICH_DIR /opt/xt-mpt/1.5.49/mpich2-64/P2
> >>> setenv MPICH_DIR_FTN_DEFAULT64 /opt/xt-mpt/1.5.49/mpich2-64/P2W
> >>> prepend-path LD_LIBRARY_PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/lib
> >>> prepend-path PATH /opt/xt-mpt/1.5.49/mpich2-64/P2/bin
> >>> prepend-path MANPATH /opt/xt-mpt/1.5.49/mpich2-64/man
> >>> prepend-path MANPATH /opt/xt-mpt/1.5.49/romio/man
> >>> prepend-path PE_PRODUCT_LIST MPT
> >>>
> >>> -------------------------------------------------------------------
> >>>
> >>> The debug library you want to use is thus going to be picked up by
> >>> the mpicc installed at:
> >>>
> >>> /opt/xt-mpt/1.5.49/mpich2-64/P2DB
> >>>
> >>> 2) Now with the cray compiler scripts like cc, ftn, etc. you
> >>> specify the alternate location to use for compiling/linking by
> >>>
> >>>    cc -driverpath=/opt/xt-mpt/1.5.49/mpich2-64/P2DB/bin -o a.out.debug ......
> >>>
> >>> or whichever path is appropriate for the xt-mpt installed on your
> >>> system.
> >>>
> >>> 3) When you rerun the binary, you may want to set the MPICH_DBMASK
> >>> environment variable to 0x200.
> >>>
> >>> I am pretty sure you are running out of memory, based on the area in
> >>> ADIOI_Calc_my_req where the error arises. Clearly this is not a very
> >>> good way to report an OOM condition. I'll investigate.
> >>>
> >>> You may be able to save some memory by tweaking the environment
> >>> variables controlling mpi buffer space. Refer to the intro_mpi man
> >>> page on your xt/qk system.
> >>>
> >>> Hope this helps,
> >>>
> >>> Howard
> >>>
> >>> Wei-keng Liao wrote:
> >>>
> >>>>
> >>>> Well, I am aware of mpich2version, but unfortunately that command is
> >>>> not available to users on that machine. The only commands available
> >>>> to me are mpicc, mpif77, mpif90, and mpicxx.
> >>>>
> >>>> Wei-keng
> >>>>
> >>>>
> >>>> On Fri, 25 May 2007, Anthony Chan wrote:
> >>>>
> >>>>>
> >>>>> <mpich2-install-dir>/bin/mpich2version may show if --enable-g is set.
> >>>>>
> >>>>> A.Chan
> >>>>>
> >>>>> On Fri, 25 May 2007, Wei-keng Liao wrote:
> >>>>>
> >>>>>>
> >>>>>> The problem is that I cannot run my own MPICH on the machine. I can
> >>>>>> see that the MPICH I am using is mpich2-1.0.2 from peeking at the
> >>>>>> mpif90 script. Is there a way to know from the mpif90 script whether
> >>>>>> it was built using the --enable-g=dbg option?
> >>>>>>
> >>>>>> I don't know if this helps, but below is the whole error message:
> >>>>>>
> >>>>>> aborting job:
> >>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process <id>
> >>>>>> (there are 4000 lines, each with a distinct id number)
> >>>>>>
> >>>>>> ----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
> >>>>>>
> >>>>>> PROCESSOR [ 0]
> >>>>>> log_nid = 15  phys_nid = 0x98  host_id = 7691  host_pid = 18545
> >>>>>> group_id = 12003  num_procs = 4000  rank = 15  local_pid = 3
> >>>>>> base_node_index = 0 last_node_index = 1999
> >>>>>>
> >>>>>> text_base = 0x00000000200000 text_len = 0x00000000400000
> >>>>>> data_base = 0x00000000600000 data_len = 0x00000000a00000
> >>>>>> stack_base = 0x000000fec00000 stack_len = 0x00000001000000
> >>>>>> heap_base = 0x00000001200000 heap_len = 0x0000007b000000
> >>>>>>
> >>>>>> ss  = 0x000000000000001f  fs  = 000000000000000000  gs  = 0x0000000000000017
> >>>>>> rip = 0x00000000002d46fe
> >>>>>> rdi = 0x0000000006133a90  rsi = 0xffffffffdc0003c2  rbp = 0x00000000ffbf9d40
> >>>>>> rsp = 0x00000000ffbf9cc0  rbx = 0x0000000000000190  rdx = 0x000000003eb08c39
> >>>>>> rcx = 0x0000000008ea18b0  rax = 0x0000000008ecff30  cs  = 0x000000000000001f
> >>>>>> R8  = 0x0000000007ad2ab0  R9  = 0xfffffffffffffe0c  R10 = 0x0000000008e6bd30
> >>>>>> R11 = 0x0000000000000262  R12 = 0x0000000000000a8c  R13 = 0xfffffffff0538770
> >>>>>> R14 = 0x00000000fffffe0c  R15 = 0x0000000008ed3dc0
> >>>>>> rflg = 0x0000000000010206  prev_sp = 0x00000000ffbf9cc0  error_code = 6
> >>>>>>
> >>>>>> SIGNAL #[11][Segmentation fault]  fault_address = 0xffffffff78ed4cc8
> >>>>>> 0xffbf9cc0  0xffbf9cf0          0xfa0            0xa00006b6c    0xa8c3e9ab7ff
> >>>>>> 0xffbf9ce0  0x8ed7c50           0x7d0            0x0            0x6b6c002d455b
> >>>>>> 0xffbf9d00  0x8ea18b0           0x8e6bd30        0x61338a0      0xfa0
> >>>>>> 0xffbf9d20  0x0                 0x61338a0        0x8036c        0x8ec4390
> >>>>>> 0xffbf9d40  0xffbf9e80          0x2d2280         0x8ecff30      0x8036c
> >>>>>> 0xffbf9d60  0xfa0               0xffbf9de4       0xffbf9de8     0xffbf9df0
> >>>>>> 0xffbf9d80  0xffbf9df8          0x0              0x0            0x8ebc680
> >>>>>> 0xffbf9da0  0x1770              0x7d000a39f88    0x0            0x650048174f
> >>>>>> 0xffbf9dc0  0x14fb184c000829    0x6a93500        0xffbf9e30     0x292e54
> >>>>>> 0xffbf9de0  0x0                 0x8ed3dc0        0x7af          0x0
> >>>>>> 0xffbf9e00  0x0                 0x8ecc0a0        0x8ecff30      0x8036c
> >>>>>> 0xffbf9e20  0x100000014         0x8e6bd30        0x8ea18b0      0x1770
> >>>>>> 0xffbf9e40  0xffffffff6793163f  0x6b6c00a39fa8   0xfa00000000f  0x61338a0
> >>>>>> 0xffbf9e60  0x4c000829          0x14fb18         0x65           0x0
> >>>>>> 0xffbf9e80  0xffbf9ee0          0x2a397c         0x866b60       0xffbf9eb0
> >>>>>>
> >>>>>>
> >>>>>> Stack Trace: ------------------------------
> >>>>>> #0 0x00000000002d46fe in ADIOI_Calc_my_req()
> >>>>>> #1 0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
> >>>>>> #2 0x00000000002a397c in MPIOI_File_write_all()
> >>>>>> #3 0x00000000002a3a4a in PMPI_File_write_all()
> >>>>>> #4 0x00000000002913a8 in pmpi_file_write_all_()
> >>>>>> could not find symbol for addr 0x73696e6966204f49
> >>>>>> --------------------------------------------
> >>>>>>
> >>>>>> On Fri, 25 May 2007, Robert Latham wrote:
> >>>>>>
> >>>>>>> On Fri, May 25, 2007 at 03:56:16PM -0500, Wei-keng Liao wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I have an MPI I/O application that runs fine up to 1000 processes,
> >>>>>>>> but fails when using 4000 processes. Parts of the error message are:
> >>>>>>>> ...
> >>>>>>>> Stack Trace: ------------------------------
> >>>>>>>> #0 0x00000000002d46fe in ADIOI_Calc_my_req()
> >>>>>>>> #1 0x00000000002d2280 in ADIOI_GEN_WriteStridedColl()
> >>>>>>>> #2 0x00000000002a397c in MPIOI_File_write_all()
> >>>>>>>> #3 0x00000000002a3a4a in PMPI_File_write_all()
> >>>>>>>> #4 0x00000000002913a8 in pmpi_file_write_all_()
> >>>>>>>> could not find symbol for addr 0x73696e6966204f49
> >>>>>>>> aborting job:
> >>>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1456
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> My question is: what debug flags should I use for compiling and
> >>>>>>>> running in order to help find the exact location in function
> >>>>>>>> ADIOI_Calc_my_req() that causes this error?
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Wei-keng
> >>>>>>>
> >>>>>>> If you build MPICH2 with --enable-g=dbg, then all of MPI will be
> >>>>>>> built with debugging symbols. Be sure to 'make clean' first: the
> >>>>>>> ROMIO objects might not rebuild otherwise.
> >>>>>>>
> >>>>>>> I wonder what caused the abort? Maybe ADIOI_Malloc failed to
> >>>>>>> allocate memory? Well, a stack trace with debugging symbols should
> >>>>>>> be interesting.
> >>>>>>>
> >>>>>>> ==rob
> >>>>>>>
> >>>>>>> --
> >>>>>>> Rob Latham
> >>>>>>> Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> >>>>>>> Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>