[MPICH] non-blocking sending/receiving an array

Manal Helal manalorama at gmail.com
Tue Jun 12 20:19:05 CDT 2007


Hi

I am still having this packed send/receive problem, but only intermittently:
sometimes it fails and other times it works fine. Lately it works only if I
launch the program with the following command:

mpirun -np 4 valgrind --leak-check=full -v --log-file=val3.out myprog myprogarguments

When I run under valgrind everything is fine. I think the problem is pointers
being shifted while receiving the packed array, whether blocking or
non-blocking (MPI_Recv or MPI_Irecv). I will need to run on a high-performance
machine where I won't be able to use valgrind, so I need to make sure the
program is stable and can run on large data sizes without problems.


Each process in my program is multi-threaded, but I also tried running
everything sequentially within the process (no threads) and the problem is
still the same, so it is not about thread safety or synchronization.

I am copying the gcc list; maybe I can get some insight into the problem, and
also some safer alternatives to ANSI C atoi and sprintf, because some of the
valgrind errors are caused by sprintf and so far I couldn't find a safe
alternative. The way I use sprintf now is, for example:

#define SHORT_MESSAGE_SIZE 200
char msg[SHORT_MESSAGE_SIZE];
sprintf (msg, "%ld: add OC w %ld, pi %ld, ci %ld, cs %ld, dp %d af %d ",
OCout_ub, waveNo, partIndex, cellIndex,cellScore, depProc, addflag);

I then print msg to a debugging file corresponding to the process and the
thread it came from.

The valgrind output is shown below, if you are interested in having a look.
Most of the reports are problems in the MPI library implementation rather than
in my code; however, neither kind of problem seems to explain all this
memory shifting.



==4138== Memcheck, a memory error detector.
==4138== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==4138== Using LibVEX rev 1732, a library for dynamic binary translation.
==4138== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==4138== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==4138== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==4138==
--4138-- Startup, with flags:
--4138--    --leak-check=full
--4138--    -v
--4138--    --log-file=val3.out
--4138-- Contents of /proc/version:
--4138--   Linux version 2.6.21-1.3194.fc7 (kojibuilder@xenbuilder4.fedora.phx.redhat.com) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Wed May 23 22:35:01 EDT 2007
--4138-- Arch and hwcaps: X86, x86-sse1-sse2
--4138-- Page sizes: currently 4096, max supported 4096
--4138-- Valgrind library directory: /usr/lib/valgrind
--4138-- Reading syms from /home/mhelal/thesis/exp/ver2.1/mmDst (0x8048000)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/memcheck (0x38000000)
--4138--    object doesn't have a dynamic symbol table
--4138-- Reading syms from /lib/ld-2.6.so (0x46C44000)
--4138-- Reading suppressions file: /usr/lib/valgrind/default.supp
--4138-- REDIR: 0x46C596F0 (index) redirected to 0x38027EDF (vgPlain_x86_linux_REDIR_FOR_index)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/vgpreload_core.so (0x4001000)
--4138-- Reading syms from /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so (0x4003000)
==4138== WARNING: new redirection conflicts with existing -- ignoring it
--4138--     new: 0x46C596F0 (index     ) R-> 0x040061F0 index
--4138-- REDIR: 0x46C59890 (strlen) redirected to 0x40062A0 (strlen)
--4138-- Reading syms from /lib/libm-2.6.so (0x4776B000)
--4138-- Reading syms from /lib/libpthread-2.6.so (0x479B7000)
--4138-- Reading syms from /home/mhelal/Install/mpi/lib/libmpich.so (0x4017000)
--4138-- Reading syms from /lib/librt-2.6.so (0x46CC5000)
--4138-- Reading syms from /lib/libc-2.6.so (0x47615000)
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F01B: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4F4F0: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C478D8: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBDB: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4EBE3: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
==4138==
==4138== Conditional jump or move depends on uninitialised value(s)
==4138==    at 0x46C4ED25: _dl_relocate_object (in /lib/ld-2.6.so)
==4138==    by 0x46C47A84: dl_main (in /lib/ld-2.6.so)
==4138==    by 0x46C57F6A: _dl_sysdep_start (in /lib/ld-2.6.so)
==4138==    by 0x46C452B7: _dl_start (in /lib/ld-2.6.so)
==4138==    by 0x46C44816: (within /lib/ld-2.6.so)
--4138-- REDIR: 0x47684810 (memset) redirected to 0x4006600 (memset)
--4138-- REDIR: 0x47684D00 (memcpy) redirected to 0x4007030 (memcpy)
--4138-- REDIR: 0x47683930 (rindex) redirected to 0x40060D0 (rindex)
--4138-- REDIR: 0x4767EC90 (calloc) redirected to 0x400478D (calloc)
--4138-- REDIR: 0x47683590 (strlen) redirected to 0x4006280 (strlen)
--4138-- REDIR: 0x47683780 (strncmp) redirected to 0x40062E0 (strncmp)
--4138-- REDIR: 0x4767EF90 (malloc) redirected to 0x4005460 (malloc)
--4138-- REDIR: 0x476804F0 (free) redirected to 0x400507A (free)
--4138-- REDIR: 0x47684310 (memchr) redirected to 0x4006470 (memchr)
--4138-- REDIR: 0x47683880 (strncpy) redirected to 0x40068D0 (strncpy)
--4138-- REDIR: 0x47682EC0 (index) redirected to 0x40061C0 (index)
--4138-- REDIR: 0x476830A0 (strcpy) redirected to 0x4007290 (strcpy)
--4138-- REDIR: 0x47684870 (mempcpy) redirected to 0x4006B10 (mempcpy)
--4138-- REDIR: 0x47683030 (strcmp) redirected to 0x4006350 (strcmp)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41056E8: MPIDU_Socki_handle_write (sock_wait.i:689)
==4138==    by 0x41044E3: MPIDU_Sock_wait (sock_wait.i:329)
==4138==    by 0x406E66E: MPIDI_CH3_Progress_wait (ch3_progress.c:189)
==4138==    by 0x40B52FF: MPIC_Wait (helper_fns.c:275)
==4138==    by 0x40B4C0B: MPIC_Sendrecv (helper_fns.c:121)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0x41922E0 is 32 bytes inside a block of size 72 alloc'd
==4138==    at 0x40054E5: malloc (vg_replace_malloc.c:149)
==4138==    by 0x4071262: MPIDI_CH3I_Connection_alloc (ch3u_connect_sock.c:125)
==4138==    by 0x4073080: MPIDI_CH3I_VC_post_sockconnect (ch3u_connect_sock.c:1023)
==4138==    by 0x406F8C4: MPIDI_CH3I_VC_post_connect (ch3_progress.c:857)
==4138==    by 0x406D5E2: MPIDI_CH3_iSendv (ch3_isendv.c:194)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==
==4138== Syscall param writev(vector[...]) points to uninitialised byte(s)
==4138==    at 0x476DE118: writev (in /lib/libc-2.6.so)
==4138==    by 0x41033C2: MPIDU_Sock_writev (sock_immed.i:604)
==4138==    by 0x406D08A: MPIDI_CH3_iSendv (ch3_isendv.c:83)
==4138==    by 0x4073A1C: MPIDI_CH3_EagerContigIsend (ch3u_eager.c:460)
==4138==    by 0x40C66F4: MPID_Isend (mpid_isend.c:117)
==4138==    by 0x40B4BB0: MPIC_Sendrecv (helper_fns.c:117)
==4138==    by 0x405904A: MPIR_Allreduce (allreduce.c:284)
==4138==    by 0x405AA0D: PMPI_Allreduce (allreduce.c:684)
==4138==    by 0x4091B30: MPIR_Get_contextid (commutil.c:384)
==4138==    by 0x4089EB4: PMPI_Comm_create (comm_create.c:121)
==4138==    by 0x804B817: main (main.c:513)
==4138==  Address 0xBEF02118 is on thread 1's stack
--4138-- REDIR: 0x476806E0 (realloc) redirected to 0x400550F (realloc)
==4138==
==4138== Thread 2:
==4138== Source and destination overlap in mempcpy(0x4C8BAA8, 0x4C8BAA8, 24)
==4138==    at 0x4006B94: mempcpy (mc_replace_strmem.c:116)
==4138==    by 0x47679314: _IO_default_xsputn (in /lib/libc-2.6.so)
==4138==    by 0x476544ED: vfprintf (in /lib/libc-2.6.so)
==4138==    by 0x4766E4CB: vsprintf (in /lib/libc-2.6.so)
==4138==    by 0x4765A0BD: sprintf (in /lib/libc-2.6.so)
==4138==    by 0x80589D5: getPrevCells (scoring.c:230)
==4138==    by 0x8058EF4: getScore (scoring.c:305)
==4138==    by 0x80599F3: ComputePartitionScores (scoring.c:470)
==4138==    by 0x804B215: ScoreCompThread (main.c:392)
==4138==    by 0x479BC2FA: start_thread (in /lib/libpthread-2.6.so)
==4138==    by 0x476E593D: clone (in /lib/libc-2.6.so)

On 17/05/07, Blankenship, David <David.Blankenship at kla-tencor.com> wrote:
>
> I am doing the same type of thing with the blocking calls. Here is how
> I am doing it. This code uses the C++ MPI interface.
>
> // Probe for a message from any source
> MPI::COMM_WORLD.Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, cMPIStatus );
> int iMessageLength = cMPIStatus.Get_count( MPI_CHAR );
> // Here I resize my receive buffer if necessary
>
> // Receive the message that was just probed
> int iSource = cMPIStatus.Get_source();
> MPI::COMM_WORLD.Recv( &(cBuffer[0]), cBuffer.size(), MPI_CHAR, iSource,
> MPI_ANY_TAG, cMPIStatus );
>
>
> You could also use the tag to differentiate messages from a single
> source. This does eliminate the need to send 2 messages, one with the
> size and then one with the array. That is what I liked most about this
> solution.
>
> I hope this helps.
>
> David Blankenship
>
>
>
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Manal Helal
> Sent: Wednesday, May 16, 2007 2:44 AM
> To: mpich-discuss-digest at mcs.anl.gov
> Subject: [MPICH] non-blocking sending/receiving an array
>
> Hi
>
> I am trying to send an array. I send its size first and then the array
> itself. However, I am sending in a loop and receiving in a loop, so I
> end up receiving in a different order: I receive an array size, and
> then receive from the same sender an array of a different size that was
> sent in another iteration. I am using non-blocking communication and
> testing with 3 processes for now, though there could be more later. So
> in the receive of the array I can only specify the sender, as the one I
> received the array size from, but I can't specify the size. It is
> giving me:
>
> rank 2 in job 4  localhost.localdomain_54476   caused collective abort of all ranks
>   exit status of rank 2: killed by signal 9
> 2:  MPI_Wait(140)..........................: MPI_Wait(request=0xb6b55198, status0xb6b5519c) failed
> 2:  MPIDI_CH3U_Post_data_receive_found(163): Message from rank 0 and tag 92 truncated; 224 bytes received but buffer size is 56
>
>
> Is there a way to probe for a specific size, and receive only if the
> message has that size? MPI_Iprobe has no parameter for the count.
>
> Any ideas would greatly help.
>
> Thank you very much, Kind Regards,
>
> Manal
>
>
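
For C programs, the probe-then-receive approach David describes can be
sketched roughly as follows. This is only an illustration under assumptions:
the tag value 92 is borrowed from the error message above, while the payload
type MPI_LONG, the array lengths, and the choice of rank 0 as receiver are
made up. MPI_Iprobe has no count argument, but the status it fills in can be
passed to MPI_Get_count to learn the size before posting the receive.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_TAG 92   /* assumed tag, taken from the error message */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* Each sender ships one array whose length depends on its rank;
           no separate "size" message is sent. */
        int count = 10 * rank, i;
        long *data = malloc(count * sizeof *data);
        if (data == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (i = 0; i < count; i++)
            data[i] = rank;
        MPI_Send(data, count, MPI_LONG, 0, ARRAY_TAG, MPI_COMM_WORLD);
        free(data);
    } else {
        int pending = size - 1;
        while (pending > 0) {
            MPI_Status status;
            int flag = 0, count = 0;
            long *data;

            /* Non-blocking check for any waiting message. */
            MPI_Iprobe(MPI_ANY_SOURCE, ARRAY_TAG, MPI_COMM_WORLD,
                       &flag, &status);
            if (!flag)
                continue;             /* nothing yet; other work could go here */

            /* Size the buffer from the probed message itself. */
            MPI_Get_count(&status, MPI_LONG, &count);
            data = malloc(count * sizeof *data);
            if (data == NULL)
                MPI_Abort(MPI_COMM_WORLD, 1);

            /* Receive exactly the probed message: same source, same tag. */
            MPI_Recv(data, count, MPI_LONG, status.MPI_SOURCE,
                     status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("got %d longs from rank %d\n", count, status.MPI_SOURCE);
            free(data);
            pending--;
        }
    }

    MPI_Finalize();
    return 0;
}
```

It can be tried with something like "mpirun -np 3 ./a.out". In a multi-threaded
process, if several threads receive on the same communicator, another thread
could take the probed message between the probe and the receive, so the pair
would need to be serialized, for example with a mutex.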