[mpich-discuss] Problems with LD_PRELOAD on TCP functions

Alex Margolin alex.margolin at mail.huji.ac.il
Sat Jan 21 14:33:02 CST 2012


On 01/21/2012 10:25 PM, Alex Margolin wrote:
> Hi,
>
> I've written a general-purpose adapter for TCP-based sockets in Linux 
> apps (libdicom.so).
> The adapter does some optimizations, but mostly forwards the calls to 
> the original functions.
> When I tried to run it with MPICH, the application got stuck (I used 
> Ctrl-C to terminate it).
> To debug this, I've also written a dummy lib which forwards all calls 
> unchanged, adding only printouts.
> For some reason, the dummy fails too, but with different output...
>
> I used LD_PRELOAD to intercept the following (mostly Berkeley socket 
> API) functions; a minimal forwarding sketch follows the list:
> - Socket creation: socket, socketpair
> - Socket establishment: bind, listen, connect, accept
> - Socket data transfer: send, recv, read, write, readv, writev, 
> recvfrom, sendto
> - Socket events: select, poll
> - File descriptor actions: close, dup, dup2, fcntl
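>
> For reference, each wrapper resolves the real libc function with 
> dlsym(RTLD_NEXT, ...) and forwards to it whenever the descriptor isn't 
> one of ours. A simplified sketch of the send() wrapper, where 
> is_dicom_fd() and dicom_send() stand in for the adapter's internals 
> (link with -ldl):
>
> #define _GNU_SOURCE
> #include <dlfcn.h>
> #include <sys/types.h>
> #include <sys/socket.h>
>
> extern int is_dicom_fd(int fd);                   /* stand-in */
> extern ssize_t dicom_send(int, const void *, size_t, int);
>
> static ssize_t (*real_send)(int, const void *, size_t, int);
>
> ssize_t send(int sockfd, const void *buf, size_t len, int flags)
> {
>     if (!real_send)   /* resolve the real symbol lazily */
>         real_send = (ssize_t (*)(int, const void *, size_t, int))
>                     dlsym(RTLD_NEXT, "send");
>     if (!is_dicom_fd(sockfd))                     /* not ours: forward */
>         return real_send(sockfd, buf, len, flags);
>     return dicom_send(sockfd, buf, len, flags);   /* ours: handle */
> }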
>
> My adapter detects TCP sockets using the following condition, and 
> should "handle" only these sockets (other calls are automatically 
> forwarded):
> #define SHOULDNT_BE_DICOM ((domain != AF_INET) ||                  \
>     ((protocol != IPPROTO_TCP) && (protocol != IPPROTO_IP)) ||     \
>     (0 == (type & SOCK_STREAM)))
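>
> (The macro uses the bare names domain, type and protocol so it can sit 
> directly inside the socket() wrapper, roughly like this - track_fd() 
> stands in for the real bookkeeping, and real_socket is resolved as in 
> the send() sketch above:
>
> static int (*real_socket)(int, int, int);
> extern void track_fd(int fd);            /* stand-in bookkeeping */
>
> int socket(int domain, int type, int protocol)
> {
>     if (!real_socket)
>         real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
>     int fd = real_socket(domain, type, protocol);
>     if (fd >= 0 && !SHOULDNT_BE_DICOM)
>         track_fd(fd);                    /* remember it as a DICOM socket */
>     return fd;
> }
>
> Note that a plain socket(AF_INET, SOCK_STREAM, 0) call passes the test, 
> since protocol 0 is IPPROTO_IP.)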
>
> To keep things simple, both the dummy and the adapter break 
> readv/writev into a loop of read/write calls. It's not great, but it 
> should work.
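>
> One pitfall I need to be careful about here, assuming MPICH's sockets 
> are nonblocking: if an early write() succeeds but a later one fails 
> with EAGAIN, the loop must report the bytes already written, not -1, or 
> the caller will think nothing was sent. A careful version looks roughly 
> like:
>
> #include <errno.h>
> #include <sys/uio.h>
> #include <unistd.h>
>
> ssize_t writev_as_loop(int fd, const struct iovec *iov, int iovcnt)
> {
>     ssize_t total = 0;
>     for (int i = 0; i < iovcnt; i++) {
>         ssize_t n = write(fd, iov[i].iov_base, iov[i].iov_len);
>         if (n < 0) {
>             /* only return -1 if nothing was written at all;
>              * otherwise report the partial progress */
>             return total > 0 ? total : -1;
>         }
>         total += n;
>         if ((size_t)n < iov[i].iov_len)
>             break;   /* short write: stop and let the caller retry */
>     }
>     return total;
> }
>
> If instead writev() returns -1 after some parts already went out (as 
> the "WRITEV 1/2(0) on 12 => -1" lines in the bizarre run below seem to 
> show), the caller loses track of those bytes, which might explain the 
> assertion failure there.
>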
> I've used strace -p <pid> to check the reason for the freeze (with 
> libdicom) and found two infinite poll() loops on other FDs (not 6 or 12).
> The code for 'simple' is at the bottom, and is fairly simple.
> Below are my attempts to debug this... It's kind of a mess, but I've 
> added comments (anything after '%').
>
>
> $ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
> Started as #Started as #0 out of 21 out of 2
>
> writev(12)->send(12) * 2            % writev called on TCP-socket of 
> FD 12
> sent on 12: 8 - [0 0 0 0 4 0 0 0 ]  % sent these 8 bytes on #12
> read(6)->recv(6)                    % read called on TCP-socket of FD 
> 6 (connected to 12)
> recv(6)...
> recv on 6: 8 [0 0 0 0 4 0 0 0 ]     % Got these 8 bytes on #6
> readv(6)->recv(6) * 1               % readv called on #6
> recv(6)...
> recv on 6: -1 []                    % Got return value of -1! (could 
> be an error or EAGAIN)
> WRITEV 0/2(8) on 12 => 8            % first part of writev() on #12 
> complete: 8 bytes sent (see first 2 lines)
> sent on 12: 4 - [0 0 0 0 ]          % another 4 bytes sent on #12
> WRITEV 1/2(4) on 12 => 4            % second part of writev() on #12 
> complete: 4 bytes sent (see previous line) for a total of 12 so far
> READV 0/1(4) on 6 => -1             % readv on #6 passes on the -1 
> (EAGAIN? didn't wait for second part of writev?!)
> read(12)->recv(12)
> recv(12)...
> recv on 12: -1 []                   % socket #12 reads -1 (EAGAIN?)
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173)..............: MPI_Send(buf=0x7fff7103b2c0, count=1, 
> MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
> MPID_nem_tcp_connpoll(1827): Communication error with rank 1: 
> Connection reset by peer
>
> ^C^C
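>
> (Side note: the bare -1 values above are ambiguous. Making the dummy 
> log errno would tell EAGAIN apart from a real error, e.g. in the recv() 
> wrapper:
>
> #define _GNU_SOURCE
> #include <dlfcn.h>
> #include <errno.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/types.h>
>
> static ssize_t (*real_recv)(int, void *, size_t, int);
>
> ssize_t recv(int fd, void *buf, size_t len, int flags)
> {
>     if (!real_recv)
>         real_recv = (ssize_t (*)(int, void *, size_t, int))
>                     dlsym(RTLD_NEXT, "recv");
>     ssize_t n = real_recv(fd, buf, len, flags);
>     if (n < 0)   /* log errno so EAGAIN is distinguishable */
>         fprintf(stderr, "recv on %d: -1 (errno=%d: %s)\n",
>                 fd, errno, strerror(errno));
>     return n;
> }
>
> so the traces would show whether those -1s are EAGAIN.)
>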
> $ LD_PRELOAD=libdicom.so ~/mpich/bin/mpiexec -n 2 simple
> Started as #0 out of 2
> Started as #1 out of 2
> read(6) -> recv(6)
> Exchanged #6: <1,9505,1,9503(9503)>   % connection established between 
> the two PIDs - same FDs as in dummy case
> Exchanged #12: <1,9503,1,9505(9505)>
> RDV BEFORE (on 6)
> RDV AFTER (6,8,[11])
> recv on 6: 8/8 - <0 0 0 0 4 0 0 0 >   % #6 got first 8 bytes
> RDV BEFORE (on 6)
> sent on 12: 8/8 - [0 0 0 0 4 0 0 0 ]  % #12 sent first 8 bytes
> WRITEV 0/2(8) on 12 => 8
> RDV AFTER (6,4,[11])
> recv on 6: 4/4 - <0 0 0 0 >           % #6 got next 4 bytes
> READV 0/1(4) on 6 => 4
> write(6) -> send(6)
> sent on 12: 4/4 - [0 0 0 0 ]          %  #12 sent next 4 bytes
> WRITEV 1/2(4) on 12 => 4
> sent on 6: 8/8 - [1 0 0 0 0 0 0 0 ]   % #6 sent another 8 bytes (what 
> do they mean?!)
>
> ^C^C
> $ ~/mpich/bin/mpiexec -n 2 simple
> Started as #1 out of 2
> Started as #0 out of 2
> #0 Got 0 from 0
> $
>
> ================================================
> simple.cpp
> ================================================
> #include "mpi.h"
> #include <iostream>
>
> int main (int argc, char** argv)
> {
>   int index, rank, np, comm;
>   MPI::Init(argc, argv);
>   np = MPI::COMM_WORLD.Get_size();
>   rank = MPI::COMM_WORLD.Get_rank();
>   std::cout<<"Started as #"<<rank<<" out of "<<np<<std::endl;
>
>   // Receive one int from every lower rank (tag = my own rank).
>   for (index = 0; (index < rank); index++)
>     {
>       MPI::COMM_WORLD.Recv(&comm, 1, MPI::INT, index, rank);
>       // Prints rank-1, so rank 1 reports itself as "#0"
>       // (matching the "#0 Got 0 from 0" lines in the runs above).
>       std::cout<<"#"<<rank-1<<" Got "<<comm<<" from "<<index<<"\n";
>     }
>
>   // Send my rank to every higher rank (tag = receiver's rank).
>   comm = rank;
>   for (index = rank + 1; (index < np); index++)
>     {
>       MPI::COMM_WORLD.Send(&comm, 1, MPI::INT, index, index);
>     }
>
>   MPI::Finalize();
> }
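>
> (For reference, the test program is built with MPICH's C++ compiler 
> wrapper and run as shown above, i.e. something like:
>
> $ ~/mpich/bin/mpicxx simple.cpp -o simple
> $ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
> )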
>
> Any help or lead would be greatly appreciated,
> Alex
>
> P.S. I've built mpich2-1.4.1p1 for this, but I've checked other 
> versions earlier, with similar results :-(
I've run the dummy several times, because I thought it might be some 
kind of race condition, and I got one bizarre run...

  - Any thoughts?

$ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
Started as #Started as #1 out of 0 out of 2
2
writev(12)->send(12) * 2
sent on 12: 8 - [0 0 0 0 4 0 0 0 ]
WRITEV 0/2(8) on 12 => 8
sent on 12: 4 - [0
0 read(6)->recv(6)
recv(6)...
0 recv on 6: 8 [0 0 ]0 0 0 4 0 0 0 ]
readv(6)->recv(6) * 1
recv(6)...
recv on 6: 4 [0 0 0 0 ]

WRITEV 1/2(4) on 12 => 4
READV 0/1(4) on 6 => 4
write(6)->send(6)
sent on 6: 8 - [1 0 0 0 0 0 0 0 ]
read(12)->recv(12)
recv(12)...
recv on 12: 8 [1 0 0 0 0 0 0 0 ]
writev(12)->send(12) * 1
sent on 12: 40 - [recv(6)...
1 recv on 6: 40 [1 0 0 0 -62 127 0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 
0 0 -62 127 0 0 0 0 0 0 0 0 0 0 ]
#0 Got 0 from 0
writev(6)->send(6) * 2
0 0 0 -62 127 0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 -62 127 0 0 0 
0 0 0 0 0 0 0 ]
WRITEV 0/1(40) on 12 => 40
writev(12)->send(12) * 2
sent on 12: 40 - [22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64 sent on 6: 40 - 
[22 0 0 0 0 0 0 80 115 0 0 0 0 0 12 66 68 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 
0 64 80 115 0 0 0 0 0 0 -1 127 0 0 ]
WRITEV 0/2(40) on 12 => 40
sent on 12: -1 - []0 12 66 68 0 0 0
WRITEV 1/2(0) on 12 => -1
0 0 64 46 115 0 0 0 0 0 ]Assertion failed in file 
ch3u_handle_connection.c at line 255: MPIU_Object_get_ref(((sreq))) >= 0
internal ABORT - process 0

WRITEV 0/2(40) on 6 => 40
sent on 6: -1 - []
WRITEV 1/2(0) on 6 => -1
Assertion failed in file ch3u_handle_connection.c at line 255: 
MPIU_Object_get_ref(((sreq))) >= 0
internal ABORT - process 1


