[mpich-discuss] Problems with LD_PRELOAD on TCP functions

Alex Margolin alex.margolin at mail.huji.ac.il
Sat Jan 21 14:25:06 CST 2012


Hi,

I've written a general-purpose adapter for TCP-based sockets in Linux 
applications (libdicom.so). The adapter does some optimizations, but 
mostly forwards the calls to the original functions.
When I try to run it with MPICH, the application gets stuck and I have 
to terminate it with Ctrl-C.
To debug this, I've also written a dummy library which forwards the 
calls unchanged (adding only printouts). For some reason, the dummy 
fails too, but with different output...

I used LD_PRELOAD to catch the following (mostly) Berkeley socket API 
functions (a minimal sketch of the forwarding wrappers follows the list):
- Socket creation: socket, socketpair
- Socket establishment: bind, listen, connect, accept
- Socket data transfer: send, recv, read, write, readv, writev, 
recvfrom, sendto
- Socket events: select, poll
- File descriptor actions: close, dup, dup2, fcntl
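
As a point of reference, here is a minimal sketch of the kind of 
dlsym(RTLD_NEXT)-based forwarding wrapper involved; it is a simplified 
illustration, not the actual libdummy.so code:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

/* Forward socket() to the "real" implementation, printing a trace line. */
int socket(int domain, int type, int protocol)
{
    static int (*real_socket)(int, int, int) = NULL;
    if (real_socket == NULL)
        real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");

    int fd = real_socket(domain, type, protocol);
    fprintf(stderr, "socket(%d,%d,%d) -> %d\n", domain, type, protocol, fd);
    return fd;
}

The wrapper library is built as a shared object along the lines of 
"gcc -shared -fPIC -o libdummy.so dummy.c -ldl" and then injected with 
LD_PRELOAD.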

My adapter detects TCP sockets using the following condition, and should 
"handle" only these sockets (other calls are automatically forwarded):

#define SHOULDNT_BE_DICOM ((domain != AF_INET) || \
        ((protocol != IPPROTO_TCP) && (protocol != IPPROTO_IP)) || \
        (0 == (type & SOCK_STREAM)))
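
For illustration, this is roughly how the check is applied inside the 
socket() wrapper (an assumption about libdicom's internals; real_socket 
and dicom_socket are placeholder names):

#include <sys/socket.h>
#include <netinet/in.h>

extern int (*real_socket)(int domain, int type, int protocol);
extern int dicom_socket(int domain, int type, int protocol);

int socket(int domain, int type, int protocol)
{
    if (SHOULDNT_BE_DICOM)
        /* Not an AF_INET TCP stream socket: forward untouched. */
        return real_socket(domain, type, protocol);
    /* AF_INET + SOCK_STREAM with protocol IPPROTO_TCP or 0 (IPPROTO_IP):
     * handled by the adapter. */
    return dicom_socket(domain, type, protocol);
}

Note that a plain socket(AF_INET, SOCK_STREAM, 0) call falls into the 
"handled" case, because protocol 0 is IPPROTO_IP.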

To keep things simple, both the dummy and the adapter break 
readv/writev into a loop of read/write calls. It's not great, but it 
should work; a sketch of the read side of this emulation follows.
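
Here is roughly what the readv emulation looks like (an assumption about 
the structure rather than the exact library code; my_readv is a 
placeholder name). The subtle point is that MPICH's sockets appear to be 
nonblocking (hence the -1 returns in the traces below), so when read() 
fails after some data has already been consumed, readv() semantics 
require returning the byte count so far rather than -1:

#include <sys/uio.h>
#include <unistd.h>

ssize_t my_readv(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;
    for (int i = 0; i < iovcnt; i++) {
        ssize_t n = read(fd, iov[i].iov_base, iov[i].iov_len);
        if (n < 0)
            /* Nothing consumed yet: report the error (typically EAGAIN
             * here); otherwise report the bytes already read. */
            return (total == 0) ? -1 : total;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;    /* short read or EOF: no more data right now */
    }
    return total;
}
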
I've used strace -p <pid> to check the reason for the freeze (with 
libdicom) and found two infinite poll() loops on other FDs (not 6 or 12).
The code for 'simple' is at the bottom, and it is fairly simple.
Below are my attempts to debug this... It's kind of a mess, but I've 
added comments (anything after '%').


$ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
Started as #Started as #0 out of 21 out of 2

writev(12)->send(12) * 2            % writev called on TCP-socket of FD 12
sent on 12: 8 - [0 0 0 0 4 0 0 0 ]  % sent these 8 bytes on #12
read(6)->recv(6)                    % read called on TCP-socket of FD 6 
(connected to 12)
recv(6)...
recv on 6: 8 [0 0 0 0 4 0 0 0 ]     % Got these 8 bytes on #6
readv(6)->recv(6) * 1               % readv called on #6
recv(6)...
recv on 6: -1 []                    % Got return value of -1! (could be 
an error or EAGAIN)
WRITEV 0/2(8) on 12 => 8            % first part of writev() on #12 
complete: 8 bytes sent (see first 2 lines)
sent on 12: 4 - [0 0 0 0 ]          % another 4 bytes sent on #12
WRITEV 1/2(4) on 12 => 4            % second part of writev() on #12 
complete: 4 bytes sent (see previous line) for a total of 12 so far
READV 0/1(4) on 6 => -1             % readv on #6 passes on the -1 
(EAGAIN? didn't wait for second part of writev?!)
read(12)->recv(12)
recv(12)...
recv on 12: -1 []                   % socket #12 reads -1 (EAGAIN?)
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x7fff7103b2c0, count=1, 
MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1827): Communication error with rank 1: Connection 
reset by peer

^C^C
$ LD_PRELOAD=libdicom.so ~/mpich/bin/mpiexec -n 2 simple
Started as #0 out of 2
Started as #1 out of 2
read(6) -> recv(6)
Exchanged #6: <1,9505,1,9503(9503)>   % connection established between 
the two PIDs - same FDs as in dummy case
Exchanged #12: <1,9503,1,9505(9505)>
RDV BEFORE (on 6)
RDV AFTER (6,8,[11])
recv on 6: 8/8 - <0 0 0 0 4 0 0 0 >   % #6 got first 8 bytes
RDV BEFORE (on 6)
sent on 12: 8/8 - [0 0 0 0 4 0 0 0 ]  % #12 sent first 8 bytes
WRITEV 0/2(8) on 12 => 8
RDV AFTER (6,4,[11])
recv on 6: 4/4 - <0 0 0 0 >           % #6 got next 4 bytes
READV 0/1(4) on 6 => 4
write(6) -> send(6)
sent on 12: 4/4 - [0 0 0 0 ]          %  #12 sent next 4 bytes
WRITEV 1/2(4) on 12 => 4
sent on 6: 8/8 - [1 0 0 0 0 0 0 0 ]   % #6 sent another 8 bytes (what do 
they mean?!)

^C^C
$ ~/mpich/bin/mpiexec -n 2 simple
Started as #1 out of 2
Started as #0 out of 2
#0 Got 0 from 0
$

================================================
simple.cpp
================================================
#include "mpi.h"
#include <iostream>

int main (int argc, char** argv)
{
   int index, rank, np, comm;
   MPI::Init(argc, argv);
   np = MPI::COMM_WORLD.Get_size();
   rank = MPI::COMM_WORLD.Get_rank();
   std::cout<<"Started as #"<<rank<<" out of "<<np<<std::endl;
   for (index = 0; (index < rank); index++)
     {
       MPI::COMM_WORLD.Recv(&comm, 1, MPI::INT, index, rank);
       std::cout<<"#"<<rank-1<<" Got "<<comm<<" from "<<index<<"\n";
     }

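   // Send this rank's number to each higher rank (tag = receiver's rank).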
   comm = rank;
   for (index = rank + 1; (index < np); index++)
     {
       MPI::COMM_WORLD.Send(&comm, 1, MPI::INT, index, index);
     }

   MPI::Finalize();
}
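
For reference, I build the test program with MPICH's mpicxx wrapper and 
run it as shown above; on my machine the commands look roughly like this 
(paths are from my local install, so treat them as an example):

$ ~/mpich/bin/mpicxx simple.cpp -o simple
$ LD_PRELOAD=libdicom.so ~/mpich/bin/mpiexec -n 2 simple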

Any help or leads would be greatly appreciated,
Alex

P.S. I've built a version of mpich2-1.4.1p1, but I've checked other 
versions earlier, with similar results :-(

