[mpich-discuss] Problems with LD_PRELOAD on TCP functions
Alex Margolin
alex.margolin at mail.huji.ac.il
Sat Jan 21 14:25:06 CST 2012
Hi,
I've written a general-purpose adapter for TCP-based sockets in Linux
apps (libdicom.so).
The adapter does some optimizations, but mostly forwards the calls to
the original functions.
When I tried to run it with MPICH, the application got stuck (I used
Ctrl-C to terminate it).
To debug this, I've also written a dummy lib which forwards calls
without any changes (only printouts).
For some reason, the dummy fails too, but with different output...
I used LD_PRELOAD to intercept the following (mostly Berkeley socket API)
functions; a simplified sketch of the forwarding pattern follows the list:
- Socket creation: socket, socketpair
- Socket establishment: bind, listen, connect, accept
- Socket data transfer: send, recv, read, write, readv, writev,
recvfrom, sendto
- Socket events: select, poll
- File descriptor actions: close, dup, dup2, fcntl
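The wrappers are based on the usual dlsym(RTLD_NEXT) interposition
pattern: resolve the real symbol once, print a line, then call it. A
simplified sketch of a single wrapper (not the actual source of either
library):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t) = NULL;

ssize_t read(int fd, void *buf, size_t count)
{
    /* resolve the libc symbol the first time we're called */
    if (real_read == NULL)
        real_read = (ssize_t (*)(int, void *, size_t)) dlsym(RTLD_NEXT, "read");

    fprintf(stderr, "read(%d)\n", fd);   /* the dummy only prints... */
    return real_read(fd, buf, count);    /* ...and forwards unchanged */
}

(built along the lines of: gcc -shared -fPIC -o libdummy.so dummy.c -ldl)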
My adapter detects TCP sockets using the following condition, and should
"handle" only these sockets (other calls are automatically forwarded):
#define SHOULDNT_BE_DICOM ((domain != AF_INET) || \
    ((protocol != IPPROTO_TCP) && (protocol != IPPROTO_IP)) || \
    (0 == (type & SOCK_STREAM)))
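To illustrate how that condition is applied, the socket() wrapper looks
roughly like this (simplified sketch; dicom_track_fd() is only a
placeholder name for the adapter's bookkeeping, not a real function):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* same condition as above */
#define SHOULDNT_BE_DICOM ((domain != AF_INET) || \
    ((protocol != IPPROTO_TCP) && (protocol != IPPROTO_IP)) || \
    (0 == (type & SOCK_STREAM)))

static int (*real_socket)(int, int, int) = NULL;

int socket(int domain, int type, int protocol)
{
    if (real_socket == NULL)
        real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");

    int fd = real_socket(domain, type, protocol);
    if (fd >= 0 && !SHOULDNT_BE_DICOM) {
        /* TCP stream socket: remember the fd so that later calls on it
         * (send/recv/poll/close/...) take the adapter's path.         */
        /* dicom_track_fd(fd);  <- placeholder, not a real function    */
    }
    return fd;
}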
To keep things simple, both the dummy and the adapter break
readv/writev into a loop of read/write calls. It's not great, but it
should work.
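Conceptually the emulation is along these lines (a simplified sketch,
not the actual code). The delicate part is that on a non-blocking socket
a read() in the middle of the loop can return -1 with errno == EAGAIN,
so the wrapper has to report any partial byte count and leave errno intact:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t) = NULL;

ssize_t readv(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;
    int i;

    if (real_read == NULL)
        real_read = (ssize_t (*)(int, void *, size_t)) dlsym(RTLD_NEXT, "read");

    for (i = 0; i < iovcnt; i++) {
        ssize_t n = real_read(fd, iov[i].iov_base, iov[i].iov_len);
        if (n < 0)
            return (total > 0) ? total : -1;  /* don't clobber errno from read() */
        total += n;
        if ((size_t) n < iov[i].iov_len)
            break;                            /* short read: report what we have */
    }
    return total;
}

The writev-to-write/send loop is symmetric.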
I used strace -p <pid> to check the reason for the freeze (with
libdicom) and found two infinite poll() loops on other FDs (not 6 or 12).
The code for 'simple' is at the bottom, and it's fairly simple.
Below are my attempts to debug this... It's kind of a mess, but I've
added comments (anything after '%').
$ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
Started as #Started as #0 out of 21 out of 2
writev(12)->send(12) * 2 % writev called on TCP-socket of FD 12
sent on 12: 8 - [0 0 0 0 4 0 0 0 ] % sent these 8 bytes on #12
read(6)->recv(6) % read called on TCP-socket of FD 6
(connected to 12)
recv(6)...
recv on 6: 8 [0 0 0 0 4 0 0 0 ] % Got on #6 the 8 bytes sent on #12
readv(6)->recv(6) * 1 % readv called on #6
recv(6)...
recv on 6: -1 [] % Got return value of -1! (could be
an error or EAGAIN)
WRITEV 0/2(8) on 12 => 8 % first part of writev() on #12
complete: 8 bytes sent (see first 2 lines)
sent on 12: 4 - [0 0 0 0 ] % another 4 bytes sent on #12
WRITEV 1/2(4) on 12 => 4 % second part of writev() on #12
complete: 4 bytes sent (see previous line), for a total of 12 so far
READV 0/1(4) on 6 => -1 % readv on #6 passes on the -1
(EAGAIN? didn't wait for second part of writev?!)
read(12)->recv(12)
recv(12)...
recv on 12: -1 [] % socket #12 reads -1 (EAGAIN?)
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x7fff7103b2c0, count=1,
MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1827): Communication error with rank 1: Connection
reset by peer
^C^C
$ LD_PRELOAD=libdicom.so ~/mpich/bin/mpiexec -n 2 simple
Started as #0 out of 2
Started as #1 out of 2
read(6) -> recv(6)
Exchanged #6: <1,9505,1,9503(9503)> % connection established between
the two PIDs - same FDs as in dummy case
Exchanged #12: <1,9503,1,9505(9505)>
RDV BEFORE (on 6)
RDV AFTER (6,8,[11])
recv on 6: 8/8 - <0 0 0 0 4 0 0 0 > % #6 got first 8 bytes
RDV BEFORE (on 6)
sent on 12: 8/8 - [0 0 0 0 4 0 0 0 ] % #12 sent first 8 bytes
WRITEV 0/2(8) on 12 => 8
RDV AFTER (6,4,[11])
recv on 6: 4/4 - <0 0 0 0 > % #6 got next 4 bytes
READV 0/1(4) on 6 => 4
write(6) -> send(6)
sent on 12: 4/4 - [0 0 0 0 ] % #12 sent next 4 bytes
WRITEV 1/2(4) on 12 => 4
sent on 6: 8/8 - [1 0 0 0 0 0 0 0 ] % #6 sent another 8 bytes (what do
they mean?!)
^C^C
$ ~/mpich/bin/mpiexec -n 2 simple
Started as #1 out of 2
Started as #0 out of 2
#0 Got 0 from 0
$
================================================
simple.cpp
================================================
#include "mpi.h"
#include <iostream>
int main (int argc, char** argv)
{
int index, rank, np, comm;
MPI::Init(argc, argv);
np = MPI::COMM_WORLD.Get_size();
rank = MPI::COMM_WORLD.Get_rank();
std::cout<<"Started as #"<<rank<<" out of "<<np<<std::endl;
for (index = 0; (index < rank); index++)
{
MPI::COMM_WORLD.Recv(&comm, 1, MPI::INT, index, rank);
std::cout<<"#"<<rank-1<<" Got "<<comm<<" from "<<index<<"\n";
}
comm = rank;
for (index = rank + 1; (index < np); index++)
{
MPI::COMM_WORLD.Send(&comm, 1, MPI::INT, index, index);
}
MPI::Finalize();
}
Any help or lead would be greatly appreciated,
Alex
P.S. I built mpich2-1.4.1p1 for this, but I've checked other
versions earlier, with similar results :-(