[mpich-discuss] Problems with LD_PRELOAD on TCP functions
Dave Goodell
goodell at mcs.anl.gov
Mon Jan 23 13:23:46 CST 2012
Not sure what's happening, but I'm guessing that the bug lies in your wrapper layer somewhere since MPICH2 seems to work reliably on most stock POSIX-ish operating systems. Are you handling O_NONBLOCK sockets correctly in your layer? MPICH2 uses them heavily.
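For illustration, a forwarding wrapper over recv() should look roughly like this, a minimal sketch of what I mean, not your actual code (note also that a printout between the real call and your return can overwrite errno, which matters for EAGAIN):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/socket.h>

static ssize_t (*real_recv)(int, void *, size_t, int);

ssize_t recv(int fd, void *buf, size_t len, int flags)
{
    if (!real_recv)
        real_recv = (ssize_t (*)(int, void *, size_t, int)) dlsym(RTLD_NEXT, "recv");

    /* On an O_NONBLOCK socket, -1 with errno EAGAIN/EWOULDBLOCK is normal
     * flow control for MPICH2's progress engine: forward it as-is, don't
     * retry in a loop here, and don't let stdio clobber errno before
     * returning to the caller. */
    return real_recv(fd, buf, len, flags);
}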
-Dave
On Jan 21, 2012, at 2:33 PM CST, Alex Margolin wrote:
> On 01/21/2012 10:25 PM, Alex Margolin wrote:
>> Hi,
>>
>> I've written a general-purpose adapter for TCP-based sockets in Linux apps (libdicom.so).
>> The adapter does some optimizations, but mostly forwards the calls to the original functions.
>> When I tried to run it with MPICH, the application got stuck (I had to use Ctrl-C to terminate it).
>> To debug this, I've also written a dummy lib which forwards calls without any changes (only printouts).
>> For some reason, the dummy fails too, but with different output...
>>
>> I used LD_PRELOAD to catch the following (mostly) Berkeley socket API functions:
>> - Socket creation: socket, socketpair
>> - Socket establishment: bind, listen, connect, accept
>> - Socket data transfer: send, recv, read, write, readv, writev, recvfrom, sendto
>> - Socket events: select, poll
>> - File descriptor actions: close, dup, dup2, fcntl
>>
>> My adapter detects TCP sockets using the following condition, and should "handle" only these sockets (other calls are automatically forwarded):
>> #define SHOULDNT_BE_DICOM ((domain != AF_INET) || ((protocol != IPPROTO_TCP) && (protocol != IPPROTO_IP)) || (0 == (type & SOCK_STREAM)))
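>>
>> A rough sketch of how this check sits in the socket() wrapper (simplified for the list; the bookkeeping call is just a placeholder, not the real libdicom code):
>>
>> #define _GNU_SOURCE
>> #include <dlfcn.h>
>> #include <sys/socket.h>
>> #include <netinet/in.h>
>>
>> /* uses the SHOULDNT_BE_DICOM macro defined above */
>> int socket(int domain, int type, int protocol)
>> {
>>     static int (*real_socket)(int, int, int);
>>     if (!real_socket)
>>         real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");
>>
>>     int fd = real_socket(domain, type, protocol);
>>     /* Only AF_INET stream sockets with IPPROTO_TCP or IPPROTO_IP (0)
>>      * are handled; everything else is forwarded untouched. */
>>     if (fd >= 0 && !SHOULDNT_BE_DICOM)
>>         dicom_track_fd(fd);   /* placeholder for the adapter's bookkeeping */
>>     return fd;
>> }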
>>
>> To keep things simple, both the dummy and the adapter break readv/writev into a loop of read/write calls. It's not great, but it should work.
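>> Roughly, the writev() side of that emulation looks like this (a simplified sketch, not the exact dummy/adapter code; the tricky part is a short write or EAGAIN in the middle of the loop, where writev() has to report the bytes already written rather than the last write()'s result):
>>
>> #define _GNU_SOURCE
>> #include <dlfcn.h>
>> #include <errno.h>
>> #include <sys/uio.h>
>> #include <unistd.h>
>>
>> ssize_t writev(int fd, const struct iovec *iov, int iovcnt)
>> {
>>     static ssize_t (*real_write)(int, const void *, size_t);
>>     if (!real_write)
>>         real_write = (ssize_t (*)(int, const void *, size_t)) dlsym(RTLD_NEXT, "write");
>>
>>     ssize_t total = 0;
>>     for (int i = 0; i < iovcnt; i++) {
>>         ssize_t n = real_write(fd, iov[i].iov_base, iov[i].iov_len);
>>         if (n < 0)
>>             return total > 0 ? total : -1;   /* report progress, or -1 with errno set */
>>         total += n;
>>         if ((size_t) n < iov[i].iov_len)
>>             break;                           /* short write on a nonblocking fd: stop */
>>     }
>>     return total;
>> }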
>> I've used strace -p <pid> to check the reason for the freeze (with libdicom) and found two infinite poll() loops on other FDs (not 6 or 12).
>> The code for 'simple' is at the bottom, and it is fairly simple.
>> Below are my attempts to debug this... It's kind of a mess, but I've added comments (anything after '%').
>>
>>
>> $ LD_PRELOAD=libdummy.so ~/mpich/bin/mpiexec -n 2 simple
>> Started as #Started as #0 out of 21 out of 2
>>
>> writev(12)->send(12) * 2 % writev called on TCP-socket of FD 12
>> sent on 12: 8 - [0 0 0 0 4 0 0 0 ] % sent these 8 bytes on #12
>> read(6)->recv(6) % read called on TCP-socket of FD 6 (connected to 12)
>> recv(6)...
>> recv on 6: 8 [0 0 0 0 4 0 0 0 ] % got on #6 the 8 bytes that were sent on #12
>> readv(6)->recv(6) * 1 % readv called on #6
>> recv(6)...
>> recv on 6: -1 [] % Got return value of -1! (could be an error or EAGAIN)
>> WRITEV 0/2(8) on 12 => 8 % first part of writev() on #12 complete: 8 bytes sent (see first 2 lines)
>> sent on 12: 4 - [0 0 0 0 ] % another 4 bytes sent on #12
>> WRITEV 1/2(4) on 12 => 4 % second part of writev() on #12 complete: 4 bytes sent (see previous line) for a total of 12 so far
>> READV 0/1(4) on 6 => -1 % readv on #6 passes on the -1 (EAGAIN? didn't wait for second part of writev?!)
>> read(12)->recv(12)
>> recv(12)...
>> recv on 12: -1 [] % socket #12 reads -1 (EAGAIN?)
>> Fatal error in MPI_Send: Other MPI error, error stack:
>> MPI_Send(173)..............: MPI_Send(buf=0x7fff7103b2c0, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
>> MPID_nem_tcp_connpoll(1827): Communication error with rank 1: Connection reset by peer
>>
>> ^C^C
>> $ LD_PRELOAD=libdicom.so ~/mpich/bin/mpiexec -n 2 simple
>> Started as #0 out of 2
>> Started as #1 out of 2
>> read(6) -> recv(6)
>> Exchanged #6: <1,9505,1,9503(9503)> % connection established between the two PIDs - same FDs as in dummy case
>> Exchanged #12: <1,9503,1,9505(9505)>
>> RDV BEFORE (on 6)
>> RDV AFTER (6,8,[11])
>> recv on 6: 8/8 - <0 0 0 0 4 0 0 0 > % #6 got first 8 bytes
>> RDV BEFORE (on 6)
>> sent on 12: 8/8 - [0 0 0 0 4 0 0 0 ] % #12 sent first 8 bytes
>> WRITEV 0/2(8) on 12 => 8
>> RDV AFTER (6,4,[11])
>> recv on 6: 4/4 - <0 0 0 0 > % #6 got next 4 bytes
>> READV 0/1(4) on 6 => 4
>> write(6) -> send(6)
>> sent on 12: 4/4 - [0 0 0 0 ] % #12 sent next 4 bytes
>> WRITEV 1/2(4) on 12 => 4
>> sent on 6: 8/8 - [1 0 0 0 0 0 0 0 ] % #6 sent another 8 bytes (what do they mean?!)
>>
>> ^C^C
>> $ ~/mpich/bin/mpiexec -n 2 simple
>> Started as #1 out of 2
>> Started as #0 out of 2
>> #0 Got 0 from 0
>> $
>>
>> ================================================
>> simple.cpp
>> ================================================
>> #include "mpi.h"
>> #include <iostream>
>>
>> int main (int argc, char** argv)
>> {
>>     int index, rank, np, comm;
>>     MPI::Init(argc, argv);
>>     np = MPI::COMM_WORLD.Get_size();
>>     rank = MPI::COMM_WORLD.Get_rank();
>>     std::cout<<"Started as #"<<rank<<" out of "<<np<<std::endl;
>>     for (index = 0; (index < rank); index++)
>>     {
>>         MPI::COMM_WORLD.Recv(&comm, 1, MPI::INT, index, rank);
>>         std::cout<<"#"<<rank-1<<" Got "<<comm<<" from "<<index<<"\n";
>>     }
>>
>>     comm = rank;
>>     for (index = rank + 1; (index < np); index++)
>>     {
>>         MPI::COMM_WORLD.Send(&comm, 1, MPI::INT, index, index);
>>     }
>>
>>     MPI::Finalize();
>> }
>>
>> Any help or lead would be greatly appreciated,
>> Alex
>>
>> P.S. I've built mpich2-1.4.1p1, but I've tried other versions earlier, with similar results :-(
> I've run the dummy several times, because I thought it might be some kind of race condition, and I got one bizarre run...
>
> - Any thoughts?
>
> $ LD_PRELOAD=/libdummy.so ~/mpich/bin/mpiexec -n 2 simple
> Started as #Started as #1 out of 0 out of 2
> 2
> writev(12)->send(12) * 2
> sent on 12: 8 - [0 0 0 0 4 0 0 0 ]
> WRITEV 0/2(8) on 12 => 8
> sent on 12: 4 - [0
> 0 read(6)->recv(6)
> recv(6)...
> 0 recv on 6: 8 [0 0 ]0 0 0 4 0 0 0 ]
> readv(6)->recv(6) * 1
> recv(6)...
> recv on 6: 4 [0 0 0 0 ]
>
> WRITEV 1/2(4) on 12 => 4
> READV 0/1(4) on 6 => 4
> write(6)->send(6)
> sent on 6: 8 - [1 0 0 0 0 0 0 0 ]
> read(12)->recv(12)
> recv(12)...
> recv on 12: 8 [1 0 0 0 0 0 0 0 ]
> writev(12)->send(12) * 1
> sent on 12: 40 - [recv(6)...
> 1 recv on 6: 40 [1 0 0 0 -62 127 0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 -62 127 0 0 0 0 0 0 0 0 0 0 ]
> #0 Got 0 from 0
> writev(6)->send(6) * 2
> 0 0 0 -62 127 0 0 1 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 -62 127 0 0 0 0 0 0 0 0 0 0 ]
> WRITEV 0/1(40) on 12 => 40
> writev(12)->send(12) * 2
> sent on 12: 40 - [22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64 sent on 6: 40 - [22 0 0 0 0 0 0 80 115 0 0 0 0 0 12 66 68 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 64 80 115 0 0 0 0 0 0 -1 127 0 0 ]
> WRITEV 0/2(40) on 12 => 40
> sent on 12: -1 - []0 12 66 68 0 0 0
> WRITEV 1/2(0) on 12 => -1
> 0 0 64 46 115 0 0 0 0 0 ]Assertion failed in file ch3u_handle_connection.c at line 255: MPIU_Object_get_ref(((sreq))) >= 0
> internal ABORT - process 0
>
> WRITEV 0/2(40) on 6 => 40
> sent on 6: -1 - []
> WRITEV 1/2(0) on 6 => -1
> Assertion failed in file ch3u_handle_connection.c at line 255: MPIU_Object_get_ref(((sreq))) >= 0
> internal ABORT - process 1
>
> _______________________________________________
> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss