[MPICH] error information----tcp port TIME_WAIT?

Rajeev Thakur thakur at mcs.anl.gov
Fri May 12 15:17:57 CDT 2006


MPICH-GM communicates using the GM message-passing library, which is the
native communication layer on Myrinet, so it shouldn't open any TCP
sockets. Does your application itself open sockets (and perhaps leave
them open)?
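One quick way to check whether a process itself is holding sockets on Linux is to inspect its file descriptors under /proc. The helper below is a hypothetical diagnostic sketch (not part of MPICH), shown demonstrating on its own process:

```python
import os
import socket

def open_socket_inodes(pid):
    """Return the socket entries a process currently holds (Linux /proc)."""
    inodes = set()
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
        if target.startswith("socket:["):
            inodes.add(target)
    return inodes

# Demonstrate on our own process: opening a socket adds one entry.
before = open_socket_inodes(os.getpid())
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
after = open_socket_inodes(os.getpid())
print(len(after - before))  # -> 1
s.close()
```

Running the same check against the application's pid (or simply `lsof -p <pid>`) would show whether it leaves sockets open after MPI communication is done.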

Rajeev 

> -----Original Message-----
> From: Yusong Wang [mailto:ywang25 at aps.anl.gov] 
> Sent: Friday, May 12, 2006 11:09 AM
> To: Yusong Wang
> Cc: Rajeev Thakur; mpich-discuss at mcs.anl.gov
> Subject: RE: [MPICH] error information----tcp port TIME_WAIT?
> 
> I am experiencing a more serious problem with another case
> from my regression test suite; even MPICH-GM can't survive
> this time.
> 
> In the previous test (script3), I found some TCP ports were
> still alive (TIME_WAIT) after the program quit normally.
> This could explain why I failed to run the same code several
> times in a row without a break.
> 
> I heard Linux keeps TCP sockets in TIME_WAIT for about one
> minute. For this new test, I wonder: if the application
> program opens too many ports, will it crash? In what
> situations could the MPI program leave ports unclosed? (I
> tested with cpi and didn't find any ports left over.) How
> can I improve my program to avoid this problem? Or is there
> a parameter I can control?
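For ports the application binds itself, the usual knob at the sockets-API level (this is generic BSD-sockets behavior, not an MPICH parameter) is SO_REUSEADDR, which lets a new listener bind a port whose previous connection is still sitting in TIME_WAIT. A minimal sketch:

```python
import socket

def make_listener(port):
    """Listening socket that can rebind a port still in TIME_WAIT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without SO_REUSEADDR, bind() raises EADDRINUSE while an earlier
    # connection on the same port waits out TIME_WAIT (~60 s on Linux).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.listen(5)
    return s

srv = make_listener(0)   # port 0 asks the kernel for any free port
print(srv.getsockname()[1] > 0)  # -> True
srv.close()
```

Note this only helps with sockets your own code (or a library you control) creates; it does not change how the MPI launcher manages its ports.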
> 
> I attached some information generated with netstat command 
> before and after running my program.
> 
> Thanks for your advice!
> 
> Yusong 
> 
> --------------------------------------- before running my code -------------------------------------------
> [ywang at j73 dscatter3]$ netstat
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
> Active UNIX domain sockets (w/o servers)
> --------------------------------------- after running my code -------------------------------------------
> [ywang at j73 dscatter3]$ netstat
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> tcp        0      0 j73.lcrc.anl.gov:34696  j73.lcrc.anl.gov:34699  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:34700  j73.lcrc.anl.gov:34698  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:34702  j75.lcrc.anl.gov:34979  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:1015   j117.lcrc.anl.gov:1023  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:1013   j117.lcrc.anl.gov:1022  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:34703  j117.lcrc.anl.gov:34925 TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:1019   j73.lcrc.anl.gov:1010   TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:1007   j73.lcrc.anl.gov:1008   TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:shell  j73.lcrc.anl.gov:1009   TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:1022   j73.lcrc.anl.gov:1021   TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:34701  j74.lcrc.anl.gov:35070  TIME_WAIT
> tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
> tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
> tcp        0   1536 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
> 
> 
> ----- Original Message -----
> From: Yusong Wang <ywang25 at aps.anl.gov>
> Date: Wednesday, May 10, 2006 11:13 pm
> Subject: Re: RE: [MPICH] error information
> 
> > This is interesting! 
> > After I switched to MPICH-GM, the problem was gone; I can
> > repeat the tests 30 times without hitting it, so MPICH-GM
> > seems much more stable than MPICH-P4. However, our cluster
> > consists of 40 AMD SMP processors and 60 Intel SMP
> > processors, managed by Sun Grid Engine, so MPICH-P4 seems to
> > be the only version we can use if we can't get MPICH2 to work
> > on our system. This could also be the situation for the users
> > of our code.
> > 
> > Thanks for your help.
> > 
> > Yusong 
> > 
> > ----- Original Message -----
> > From: Rajeev Thakur <thakur at mcs.anl.gov>
> > Date: Wednesday, May 10, 2006 8:22 pm
> > Subject: RE: [MPICH] error information
> > 
> > > You should be able to use MPICH-GM on jazz with the gcc compiler. 
> > > You might
> > > need to specify the right field in your .soft environment. See
> > > http://www.lcrc.anl.gov/faq/cache/54.html for example.
> > > 
> > > Rajeev
> > > 
> > > > -----Original Message-----
> > > > From: owner-mpich-discuss at mcs.anl.gov 
> > > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of 
> Yusong Wang
> > > > Sent: Wednesday, May 10, 2006 6:42 PM
> > > > To: Rusty Lusk
> > > > Cc: mpich-discuss at mcs.anl.gov
> > > > Subject: Re: [MPICH] error information
> > > > 
> > > > I may need to wait a few days before I can run it under
> > > > MPICH2. I was able to run the program from the command
> > > > line in an MPICH2 environment on our cluster. Our system
> > > > administrator was trying to integrate MPICH2 with Sun Grid
> > > > Engine, but got stuck on the use of smpd, so right now I
> > > > can't run the program with MPICH2 during the update. It
> > > > also seems there is no gcc-based MPICH2 available on Jazz,
> > > > and our code can only be compiled with the gcc compiler.
> > > > 
> > > > The problem comes from a regression test of 100 cases. If I
> > > > run them one by one (with some break time between each run),
> > > > I don't see this problem. It seems some operations have not
> > > > completed even though the previous run quit normally.
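One way to automate that break between regression cases is to poll until leftover sockets drain before launching the next run. The helpers below are a hypothetical sketch, and Linux-specific, since they parse /proc/net/tcp (where state code 06 means TIME_WAIT):

```python
import time

def count_time_wait():
    """Count local TCP sockets in TIME_WAIT by parsing /proc/net/tcp."""
    n = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the column-header line
        for line in f:
            fields = line.split()
            # fields[3] is the connection state; "06" == TIME_WAIT
            if len(fields) > 3 and fields[3] == "06":
                n += 1
    return n

def wait_for_drain(timeout=120.0, poll=5.0):
    """Block until no sockets remain in TIME_WAIT, or the timeout passes."""
    deadline = time.time() + timeout
    while count_time_wait() > 0 and time.time() < deadline:
        time.sleep(poll)
```

Calling `wait_for_drain()` between test cases would reproduce the effect of the manual break; the 120-second default simply covers two of Linux's roughly 60-second TIME_WAIT periods.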
> > > > 
> > > > Thanks,
> > > > 
> > > > Yusong
> > > > 
> > > > ----- Original Message -----
> > > > From: Rusty Lusk <lusk at mcs.anl.gov>
> > > > Date: Wednesday, May 10, 2006 4:34 pm
> > > > Subject: Re: [MPICH] error information
> > > > 
> > > > > You are using a very old version of MPICH.  Can you 
> use MPICH2?
> > > > > It might give you better information on termination.
> > > > > 
> > > > > Regards,
> > > > > Rusty Lusk
> > > > > 
> > > > > From: Yusong Wang <ywang25 at aps.anl.gov>
> > > > > Subject: [MPICH] error information
> > > > > Date: Wed, 10 May 2006 16:27:13 -0500
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > I repeated the same test several times on Jazz. Most of
> > > > > > the time it works fine; occasionally (1 out of 5 runs),
> > > > > > I got the following errors:
> > > > > > 
> > > > > > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600 Broken pipe             /home/ywang/oag/apps/bin/linux-x86/Pelegant "run.ele" -p4pg /home/ywang/elegantRuns/script3/PI24473 -p4wd /home/ywang/elegantRuns/script3
> > > > > >     p4_error: latest msg from perror: Bad file descriptor
> > > > > > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > > > > > rm_l_2_16806:  p4_error: net_send write: -1
> > > > > > Broken pipe
> > > > > > length of beamline PAR per pass: 3.066670000001400e+01 m
> > > > > > statistics:    ET:     00:00:01 CP:    0.09 BIO:0 DIO:0 PF:0 MEM:0
> > > > > > p3_15201:  p4_error: net_recv read:  probable EOF on socket: 1
> > > > > > Broken pipe
> > > > > > 
> > > > > > I can't find the cause of this problem. The same thing
> > > > > > happened on another cluster. The TotalView debugger
> > > > > > didn't give me much useful information; the surviving
> > > > > > processes were just stuck at an MPI_Barrier call.
> > > > > > 
> > > > > > Can someone give me a hint on how to fix the problem,
> > > > > > based on the error information above?
> > > > > > 
> > > > > > The working directory is:
> > > > > >  /home/ywang/elegantRuns/script3/
> > > > > > The command I used:
> > > > > > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > > > > > 
> > > > > > Thanks in advance,
> > > > > > 
> > > > > > Yusong Wang
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 




More information about the mpich-discuss mailing list