[MPICH] error information----tcp port TIME_WAIT?

Yusong Wang ywang25 at aps.anl.gov
Fri May 12 11:09:10 CDT 2006


I am experiencing a more serious problem with another case from my regression test suite.
Even MPICH-GM can't survive this time.  

In the previous test (script3), I found that some TCP ports were still in the TIME_WAIT state after the program had quit normally. This could explain why I could not run the same code several times in a row without a break.

I heard that Linux has a 1-minute timeout on TCP sockets. For this new test, I wonder: if the application opens too many ports, will it crash? In what situations could an MPI program leave ports unclosed? (I tested with cpi and did not find any ports left over.) How can I improve my program to avoid this problem? Or is there a parameter I can control?
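
As a rough illustration of the "too many ports" concern: on Linux, outgoing connections draw their local port from the ephemeral range listed in /proc/sys/net/ipv4/ip_local_port_range, and a connection in TIME_WAIT keeps its address/port combination unavailable until it expires. The sketch below only computes the size of such a range. The helper names and the sample value "32768 61000" are assumptions made for this example, not values read from Jazz.

```python
def ephemeral_port_count(range_text):
    """Return how many ports lie in a 'low high' range string (inclusive)."""
    low, high = (int(x) for x in range_text.split())
    return high - low + 1


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the range text on a Linux system; return None if the file is absent."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


# Assumed sample value for illustration only; not measured on any cluster.
sample = "32768\t61000"
print(ephemeral_port_count(sample))

# On a Linux machine, also report the locally configured range.
text = read_local_port_range()
if text is not None:
    print("this machine:", ephemeral_port_count(text))
```

Comparing this number with the count of TIME_WAIT entries in a netstat listing gives a feel for how close a run comes to the limit.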

I have attached netstat output taken before and after running my program.

Thanks for your advice!

Yusong 

--------------------------------------- before running my code -------------------------------------------
[ywang at j73 dscatter3]$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
Active UNIX domain sockets (w/o servers)

--------------------------------------- after running my code -------------------------------------------
 [ywang at j73 dscatter3]$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 j73.lcrc.anl.gov:34696  j73.lcrc.anl.gov:34699  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34700  j73.lcrc.anl.gov:34698  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34702  j75.lcrc.anl.gov:34979  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:1015   j117.lcrc.anl.gov:1023  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1013   j117.lcrc.anl.gov:1022  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34703  j117.lcrc.anl.gov:34925 TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1019   j73.lcrc.anl.gov:1010   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1007   j73.lcrc.anl.gov:1008   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:shell  j73.lcrc.anl.gov:1009   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1022   j73.lcrc.anl.gov:1021   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34701  j74.lcrc.anl.gov:35070  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
tcp        0   1536 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
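
The two listings can be compared by tallying connections by state. The following is a minimal sketch, assuming the column layout shown above (the state is the sixth whitespace-separated field of each `tcp` row). The function name `count_tcp_states` is made up for this example, and the sample text is a three-row excerpt of the listing above.

```python
from collections import Counter


def count_tcp_states(netstat_text):
    """Tally TCP connections by state from plain `netstat` text output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows look like: tcp  0  0  local:port  remote:port  STATE
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Three rows excerpted from the "after" listing, plus the header line.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 j73.lcrc.anl.gov:34696  j73.lcrc.anl.gov:34699  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34702  j75.lcrc.anl.gov:34979  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
"""

counts = count_tcp_states(sample)
print(counts["TIME_WAIT"], counts["ESTABLISHED"])
```

Applied to the full listings above, the same tally gives 12 ESTABLISHED and no TIME_WAIT entries before the run, and 12 ESTABLISHED plus 11 TIME_WAIT entries after it.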


----- Original Message -----
From: Yusong Wang <ywang25 at aps.anl.gov>
Date: Wednesday, May 10, 2006 11:13 pm
Subject: Re: RE: [MPICH] error information

> This is interesting! 
> After I switched to MPICH-GM, the problem went away. I can 
> repeat the tests 30 times without the problem. It seems MPICH-GM is 
> much more stable than MPICH_P4. However, our cluster consists of 
> 40 AMD SMP processors and 60 Intel SMP processors, with Sun Grid 
> Engine. MPICH_P4 seems to be the only version we can use if we 
> can't get MPICH2 to work on our system. This could also be the 
> situation for the users of our code.
> 
> Thanks for your help.
> 
> Yusong 
> 
> ----- Original Message -----
> From: Rajeev Thakur <thakur at mcs.anl.gov>
> Date: Wednesday, May 10, 2006 8:22 pm
> Subject: RE: [MPICH] error information
> 
> > You should be able to use MPICH-GM on jazz with the gcc compiler.
> > You might need to specify the right field in your .soft environment.
> > See http://www.lcrc.anl.gov/faq/cache/54.html for example.
> > 
> > Rajeev
> > 
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov 
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Yusong Wang
> > > Sent: Wednesday, May 10, 2006 6:42 PM
> > > To: Rusty Lusk
> > > Cc: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [MPICH] error information
> > > 
> > > > I may need to wait a few days before I can run it under MPICH2. I 
> > > > was able to run the program from the command line under the MPICH2 
> > > > environment on our cluster. Our system administrator was 
> > > > trying to integrate MPICH2 with Sun Grid Engine, but got stuck on 
> > > > the use of smpd. Right now, I can't run the program with 
> > > > MPICH2 during the update. It seems to me there is no gcc-based 
> > > > MPICH2 available on Jazz, and our code can only be 
> > > > compiled with the gcc compiler.
> > > 
> > > > The problem comes from a regression test of 100 cases. If I 
> > > > run them one by one (with some break time between runs), 
> > > > I would not expect this problem. It seems to me that some 
> > > > operations have not completed even though the previous run quit 
> > > > normally.
> > > 
> > > Thanks,
> > > 
> > > Yusong
> > > 
> > > ----- Original Message -----
> > > From: Rusty Lusk <lusk at mcs.anl.gov>
> > > Date: Wednesday, May 10, 2006 4:34 pm
> > > Subject: Re: [MPICH] error information
> > > 
> > > > You are using a very old version of MPICH.  Can you use MPICH2?
> > > > It might give you better information on termination.
> > > > 
> > > > Regards,
> > > > Rusty Lusk
> > > > 
> > > > From: Yusong Wang <ywang25 at aps.anl.gov>
> > > > Subject: [MPICH] error information
> > > > Date: Wed, 10 May 2006 16:27:13 -0500
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I repeated the same test several times on Jazz. Most of the time it
> > > > > works fine; occasionally (1 out of 5 runs), I get the following errors:
> > > > > 
> > > > > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600
> > > > > Broken pipe             /home/ywang/oag/apps/bin/linux-x86/Pelegant
> > > > > "run.ele" -p4pg /home/ywang/elegantRuns/script3/PI24473
> > > > > -p4wd /home/ywang/elegantRuns/script3
> > > > >     p4_error: latest msg from perror: Bad file descriptor
> > > > > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > > > > rm_l_2_16806:  p4_error: net_send write: -1
> > > > > Broken pipe
> > > > > length of beamline PAR per pass: 3.066670000001400e+01 m
> > > > > statistics:    ET:     00:00:01 CP:    0.09 BIO:0 DIO:0 PF:0 MEM:0
> > > > > p3_15201:  p4_error: net_recv read:  probable EOF on socket: 1
> > > > > Broken pipe
> > > > > 
> > > > > I can't find the reason for this problem. The same thing happened on
> > > > > another cluster. The TotalView debugger didn't give me much useful
> > > > > information. The surviving processes were just stuck at an
> > > > > MPI_Barrier call.
> > > > > 
> > > > > Can someone give me a hint on how to fix the problem, based on the
> > > > > error information given above?
> > > > > 
> > > > > The working directory is:
> > > > >  /home/ywang/elegantRuns/script3/
> > > > > The command I used:
> > > > > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > > > > 
> > > > > Thanks in advance,
> > > > > 
> > > > > Yusong Wang
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 



