[MPICH] error information----tcp port TIME_WAIT?
Yusong Wang
ywang25 at aps.anl.gov
Fri May 12 11:09:10 CDT 2006
I am experiencing a more serious problem with another case from my regression test suite.
Even MPICH-GM can't survive this time.
In the previous test (script3), I found that some TCP ports were still in the TIME_WAIT state after the program had quit normally. This could explain why I could not run the same code several times in a row without a break.
I heard that Linux has a 1 minute timeout on TCP sockets. For this new test, I wonder whether the application will crash if it opens too many ports. In what situations could an MPI program leave ports unclosed? (I tested with cpi and didn't find any ports left over.) How can I improve my program to avoid this problem, or is there a parameter I can control?
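As a point of reference for the TIME_WAIT question, here is a minimal Python sketch (not MPICH code) of how such entries come about: the endpoint that closes a TCP connection first stays in TIME_WAIT for a while afterwards, which is normal TCP behaviour and not an unclosed port. The sketch also uses SO_REUSEADDR, a standard socket option that lets a listener bind to a port again while old connections on it are still in TIME_WAIT. The helper names listen_on and rebind_after_active_close are made up for illustration, and whether the p4 device sets this option is an open question here, not something the sketch establishes.

```python
import socket


def listen_on(port: int) -> socket.socket:
    """Open a loopback listener with SO_REUSEADDR set before bind()."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set on every listener so that old TIME_WAIT entries do not block bind().
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    return srv


def rebind_after_active_close() -> bool:
    """Return True if the port can be reused right after an active close."""
    srv = listen_on(0)                 # port 0: let the kernel pick a free port
    port = srv.getsockname()[1]

    cli = socket.create_connection(("127.0.0.1", port), timeout=5)
    conn, _ = srv.accept()

    conn.close()   # server side closes first, so it is the active closer
    cli.recv(1)    # returns b"" once the server's FIN has arrived
    cli.close()    # passive close; the server-side endpoint enters TIME_WAIT
    srv.close()

    try:
        listen_on(port).close()        # same port, immediately afterwards
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print("rebind succeeded:", rebind_after_active_close())
```

Running netstat right after this script should show one TIME_WAIT entry for the chosen loopback port, similar to the entries in the listing attached to this message.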
I have attached netstat output captured before and after running my program.
Thanks for your advice!
Yusong
--------------------------------------- before running my code -------------------------------------------
[ywang at j73 dscatter3]$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
Active UNIX domain sockets (w/o servers)
--------------------------------------- after running my code -------------------------------------------
[ywang at j73 dscatter3]$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 j73.lcrc.anl.gov:34696  j73.lcrc.anl.gov:34699  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34700  j73.lcrc.anl.gov:34698  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34702  j75.lcrc.anl.gov:34979  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:33049  jpvfs1.lcrc.anl.go:3000 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33600  jpvfs5.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33602  jpvfs7.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33596  jpvfs1.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33598  jpvfs3.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:1015   j117.lcrc.anl.gov:1023  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1013   j117.lcrc.anl.gov:1022  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34087  jmayor6.lcrc.anl.g:5140 ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34703  j117.lcrc.anl.gov:34925 TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1019   j73.lcrc.anl.gov:1010   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1007   j73.lcrc.anl.gov:1008   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:shell  j73.lcrc.anl.gov:1009   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:1022   j73.lcrc.anl.gov:1021   TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:34701  j74.lcrc.anl.gov:35070  TIME_WAIT
tcp        0      0 j73.lcrc.anl.gov:33601  jpvfs6.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33603  jpvfs8.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33597  jpvfs2.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:33599  jpvfs4.:afs3-fileserver ESTABLISHED
tcp        0      0 j73.lcrc.anl.gov:34644  jlogin2.lcrc.anl.:58423 ESTABLISHED
tcp        0   1536 j73.lcrc.anl.gov:ssh    jlogin2.lcrc.anl.:58500 ESTABLISHED
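
The before/after comparison can also be automated. Below is a minimal Python sketch, assuming the usual six-column netstat layout shown in the listings (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The names count_tcp_states and SAMPLE are illustrative only, and SAMPLE holds a few rows copied from the output in this message.

```python
from collections import Counter


def count_tcp_states(netstat_text: str) -> Counter:
    """Count TCP connections per state in plain `netstat` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows look like: tcp 0 0 local:port remote:port STATE
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# A few rows copied from the "after" listing, plus the header lines.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 j73.lcrc.anl.gov:34696 j73.lcrc.anl.gov:34699 TIME_WAIT
tcp 0 0 j73.lcrc.anl.gov:34700 j73.lcrc.anl.gov:34698 TIME_WAIT
tcp 0 0 j73.lcrc.anl.gov:34702 j75.lcrc.anl.gov:34979 TIME_WAIT
tcp 0 0 j73.lcrc.anl.gov:33049 jpvfs1.lcrc.anl.go:3000 ESTABLISHED
tcp 0 1536 j73.lcrc.anl.gov:ssh jlogin2.lcrc.anl.:58500 ESTABLISHED
"""

if __name__ == "__main__":
    for state, n in sorted(count_tcp_states(SAMPLE).items()):
        print(state, n)
```

A regression driver could call such a function between test cases and delay the next run until the TIME_WAIT count has dropped. Whether that delay actually prevents the failures is untested here.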
----- Original Message -----
From: Yusong Wang <ywang25 at aps.anl.gov>
Date: Wednesday, May 10, 2006 11:13 pm
Subject: Re: RE: [MPICH] error information
> This is interesting!
> After I switched to MPICH-GM, the problem was gone. I can
> repeat the tests 30 times without the problem. It seems MPICH-GM is
> much more stable than MPICH_P4. However, our cluster consists of
> 40 AMD SMP processors and 60 Intel SMP processors, with Sun Grid
> Engine. MPICH_P4 seems to be the only version we can use there if we
> can't get MPICH2 to work on our system. This could also be the
> situation for the users of our code.
>
> Thanks for your help.
>
> Yusong
>
> ----- Original Message -----
> From: Rajeev Thakur <thakur at mcs.anl.gov>
> Date: Wednesday, May 10, 2006 8:22 pm
> Subject: RE: [MPICH] error information
>
> > You should be able to use MPICH-GM on jazz with the gcc compiler.
> > You might need to specify the right field in your .soft environment.
> > See http://www.lcrc.anl.gov/faq/cache/54.html for example.
> >
> > Rajeev
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Yusong Wang
> > > Sent: Wednesday, May 10, 2006 6:42 PM
> > > To: Rusty Lusk
> > > Cc: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [MPICH] error information
> > >
> > > > I may need to wait a few days before I can run it under MPICH2. I
> > > > was able to run the program from the command line under the MPICH2
> > > > environment on our cluster. Our system administrator has been
> > > > trying to integrate MPICH2 with Sun Grid Engine, but got stuck on
> > > > the use of smpd. Right now, I can't run the program with
> > > > MPICH2 during the update. It seems to me that there is no gcc-based
> > > > MPICH2 available on Jazz, and our code can only be
> > > > compiled with the gcc compiler.
> > >
> > > > The problem comes from a regression test of 100 cases. If I
> > > > ran them one by one (with a break between runs),
> > > > I would not expect this problem. It seems to me that some
> > > > operations have not completed even though the previous run quit
> > > > normally.
> > >
> > > Thanks,
> > >
> > > Yusong
> > >
> > > ----- Original Message -----
> > > From: Rusty Lusk <lusk at mcs.anl.gov>
> > > Date: Wednesday, May 10, 2006 4:34 pm
> > > Subject: Re: [MPICH] error information
> > >
> > > > You are using a very old version of MPICH. Can you use MPICH2?
> > > > It might give you better information on termination.
> > > >
> > > > Regards,
> > > > Rusty Lusk
> > > >
> > > > From: Yusong Wang <ywang25 at aps.anl.gov>
> > > > Subject: [MPICH] error information
> > > > Date: Wed, 10 May 2006 16:27:13 -0500
> > > >
> > > > > Hi,
> > > > >
> > > > > > I repeated the same test several times on Jazz. Most of the time it
> > > > > > works fine, but occasionally (1 out of 5 runs) I get the following errors:
> > > > >
> > > > > > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600 Broken pipe /home/ywang/oag/apps/bin/linux-x86/Pelegant "run.ele" -p4pg /home/ywang/elegantRuns/script3/PI24473 -p4wd /home/ywang/elegantRuns/script3
> > > > > > p4_error: latest msg from perror: Bad file descriptor
> > > > > > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > > > > > rm_l_2_16806: p4_error: net_send write: -1
> > > > > > Broken pipe
> > > > > > length of beamline PAR per pass: 3.066670000001400e+01 m
> > > > > > statistics: ET: 00:00:01 CP: 0.09 BIO:0 DIO:0 PF:0 MEM:0
> > > > > > p3_15201: p4_error: net_recv read: probable EOF on socket: 1
> > > > > > Broken pipe
> > > > >
> > > > > > I can't find the cause of this problem. The same thing happened on
> > > > > > another cluster. The TotalView debugger didn't give me much useful
> > > > > > information. The surviving processes were just stuck at an
> > > > > > MPI_Barrier call.
> > > > >
> > > > > > Can someone give me a hint on how to fix the problem, based on the
> > > > > > error information given above?
> > > > >
> > > > > The working directory is:
> > > > > /home/ywang/elegantRuns/script3/
> > > > > The command I used:
> > > > > > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Yusong Wang
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>