[MPICH] error information

Yusong Wang ywang25 at aps.anl.gov
Wed May 10 23:13:07 CDT 2006


This is interesting! 
After I switched to MPICH-GM, the problem went away: I can repeat the tests 30 times without seeing it. MPICH-GM seems to be much more stable than MPICH_P4. However, our own cluster consists of 40 AMD SMP processors and 60 Intel SMP processors with Sun Grid Engine, and MPICH_P4 seems to be the only version we can use there if we can't get MPICH2 to work on our system. This could also be the situation for users of our code.
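
For anyone who wants to reproduce this kind of repeated-run check, here is a minimal sketch in Python. It is not the script used for the tests above: the function name repeat_run and its parameters (runs, pause_s, runner) are made up for illustration, and the command to run is supplied by the caller. It simply runs the same command several times, optionally pausing between runs, and reports which runs exited with a non-zero status.

```python
import subprocess
import time


def repeat_run(cmd, runs=30, pause_s=0.0, runner=subprocess.run):
    """Run `cmd` `runs` times and return the indices of the runs that failed.

    `cmd` is an argument list, e.g. ["mpirun", "-np", "4", ...].
    `runner` is any callable that takes `cmd` and returns an object with a
    `returncode` attribute; it defaults to subprocess.run and exists only so
    the loop can be exercised without launching a real job (an assumption of
    this sketch, not part of any MPICH tool).
    """
    failures = []
    for i in range(runs):
        result = runner(cmd)
        if result.returncode != 0:
            # Record the run index so an intermittent failure rate can be estimated.
            failures.append(i)
        if pause_s > 0 and i < runs - 1:
            # Optional break between runs, skipped after the last one.
            time.sleep(pause_s)
    return failures
```

Calling repeat_run(cmd, runs=30) with the mpirun argument list for your site returns an empty list when every run exits cleanly. Note that it only looks at the exit status, so a run that hangs (for example at a barrier) would block the loop rather than be counted as a failure.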

Thanks for your help.

Yusong 

----- Original Message -----
From: Rajeev Thakur <thakur at mcs.anl.gov>
Date: Wednesday, May 10, 2006 8:22 pm
Subject: RE: [MPICH] error information

> You should be able to use MPICH-GM on jazz with the gcc compiler. You might
> need to specify the right field in your .soft environment. See
> http://www.lcrc.anl.gov/faq/cache/54.html for example.
> 
> Rajeev
> 
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Yusong Wang
> > Sent: Wednesday, May 10, 2006 6:42 PM
> > To: Rusty Lusk
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [MPICH] error information
> > 
> > I may need to wait a few days before I can run it under MPICH2. I 
> > was able to run the program from the command line in the MPICH2 
> > environment on our cluster. Our system administrator has been 
> > trying to integrate MPICH2 with Sun Grid Engine, but is stuck on 
> > the use of smpd. Right now, I can't run the program with 
> > MPICH2 while the update is in progress. It seems to me there is no 
> > gcc-based MPICH2 available on Jazz, and our code can only be 
> > compiled with the gcc compiler.
> > 
> > The problem comes from a regression test of 100 cases. If I 
> > ran them one by one (with some break time between runs), 
> > I would not expect this problem. It seems to me that some 
> > operations have not completed even though the previous run 
> > exited normally.
> > 
> > Thanks,
> > 
> > Yusong
> > 
> > ----- Original Message -----
> > From: Rusty Lusk <lusk at mcs.anl.gov>
> > Date: Wednesday, May 10, 2006 4:34 pm
> > Subject: Re: [MPICH] error information
> > 
> > > You are using a very old version of MPICH.  Can you use MPICH2?
> > > It might give you better information on termination.
> > > 
> > > Regards,
> > > Rusty Lusk
> > > 
> > > From: Yusong Wang <ywang25 at aps.anl.gov>
> > > Subject: [MPICH] error information
> > > Date: Wed, 10 May 2006 16:27:13 -0500
> > > 
> > > > Hi,
> > > > 
> > > > I repeated the same test several times on Jazz. Most of the time it
> > > > works fine, but occasionally (about 1 out of 5 runs) I get the
> > > > following errors:
> > > > 
> > > > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600
> > > > Broken pipe             /home/ywang/oag/apps/bin/linux-x86/Pelegant
> > > > "run.ele" -p4pg /home/ywang/elegantRuns/script3/PI24473 -p4wd /home/ywang/elegantRuns/script3
> > > >     p4_error: latest msg from perror: Bad file descriptor
> > > > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > > > rm_l_2_16806:  p4_error: net_send write: -1
> > > > Broken pipe
> > > > length of beamline PAR per pass: 3.066670000001400e+01 m
> > > > statistics:    ET:     00:00:01 CP:    0.09 BIO:0 DIO:0 PF:0 MEM:0
> > > > p3_15201:  p4_error: net_recv read:  probable EOF on socket: 1
> > > > Broken pipe
> > > > 
> > > > I can't find the cause of this problem. The same thing happened on
> > > > another cluster. The TotalView debugger didn't give me much useful
> > > > information. The surviving processes are just stuck at an
> > > > MPI_Barrier call.
> > > > 
> > > > Can someone give me a hint on how to fix the problem, based on the
> > > > error information given above?
> > > > 
> > > > The working directory is:
> > > >  /home/ywang/elegantRuns/script3/
> > > > The command I used:
> > > > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > > > 
> > > > Thanks in advance,
> > > > 
> > > > Yusong Wang
> > > > 