[MPICH] error information

Rajeev Thakur thakur at mcs.anl.gov
Wed May 10 20:22:42 CDT 2006


You should be able to use MPICH-GM on Jazz with the gcc compiler. You
might need to specify the right key in your .soft environment; see
http://www.lcrc.anl.gov/faq/cache/54.html for an example.
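
For instance, your ~/.soft would need a line along these lines (the
key name below is only an illustration; check which MPICH-GM/gcc keys
are actually installed on Jazz for the exact name):

    +mpich-gm-gcc

After editing .soft, run resoft (or log in again) to pick up the change.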

Rajeev

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Yusong Wang
> Sent: Wednesday, May 10, 2006 6:42 PM
> To: Rusty Lusk
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] error information
> 
> I may need to wait a few days before I can run it under MPICH2. I 
> was able to run the program from the command line under MPICH2 on 
> our cluster. Our system administrator has been trying to integrate 
> MPICH2 with Sun Grid Engine, but is stuck on the use of smpd. Right 
> now I can't run the program with MPICH2 while that update is in 
> progress. It also seems that there is no gcc-based MPICH2 available 
> on Jazz, and our code can only be compiled with the gcc compiler.
> 
> The problem comes from a regression test of 100 cases. If I run 
> them one by one, with some break time between runs, the problem 
> does not appear. It seems to me that some operations have not 
> completed even though the previous run exited normally.
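> 
> Just to be clear about how each case ends, the structure is roughly 
> the following (a stripped-down sketch in C, not our actual code); I 
> had assumed the barrier plus MPI_Finalize would be enough for a 
> clean shutdown between cases:
> 
>     #include <mpi.h>
> 
>     int main(int argc, char *argv[])
>     {
>         MPI_Init(&argc, &argv);
> 
>         /* ... one regression case runs here ... */
> 
>         /* make sure every rank has finished its work (and closed
>            its files) before MPI is shut down and the process exits */
>         MPI_Barrier(MPI_COMM_WORLD);
>         MPI_Finalize();
>         return 0;
>     }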
> 
> Thanks,
> 
> Yusong
> 
> ----- Original Message -----
> From: Rusty Lusk <lusk at mcs.anl.gov>
> Date: Wednesday, May 10, 2006 4:34 pm
> Subject: Re: [MPICH] error information
> 
> > You are using a very old version of MPICH.  Can you use MPICH2?
> > It might give you better information on termination.
> > 
> > Regards,
> > Rusty Lusk
> > 
> > From: Yusong Wang <ywang25 at aps.anl.gov>
> > Subject: [MPICH] error information
> > Date: Wed, 10 May 2006 16:27:13 -0500
> > 
> > > Hi,
> > > 
> > > I repeated the same test several times on Jazz. Most of the time
> > > it works fine, but occasionally (about 1 out of 5 runs) I get the
> > > following errors:
> > > 
> > > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600 Broken pipe             /home/ywang/oag/apps/bin/linux-x86/Pelegant "run.ele" -p4pg /home/ywang/elegantRuns/script3/PI24473 -p4wd /home/ywang/elegantRuns/script3
> > >     p4_error: latest msg from perror: Bad file descriptor
> > > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > > rm_l_2_16806:  p4_error: net_send write: -1
> > > Broken pipe
> > > length of beamline PAR per pass: 3.066670000001400e+01 m
> > > statistics:    ET:     00:00:01 CP:    0.09 BIO:0 DIO:0 PF:0 MEM:0
> > > p3_15201:  p4_error: net_recv read:  probable EOF on socket: 1
> > > Broken pipe
> > > 
> > > I can't find the reason for this problem. The same thing happened
> > > on another cluster. The TotalView debugger didn't give me much
> > > useful information; the surviving processes were just stuck at an
> > > MPI_Barrier call.
> > > 
> > > Can someone give me a hint on how to fix the problem, based on the
> > > error information given above?
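> > > 
> > > For what it's worth, here is a stripped-down sketch (in C, not our
> > > actual code) of the situation I think I am seeing: if one process
> > > dies before it reaches the barrier, for example because of the
> > > broken pipe above, the survivors just sit in MPI_Barrier, which
> > > matches what TotalView shows.
> > > 
> > >     #include <stdlib.h>
> > >     #include <mpi.h>
> > > 
> > >     int main(int argc, char *argv[])
> > >     {
> > >         int rank;
> > > 
> > >         MPI_Init(&argc, &argv);
> > >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > 
> > >         /* simulate one rank dying before the barrier, as if its
> > >            socket connection had broken */
> > >         if (rank == 0)
> > >             exit(1);
> > > 
> > >         /* the remaining ranks typically block here forever */
> > >         MPI_Barrier(MPI_COMM_WORLD);
> > > 
> > >         MPI_Finalize();
> > >         return 0;
> > >     }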
> > > 
> > > The working directory is:
> > >  /home/ywang/elegantRuns/script3/
> > > The command I used:
> > > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > > 
> > > Thanks in advance,
> > > 
> > > Yusong Wang
> > > 
> > 
> > 
> 
> 



