[MPICH] error information

Yusong Wang ywang25 at aps.anl.gov
Wed May 10 18:41:54 CDT 2006


I may need to wait a few days before I can run it under MPICH2. I was able to run the program from the command line in an MPICH2 environment on our cluster. Our system administrator has been trying to integrate MPICH2 with Sun Grid Engine but got stuck on the use of smpd, so I can't run the program with MPICH2 during the update. Also, there seems to be no gcc-based MPICH2 build available on Jazz, and our code can only be compiled with the gcc compiler.

The problem comes from a regression test of 100 cases. If I run them one by one (with some break time between each run), the problem does not appear. It seems to me that some cleanup operations have not completed even though the previous run quit normally.
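For reference, here is a minimal sketch of what the driver loop could look like (the loop structure, the run*.ele file names, and the 5-second pause are just placeholders for illustration, not my actual test script); the sleep between mpirun calls is the "break time" mentioned above:

    #!/bin/bash
    # Illustrative driver for the regression cases (file names and the
    # pause length are assumptions, not the real harness).
    for ele in run*.ele; do
        mpirun -np 4 -machinefile $PBS_NODEFILE \
            /home/ywang/oag/apps/bin/linux-x86/Pelegant $ele
        # Pause so the p4 helper processes (rm_l_*) from the previous run
        # can exit and close their sockets before the next case starts.
        sleep 5
    done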

Thanks,

Yusong

----- Original Message -----
From: Rusty Lusk <lusk at mcs.anl.gov>
Date: Wednesday, May 10, 2006 4:34 pm
Subject: Re: [MPICH] error information

> You are using a very old version of MPICH.  Can you use MPICH2?
> It might give you better information on termination.
> 
> Regards,
> Rusty Lusk
> 
> From: Yusong Wang <ywang25 at aps.anl.gov>
> Subject: [MPICH] error information
> Date: Wed, 10 May 2006 16:27:13 -0500
> 
> > Hi,
> > 
> > I repeated the same test several times on Jazz. Most of the time it
> > works fine; occasionally (1 out of 5 runs) I get the following errors:
> > 
> > /soft/apps/packages/mpich-p4-1.2.6-gcc-3.2.3-1/bin/mpirun: line 1: 24600
> > Broken pipe    /home/ywang/oag/apps/bin/linux-x86/Pelegant "run.ele"
> > -p4pg /home/ywang/elegantRuns/script3/PI24473 -p4wd /home/ywang/elegantRuns/script3
> >     p4_error: latest msg from perror: Bad file descriptor
> > rm_l_2_16806: (1.024331) net_send: could not write to fd=6, errno = 9
> > rm_l_2_16806:  p4_error: net_send write: -1
> > Broken pipe
> > length of beamline PAR per pass: 3.066670000001400e+01 m
> > statistics:    ET:     00:00:01 CP:    0.09 BIO:0 DIO:0 PF:0 MEM:0
> > p3_15201:  p4_error: net_recv read:  probable EOF on socket: 1
> > Broken pipe
> > 
> > I can't find the reason for this problem. The same thing happened on
> > another cluster. The TotalView debugger didn't give me much useful
> > information; the surviving processes are just stuck at an MPI_Barrier
> > call.
> > 
> > Can someone give me a hint on how to fix the problem based on the
> > error information given above?
> > 
> > The working directory is:
> >  /home/ywang/elegantRuns/script3/
> > The command I used:
> > mpirun -np 4 -machinefile $PBS_NODEFILE /home/ywang/oag/apps/bin/linux-x86/Pelegant run.ele
> > 
> > Thanks in advance,
> > 
> > Yusong Wang
> > 
> 
> 
