[MPICH] mpirun timeout and killed by signal-2 error for 64 processor option

Anthony Chan chan at mcs.anl.gov
Mon Mar 19 10:47:08 CDT 2007



On Sun, 18 Mar 2007, Bala wrote:

> Thanks for the reply, Rajeev. We are using
> rocks cluster-4.2.1, which comes with mpich2 by default.

The error message in your first email comes from MPICH1's p4 device,
so that run used MPICH1.  If your cluster also has mpich2 installed,
then you have two different MPI implementations.  Using the full path
to mpich2's mpiexec will guarantee that you are using mpich2.
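
A minimal sketch for seeing which launchers a bare "mpirun" or "mpiexec"
would pick up.  list_launchers is a hypothetical helper written for this
note, not something shipped with MPICH1 or mpich2; it only walks PATH.

```shell
# list_launchers is a hypothetical helper, not part of MPICH.
# It prints every executable file called $1 found on PATH, in search
# order; the first line printed is the copy a bare command name runs.
list_launchers() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

list_launchers mpirun
list_launchers mpiexec
```

If the first mpiexec listed is not the one under the mpich2 install
directory, call mpich2's copy by its full path instead.  Where Rocks
puts mpich2 on your cluster is not known here, so check the output.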

>
>  But we are still getting this error. We are using
> HP blade servers BL460C; are there any known issues
> with blades?

We don't have HP hardware.  What OS are you running?  The most likely
scenario is a network problem.  Try mpich2's mpdcheck, as described in
mpich2's installation and user guides, to detect whether you have any
network problems.
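
A rough sketch of running mpdcheck over a list of hosts.  check_hosts
is a hypothetical wrapper and mpd.hosts is an assumed file name (one
hostname per line); neither comes from your setup.  The -f and -ssh
options are the ones described in the troubleshooting appendix of the
mpich2 installation guide, so check them against your mpich2 version.

```shell
# check_hosts is a hypothetical wrapper around mpich2's mpdcheck.
# $1 is a file with one hostname per line (an assumed file, not the
# poster's actual machines file).
check_hosts() {
    hosts_file=$1
    if ! command -v mpdcheck >/dev/null 2>&1; then
        echo "mpdcheck not on PATH; use the copy under your mpich2 install" >&2
        return 1
    fi
    # -f: check that each host listed in the file can be identified;
    # -ssh: additionally try to reach each host over ssh.
    mpdcheck -f "$hosts_file" -ssh
}

HOSTS_FILE=${HOSTS_FILE:-mpd.hosts}
check_hosts "$HOSTS_FILE" || echo "host check did not pass; see the messages above"
```

The same guide also describes a pairwise test: "mpdcheck -s" on one
node prints a host and port, and "mpdcheck -c <host> <port>" on a
second node tests the connection between the two.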

A.Chan

>
> thanks,
> -bala-
>
>
> --- Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
> > Can you try MPICH2 instead of MPICH-1? It is more robust.
> > cpi should run with any number of processes.
> >
> > Rajeev
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Bala
> > > Sent: Saturday, March 17, 2007 8:47 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [MPICH] mpirun timeout and killed by signal error
> > > for 64 processor option
> > >
> > > Hi All,
> > >         we have installed mpich on a 16-node Intel X86_64
> > > cluster (blade servers) with dual CPUs and dual cores.
> > >
> > >   When we run mpirun with the cpi sample and the -np 32
> > > option, it runs fine and gives the output, but after a
> > > while there is a message like the one shown below:
> > >
> > > -----------------------------
> > > pi is approximately 3.1416009869231249, Error is
> > > 0.0000083333333318
> > > wall clock time = 0.003906
> > > Timeout in waiting for processes to exit, 2 left.
> > > This may be due to a defective rsh program (Some
> > > versions of Kerberos rsh have been observed to have
> > > this problem).
> > > This is not a problem with P4 or MPICH but a problem
> > > with the operating environment.  For many applications,
> > > this problem will only slow down process termination.
> > > -----------------------------------
> > >
> > > But when we try to run with -np 64 or above:
> > >
> > > $ mpirun -np 64 -machinefile machines ./cpi
> > >
> > > it fails with a "killed by signal 2" error. On our
> > > other cluster we can run with the -np 64 option.
> > >
> > > Please let us know how to avoid these errors.
> > >
> > > Is cpi too small to run with the -np 64 option?
> > >
> > > thanks in advance,
> > > -bala-



