[MPICH] Re: collective abort of all ranks

Anthony Chan chan at mcs.anl.gov
Fri Jun 15 16:55:07 CDT 2007


Hi Kamaraju,

I wonder if you have made any progress on your problem?

If not, could you tell us a bit more about each node you are running
this experiment on? For example, which OS and kernel does each node use,
and what is the stack size limit on each node?
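
One way to collect this is sketched below. It is only an illustration and assumes Python 3 is available on every node; the names `node_report` and `fmt_limit` are made up for this example and are not part of MPICH.

```python
import platform
import resource
import socket


def fmt_limit(value):
    """Render an rlimit value (reported in bytes) as kB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def node_report():
    """Collect hostname, OS name, kernel release, and stack size limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        "host": socket.gethostname(),
        "os": platform.system(),
        "kernel": platform.release(),
        "stack_soft": fmt_limit(soft),
        "stack_hard": fmt_limit(hard),
    }


if __name__ == "__main__":
    # Print one "key: value" line per item so outputs from different
    # nodes can be compared side by side.
    for key, value in node_report().items():
        print(f"{key}: {value}")
```

If possible, run it on each node the same way the MPI processes get started there (for example through the same remote shell), since the limits seen in an interactive login may differ from those of a remotely launched process.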

A.Chan

On Wed, 13 Jun 2007, Kamaraju Kusumanchi wrote:

> This is what I did in my previous email. I ran 4 processes on each
> node locally and ran the program on each node. In this case the code
> executes without any errors on all the nodes. The problem shows up
> only when I boot 4 processes across the nodes (one on each node).
>
> raju
>
> On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > There could be a problem with versions of shared libraries that the f90
> > compiler uses. Since there are only 4 nodes, can you login to each and try
> > to run 4 processes locally. If one of them fails, that node is suspect.
> >
> > Rajeev
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of
> > > Kamaraju Kusumanchi
> > > Sent: Tuesday, June 12, 2007 1:25 PM
> > > Cc: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [MPICH] Re: collective abort of all ranks
> > >
> > > On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > > > There might be some problem on one of the nodes. Can you try running
> > > > individually on each of the nodes (not across the nodes).
> > > >
> > > > Rajeev
> > > >
> > >
> > > When the program is run individually on each node, it works fine. I
> > > have tried it on all the individual nodes. The problem comes up only
> > > when I run it across the nodes.
> > >
> > > AFAIK, the only difference across the nodes is that their times are
> > > not synchronized.
> > >
> > > node1: Tue Jun 12 14:18:07 EDT 2007
> > > node2: Tue Jun 12 08:49:17 EDT 2007
> > > node3: Tue Jun 12 14:19:49 EDT 2007
> > > node4: Tue Jun 12 13:48:01 EDT 2007
> > >
> > > I asked the administrator to synchronize the clocks. I will report
> > > back here if synchronizing the clocks has any effect on the code's
> > > behavior.
> > >
> > > raju
> > >
> > >
> >
> >
>
>



