[MPICH] Re: collective abort of all ranks

Kamaraju Kusumanchi kamaraju at gmail.com
Wed Jun 13 08:14:17 CDT 2007


That is what I did in my previous email: I logged in to each node and
ran the program there with 4 processes. In that case the code runs
without any errors on every node. The problem shows up only when I
start 4 processes across the nodes (one on each node).
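
To put a number on the clock differences quoted below, here is a minimal sketch that computes each node's offset from node1. It assumes the four `date` readings were taken at roughly the same moment, which may not be exactly true; the names `READINGS`, `FMT` and `offsets_from` are made up for this illustration.

```python
from datetime import datetime

# Readings copied from the `date` output quoted in this thread.
READINGS = {
    "node1": "Tue Jun 12 14:18:07 EDT 2007",
    "node2": "Tue Jun 12 08:49:17 EDT 2007",
    "node3": "Tue Jun 12 14:19:49 EDT 2007",
    "node4": "Tue Jun 12 13:48:01 EDT 2007",
}

# All four readings report EDT, so the zone name is matched as literal text.
FMT = "%a %b %d %H:%M:%S EDT %Y"


def offsets_from(reference, readings):
    """Return each node's clock offset from `reference`, in whole seconds."""
    ref = datetime.strptime(readings[reference], FMT)
    return {
        node: int((datetime.strptime(text, FMT) - ref).total_seconds())
        for node, text in readings.items()
    }


if __name__ == "__main__":
    for node, seconds in offsets_from("node1", READINGS).items():
        print(f"{node}: {seconds:+d} s")
```

Under that assumption, node2 is roughly five and a half hours behind node1 and node4 about half an hour behind. Whether this skew has anything to do with the abort is still unknown.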

raju

On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> There could be a problem with the versions of the shared libraries that the
> f90 compiler uses. Since there are only 4 nodes, can you log in to each one
> and try running 4 processes locally? If one of them fails, that node is suspect.
>
> Rajeev
>
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of
> > Kamaraju Kusumanchi
> > Sent: Tuesday, June 12, 2007 1:25 PM
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [MPICH] Re: collective abort of all ranks
> >
> > On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > > There might be some problem on one of the nodes. Can you try running it
> > > individually on each of the nodes (not across the nodes)?
> > >
> > > Rajeev
> > >
> >
> > When the program is run individually on each node, it works fine. I
> > have tried it on all the individual nodes. The problem comes up only
> > when I run it across the nodes.
> >
> > AFAIK, the only difference between the nodes is that their clocks are
> > not synchronized.
> >
> > node1  Tue Jun 12 14:18:07 EDT 2007
> > node2  Tue Jun 12 08:49:17 EDT 2007
> > node3  Tue Jun 12 14:19:49 EDT 2007
> > node4  Tue Jun 12 13:48:01 EDT 2007
> >
> > I asked the administrator to synchronize the clocks. I will report
> > here whether synchronizing them has any effect on the code's
> > behavior.
> >
> > raju
> >
> >
>
>
