[MPICH] Re: collective abort of all ranks

Rajeev Thakur thakur at mcs.anl.gov
Tue Jun 12 13:51:54 CDT 2007


There could be a problem with the versions of the shared libraries that the
f90 compiler uses. Since there are only 4 nodes, can you log in to each one
and try to run 4 processes locally? If the run fails on one of them, that
node is suspect.

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of 
> Kamaraju Kusumanchi
> Sent: Tuesday, June 12, 2007 1:25 PM
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Re: collective abort of all ranks
> 
> On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > There might be some problem on one of the nodes. Can you try running
> > individually on each of the nodes (not across the nodes)?
> >
> > Rajeev
> >
> 
> When the program is run individually on each node, it works fine. I
> have tried it on all the individual nodes. The problem comes up only
> when I run it across the nodes.
> 
> AFAIK, the only difference across the nodes is that their times are
> not synchronized.
> 
> node1  Tue Jun 12 14:18:07 EDT 2007
> node2  Tue Jun 12 08:49:17 EDT 2007
> node3  Tue Jun 12 14:19:49 EDT 2007
> node4  Tue Jun 12 13:48:01 EDT 2007
> 
> I asked the administrator to synchronize the clocks. I will report
> back here on whether synchronizing them has any effect on the code's
> behavior.
> 
> raju
> 
> 
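To put numbers on the clock differences quoted above, here is a minimal
sketch in Python that computes each node's offset relative to node1 from the
four `date` outputs in the message. The names REPORTED, parse_date_output and
offsets_from are illustrative only and are not part of MPICH. It assumes that
all four readings use the same time zone (they all say EDT) and that they were
taken at roughly the same moment. If the readings were collected one after
another, an offset of a minute or two may just reflect that delay.

```python
from datetime import datetime

# `date` output per node, copied from the quoted message.
REPORTED = {
    "node1": "Tue Jun 12 14:18:07 EDT 2007",
    "node2": "Tue Jun 12 08:49:17 EDT 2007",
    "node3": "Tue Jun 12 14:19:49 EDT 2007",
    "node4": "Tue Jun 12 13:48:01 EDT 2007",
}


def parse_date_output(text):
    """Parse one line of `date` output, ignoring the time zone field.

    Assumption: every node reports the same zone, so dropping the
    zone name does not change the differences between nodes.
    """
    fields = text.split()
    del fields[4]  # remove the zone name, e.g. "EDT"
    return datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")


def offsets_from(reference, reported):
    """Return each node's clock offset in seconds relative to `reference`."""
    ref_time = parse_date_output(reported[reference])
    return {
        node: (parse_date_output(text) - ref_time).total_seconds()
        for node, text in reported.items()
    }


offsets = offsets_from("node1", REPORTED)
for node in sorted(offsets):
    print(f"{node}: {offsets[node]:+.0f} s relative to node1")
```

With the values above, node2 comes out about five and a half hours behind
node1 and node4 about half an hour behind, while node3 is within a couple of
minutes.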



