[MPICH] Re: collective abort of all ranks
Rajeev Thakur
thakur at mcs.anl.gov
Tue Jun 12 13:51:54 CDT 2007
There could be a problem with the versions of the shared libraries that the
f90 compiler uses. Since there are only 4 nodes, can you log in to each one
and try running 4 processes locally? If the run fails on one of them, that
node is suspect.
Rajeev
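
One way to follow up on the shared-library suggestion is to capture the
output of `ldd` on the executable on each node and compare the listings. The
sketch below is not from the thread: the function names are made up for
illustration, and it assumes the usual `name => path (address)` layout of
`ldd` output. It only reports libraries that resolve to different paths (or
are missing) on some node.

```python
def parse_ldd(text):
    """Map each library name to the path it resolved to in one ldd listing."""
    libs = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # entries without an arrow (e.g. the dynamic loader) are skipped
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("not found"):
            path = "not found"
        else:
            fields = rest.split()
            # A bare "(0x...)" load address means no path was printed.
            path = fields[0] if fields and not fields[0].startswith("(") else ""
        libs[name.strip()] = path
    return libs


def differing_libs(ldd_by_node):
    """Return {library: {node: path}} for libraries that differ between nodes.

    ldd_by_node maps a node name to the ldd output captured on that node.
    """
    parsed = {node: parse_ldd(out) for node, out in ldd_by_node.items()}
    names = set().union(*(p.keys() for p in parsed.values()))
    diff = {}
    for name in sorted(names):
        seen = {node: p.get(name, "<missing>") for node, p in parsed.items()}
        if len(set(seen.values())) > 1:
            diff[name] = seen
    return diff
```

An empty result only means the paths agree. The files at those paths could
still be different versions, so comparing checksums of the libraries would be
a reasonable next step.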
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of
> Kamaraju Kusumanchi
> Sent: Tuesday, June 12, 2007 1:25 PM
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Re: collective abort of all ranks
>
> On 6/12/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > There might be some problem on one of the nodes. Can you try running
> > individually on each of the nodes (not across the nodes).
> >
> > Rajeev
> >
>
> When the program is run individually on each node, it works fine. I
> have tried it on every node. The problem comes up only when I run it
> across the nodes.
>
> AFAIK, the only difference across the nodes is that their times are
> not synchronized.
>
> node1  Tue Jun 12 14:18:07 EDT 2007
> node2  Tue Jun 12 08:49:17 EDT 2007
> node3  Tue Jun 12 14:19:49 EDT 2007
> node4  Tue Jun 12 13:48:01 EDT 2007
>
> I asked the administrator to synchronize the clocks. I will report
> back here on whether synchronizing them has any effect on the code's
> behavior.
>
> raju
>
>
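
To put numbers on the clock differences quoted above, the offsets can be
computed directly from the four reported timestamps. This is a small
illustrative sketch, not part of the original exchange. It takes node1 as the
reference, and the helper name `skew_seconds` is made up. The readings were
presumably not taken at exactly the same instant, so the results are only
approximate.

```python
from datetime import datetime

# Timestamps as reported in the quoted message. All are EDT, so the zone
# name is dropped and no zone conversion is needed.
reported = {
    "node1": "Tue Jun 12 14:18:07 2007",
    "node2": "Tue Jun 12 08:49:17 2007",
    "node3": "Tue Jun 12 14:19:49 2007",
    "node4": "Tue Jun 12 13:48:01 2007",
}


def skew_seconds(stamps, reference="node1"):
    """Return each node's clock offset in seconds relative to the reference node."""
    fmt = "%a %b %d %H:%M:%S %Y"
    parsed = {node: datetime.strptime(s, fmt) for node, s in stamps.items()}
    ref = parsed[reference]
    return {node: int((t - ref).total_seconds()) for node, t in parsed.items()}


for node, delta in skew_seconds(reported).items():
    print(f"{node}: {delta:+d} s relative to node1")
```

By this measure node2 is roughly five and a half hours behind node1 and node4
about half an hour behind, while node3 is within a couple of minutes.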