[mpich-discuss] cryptic (to me) error

SULLIVAN David (AREVA) David.Sullivan at areva.com
Tue Sep 7 13:55:16 CDT 2010


That's discouraging. If the network is the source of the problems, then
the MPI errors are the only symptom. I have checked the log files and
found no related network errors there, and the few checks I ran (ping,
traceroute, netstat...) show the network is up and running quite well.
I have transferred Gb's of data between the nodes, so I would have
thought that would show stability. I have several phone calls in to the
real network administrators...
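
Reachability checks such as ping and traceroute say little about errors
that only appear under load, so the per-interface error and drop counters
may be worth comparing before and after a failing run. A minimal sketch,
assuming Linux nodes that expose /proc/net/dev; the function names here
are hypothetical, and nonzero counters are only a hint, not proof of a
fault:

```python
def parse_proc_net_dev(text):
    """Parse the contents of /proc/net/dev into {iface: {counter: int}}.

    Assumes the usual Linux layout: two header lines, then one line per
    interface with 8 receive fields followed by 8 transmit fields.
    """
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout; skip rather than guess
        values = [int(f) for f in fields[:16]]
        counters[name.strip()] = {
            "rx_errs": values[2],   # receive errors
            "rx_drop": values[3],   # receive drops
            "tx_errs": values[10],  # transmit errors
            "tx_drop": values[11],  # transmit drops
        }
    return counters


def nonzero_error_interfaces(counters):
    """Return only the interfaces with at least one nonzero counter."""
    return {iface: c for iface, c in counters.items() if any(c.values())}


def read_counters(path="/proc/net/dev"):
    """Read and parse the live counters on the local node (Linux only)."""
    with open(path) as f:
        return parse_proc_net_dev(f.read())
```

Calling nonzero_error_interfaces(read_counters()) on each node before and
after a test run, and comparing the two results, would show whether any
counter grew during the run.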

Thanks for the perseverance and helpful suggestions-

Dave

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
Sent: Tuesday, September 07, 2010 1:57 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] cryptic (to me) error

If I'm following you correctly, the summary.xml that you attached
supports the theory that your network is broken somehow.  The test suite
is experiencing random network failures in various MPI communication
routines, especially collective ones that are probably stressing the
network and networking stack.  There is no way that we know of to
configure/install MPICH2 such that you would experience this type of
problem, and the test suite is known good MPI code.  The problem is
almost certainly outside of MPICH2.

Check your system logs, the output from "dmesg", and any diagnostics you
have in your network switch(es).  If you don't know how to troubleshoot
networking problems, you should contact your system/network
administrators.
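
As a rough starting point for the "dmesg" check, a small script along
these lines could filter the kernel log for lines that often accompany
link or driver trouble. This is only a sketch: the pattern list is an
assumption and far from exhaustive, the function names are hypothetical,
and the exact wording of kernel messages varies by driver and kernel
version:

```python
import re
import subprocess

# Assumed, non-exhaustive patterns; adjust for the NIC driver in use.
SUSPECT_PATTERNS = [
    r"link is down",
    r"link down",
    r"tx timeout",
    r"transmit timed out",
    r"carrier lost",
]
_SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(log_text):
    """Return the lines of log_text that match any suspect pattern."""
    return [line for line in log_text.splitlines() if _SUSPECT_RE.search(line)]


def read_dmesg():
    """Run dmesg and return its output as text.

    Reading the kernel log may require root privileges on some systems.
    """
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Printing find_suspect_lines(read_dmesg()) on each node gives a short list
to look through; an empty list does not rule out a network problem.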

-Dave

On Sep 7, 2010, at 12:40 PM CDT, SULLIVAN David (AREVA) wrote:

> Had some time to work some more on this...
> I have copied the test suite folder to an NFS-shared folder. The machine
> file is passed by way of HYDRA_HOST_FILE=/home/dfs/shared/mpich2 make
> testing, as suggested. It is still running, but the results so far
> indicate I am still messing up the process somehow.
> 
> Thanks again,
> 
> Dave
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
> Sent: Friday, September 03, 2010 2:56 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Based on what you've sent us in the past, your mpiexec is actually 
> running mcnp.mpi.  The error occurs during an MPI_Comm_dup somewhere 
> in that code.  Running the test suite over the network will help us 
> figure out whether there is a problem with your network.
> 
> -Dave
> 
> On Sep 3, 2010, at 1:50 PM CDT, SULLIVAN David (AREVA) wrote:
> 
>> That makes sense. Since my real problem is that mpiexec doesn't get to
>> starting mcnp.mpi do we need the testing suite to troubleshoot or is
>> there a source of clues elsewhere? I can get an NFS set up and all,
>> but the testing suite isn't my true aim so if we don't need it...
>> 
>> 
>> Dave
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
>> Sent: Friday, September 03, 2010 2:47 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> The general statement is true.  The problem is that "make testing"
>> does not first build all executables and then second mpiexec each
>> executable.
>>
>> Instead it builds each test (on the machine where you invoked "make
>> testing") just before it is executed.  So the built executables only
>> end up on the node where you ran "make testing".
>> 
>> -Dave
>> 
>> On Sep 3, 2010, at 1:43 PM CDT, SULLIVAN David (AREVA) wrote:
>> 
>>> Interesting. So that would be the same for any executable that uses
>>> mpiexec? This is confusing though because the install guide says that
>>> it can be done either as NFS or an exact duplicate. I have set this up
>>> before (as exact duplicates) without issues (with MPICH1 on WinXP) so
>>> I assumed, as it states in the install guide, this has not changed.
>>> Thanks again for the remedial assistance..
>>> Dave
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
>>> Sent: Friday, September 03, 2010 2:03 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> The test suite directory must be on a shared filesystem because 
>>> mpiexec does not stage executables for you.
>>> 
>>> -Dave
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> <summary_3.txt>


