[mpich-discuss]network failure during the execution of parallel program

Jayesh Krishna jayesh at mcs.anl.gov
Thu May 29 09:50:00 CDT 2008


Hi,
 These error messages come from the job launcher/process manager, which uses
sockets for communication. 
 You can modify the process manager code to use pipes (or another IPC
mechanism) instead of sockets (for communication between the local MPI
procs and the job launcher) if you would like *localonly* jobs to be
tolerant of network failures.
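
As a rough illustration of the pipe-based idea (this is not the MPICH2
process manager code, only a minimal sketch in Python), a launcher can
exchange commands with a local child process over anonymous OS pipes, so
no socket and no network interface is involved. The names CHILD_CODE and
run_local_child, and the "ack"/"quit" message format, are hypothetical
and chosen only for this example.

```python
import subprocess
import sys

# Hypothetical child process: reads one command per line from stdin and
# acknowledges it on stdout. It stands in for a local MPI process.
CHILD_CODE = r"""
import sys
for line in sys.stdin:
    cmd = line.strip()
    if cmd == "quit":
        break
    print("ack " + cmd, flush=True)
"""


def run_local_child(commands):
    """Send commands to a local child over pipes and collect its replies."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,   # launcher -> child pipe
        stdout=subprocess.PIPE,  # child -> launcher pipe
        text=True,
    )
    replies = []
    for cmd in commands:
        child.stdin.write(cmd + "\n")
        child.stdin.flush()
        replies.append(child.stdout.readline().strip())
    # Ask the child to exit, then release both pipe ends.
    child.stdin.write("quit\n")
    child.stdin.flush()
    child.stdin.close()
    child.wait()
    child.stdout.close()
    return replies


if __name__ == "__main__":
    print(run_local_child(["init", "barrier"]))
```

Because the two ends are connected by pipes created by the operating
system, unplugging the network cable should not disturb this exchange.
A real process manager would also need to handle child crashes, timeouts
and partial reads, which the sketch leaves out.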
 
(PS: The idea of using "shm" as the channel is to improve performance, not
to get away from using sockets altogether.)
 
Regards,
Jayesh

  _____  

From: Seifer Lin [mailto:seiferlin at gmail.com] 
Sent: Wednesday, May 28, 2008 8:35 PM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss]network failure during the execution of
parallel program


Hi,
 
I ran the following test:
 
D:\test_mpi\release>mpiexec -channel shm -n 4 test_mpich2.exe
iter=0, cpuid=1, ncpu=4
iter=0, cpuid=2, ncpu=4
iter=0, cpuid=3, ncpu=4
iter=0, cpuid=0, ncpu=4
iter=1, cpuid=2, ncpu=4
iter=1, cpuid=1, ncpu=4
iter=1, cpuid=0, ncpu=4
iter=1, cpuid=3, ncpu=4
iter=2, cpuid=2, ncpu=4
iter=2, cpuid=3, ncpu=4
iter=2, cpuid=1, ncpu=4
iter=2, cpuid=0, ncpu=4
op_read error on left context: generic socket failure, error stack:
MPIDU_Sock_wait(2533): The specified network name is no longer available. (errno 64)
unable to read the cmd header on the left context, generic socket failure, error stack:
MPIDU_Sock_wait(2533): The specified network name is no longer available. (errno 64).
 
I unplugged the network cable while iter=1 was being displayed.
 
Thank you very much.
 
 
 
2008/5/28 Jayesh Krishna <jayesh at mcs.anl.gov>:


 Hi,
  Specifying "shm" as the channel ensures that all MPI communication (between
the MPI processes) is done using shared memory. The error messages that
you see could be from the process launcher or the process manager.
  Do you really need to use the "-localonly" option? (With that option
you might end up seeing some error messages which are handled within the
library and do not affect the MPI job.) You can run your job as "mpiexec
-channel shm -n 4 myapp.exe". Let us know if you still see the error
messages (if so, please copy and paste the error messages into your email).

Regards,
Jayesh 



-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Seifer Lin
Sent: Wednesday, May 28, 2008 2:32 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss]network failure during the execution of parallel
program

Hi all:

I am testing a parallel program on a single machine with 4 processes.
The program only prints ncpu and cpuid every 5 seconds.
I run it with   mpiexec -localonly 4 myapp.exe
During the execution, I unplug the network cable, and the program shows
some error messages such as "generic socket failure".

If I use mpiexec -channel shm -n 4 myapp.exe and again unplug the network
cable, the same error messages are shown.
If I run the program again after the network has been unplugged, it doesn't
show any error messages.

It seems that mpiexec detects the network status at runtime even
when the shm channel is selected.

My question is: -channel shm means shared memory, so shouldn't a change in
network state have no effect on a program that uses shared memory?

I am really confused.

thanks,

Seifer Lin








