[MPICH] Error handling issue

Darius Buntinas buntinas at mcs.anl.gov
Sun Nov 11 08:32:50 CST 2007


This may be a problem with the process manager (SMPD).  When one process 
dies without calling MPI_Finalize (e.g., because it was killed), then 
the process manager detects this as an error and kills the whole job.

That doesn't explain why you're getting the error message though.  It 
looks like you're setting the errhandler correctly.

-d

On 11/11/2007 06:36 AM, AGPX wrote:
> Hi,
> 
> I wrote the following code to keep my main process from aborting on 
> an MPI error:
> 
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &MPIId);
> MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
> but when I terminate one of the job's processes on another machine 
> (pcamd3000 is the main machine, pcamd2600 the other; both run Windows 
> XP Pro), the main process aborts. Here is the error message:
> 
> job aborted:
> rank: node: exit code[: error message]
> 0: pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173).............................: MPI_Send(buf=00B458B0, count=1, MPI_INT, dest=1, tag=0, comm=0x84000000) failed
> MPIDI_CH3I_Progress(148)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(497):
> MPIDU_Sock_wait(2603).....................: Il nome di rete specificato non è più disponibile. (errno 64)
> 1: pcamd2600: 1: process 1 exited without calling finalize
> 2: pcamd2600: 1
> 
> (Note that the message 'Il nome di rete specificato non è più 
> disponibile.' translates to English as 'The specified network name is 
> no longer available.')
> 
> What am I missing? I have more than one communicator, but I have also 
> used MPI_Comm_set_errhandler to set their error handlers to 
> MPI_ERRORS_RETURN. The code is:
> 
> ...
> MPI_Group_incl(worldGroup, nRanks, ranks, &handle.group);
> MPI_Comm_create(MPI_COMM_WORLD, handle.group, &handle.comm);
> MPI_Comm_set_errhandler(handle.comm, MPI_ERRORS_RETURN);
> ...
> 
> I have also tried MPI_Errhandler_set, but this doesn't help:
> 
> MPI_Errhandler_set(..., MPI_ERRORS_RETURN);
> 
> Any suggestions?
> 
> Thanks,
> 
> - AGPX
> 
> 
> 



