[MPICH] Error handling issue

Jayesh Krishna jayesh at mcs.anl.gov
Wed Nov 14 10:01:59 CST 2007


Hi,
 Killing an MPI process causes the process manager to abort all the MPI
processes associated with the current job. This is not an MPI error (the
error handler associated with a communicator can only handle MPI errors).
The error message is then printed by the process manager (SMPD in the case
of Windows).
 I believe what you need is an MPI library implementation where you could
kill one of the MPI processes and still have the MPI job running (the
remaining MPI processes running). You could modify the source code of SMPD
to do that. 
 
(PS: Strictly speaking this level of fault tolerance should be handled, if
possible, at the process manager/library level -- not in the user code.)
 
Regards,
Jayesh

  _____  

From: AGPX [mailto:agpxnet at yahoo.it] 
Sent: Wednesday, November 14, 2007 7:39 AM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [MPICH] Error handling issue


Hi,

I abort the process by killing it from Task Manager. Basically, my
application (on the so-called 'main machine', ID = 0) distributes its
calculation across various machines (called the 'evaluators'). When an
evaluator aborts for any reason (it could even be a blackout), I need to
handle the situation by delegating its calculation to another evaluator, so
that I do not lose the calculations already done by the other evaluators.
Currently, when an evaluator aborts (I tried killing the process), the main
process (ID = 0) aborts with the error message described, and this is a
serious problem for me.
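
The delegation described above can be sketched independently of MPI. This
is only an illustration under stated assumptions: send_fn, distribute and
fake_send are hypothetical names, and the send callback merely stands in
for a call such as MPI_Send that returns 0 on success and nonzero on
failure once MPI_ERRORS_RETURN is set. It does not stop the process
manager from aborting the whole job when a process is killed.

```c
#include <assert.h>

/* Hypothetical stand-in for a send call that reports failure through its
   return value: 0 on success, nonzero on failure. */
typedef int (*send_fn)(int evaluator, int task);

/* Hand each task to a live evaluator, round-robin. When a send fails, mark
   that evaluator as dead and retry the same task on the next live one.
   owner[t] receives the evaluator that accepted task t, or -1 if no live
   evaluator was left. Returns the number of tasks assigned. */
int distribute(send_fn send, int alive[], int n_eval,
               int owner[], int n_tasks)
{
    int assigned = 0;
    int next = 0; /* evaluator to try first for the next task */

    assert(n_eval > 0);
    for (int t = 0; t < n_tasks; t++) {
        owner[t] = -1;
        for (int tries = 0; tries < n_eval; tries++) {
            int e = (next + tries) % n_eval;
            if (!alive[e])
                continue;          /* skip evaluators already known dead */
            if (send(e, t) == 0) {
                owner[t] = e;
                next = (e + 1) % n_eval;
                assigned++;
                break;
            }
            alive[e] = 0;          /* send failed: stop using this one */
        }
    }
    return assigned;
}

/* Simulated transport for demonstration only: evaluator 1 is unreachable. */
int fake_send(int evaluator, int task)
{
    (void)task;
    return evaluator == 1 ? 1 : 0;
}
```

With three evaluators and five tasks, calling distribute with fake_send
marks evaluator 1 as dead on its first failed send and spreads the tasks
over evaluators 0 and 2. In a real program the callback would wrap the
actual send and the results would also have to be collected, which this
sketch leaves out.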

Best regards,

Gianluca Arcidiacono



----- Original message -----
From: Jayesh Krishna <jayesh at mcs.anl.gov>
To: AGPX <agpxnet at yahoo.it>
Cc: mpich-discuss at mcs.anl.gov
Sent: Monday, November 12, 2007, 17:29:52
Subject: RE: [MPICH] Error handling issue


Hi,
 This is probably an error message from the process manager.
 How are you aborting the process?
 
Regards,
Jayesh


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of AGPX
Sent: Sunday, November 11, 2007 6:37 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Error handling issue


Hi,

I have written the following code, hoping to prevent my main process from
aborting on an MPI error:

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &MPIId);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

but when I terminate a job process on another machine (pcamd3000 is the
main machine, pcamd2600 the other; both run Windows XP Pro), the main
process aborts. Here is the error message:

job aborted:
rank: node: exit code[: error message]
0: pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=00B458B0, count=1, MPI_INT, dest=1, tag=0, comm=0x84000000) failed
MPIDI_CH3I_Progress(148)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(497):
MPIDU_Sock_wait(2603).....................: Il nome di rete specificato non è più disponibile. (errno 64)
1: pcamd2600: 1: process 1 exited without calling finalize
2: pcamd2600: 1

(Note that the message 'Il nome di rete specificato non è più
disponibile.' is, in English, 'The specified network name is no longer
available.')

What am I missing? I have more than one communicator, but I have also used
MPI_Comm_set_errhandler to set their error handlers to MPI_ERRORS_RETURN.
The code is:

...
MPI_Group_incl(worldGroup, nRanks, ranks, &handle.group);
MPI_Comm_create(MPI_COMM_WORLD, handle.group, &handle.comm);
MPI_Comm_set_errhandler(handle.comm, MPI_ERRORS_RETURN);
...

I have also tried with MPI_Errhandler_set, but this doesn't help:

MPI_Errhandler_set(..., MPI_ERRORS_RETURN);

Any suggestions?

Thanks,

- AGPX



