[MPICH] Error handling issue

AGPX agpxnet at yahoo.it
Wed Nov 14 10:47:53 CST 2007


Hi,

thanks for the reply. I think that a recovery strategy can only be
implemented at the user-code level. In my scenario, I simply want to
delegate the computation of the failed evaluator to another one. The
main process must be aware of the failure, so it must be the one that
handles the error. In my opinion, SMPD shouldn't abort all the processes
if one fails. Any MPI call (Send/Recv, asynchronous/pending or not)
must fail when the peer process has died. That is: if process 0 is
waiting for a response from process 1 (through a Recv) and process 1
dies, the Recv must fail and return an error code. If process 0 aborts,
then the whole job aborts too.
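
The delegation strategy described above can be sketched in plain C, independently of MPI. This is only an illustration under stated assumptions: `run_task_fn`, `dispatch_with_failover` and `simulated_run` are hypothetical names, not part of MPICH or SMPD. In a real program the callback would wrap the MPI_Send/MPI_Recv exchange with an evaluator and return its error code, which presumes an MPI library and process manager that report a dead peer as an error instead of aborting the job.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: run `task` on `evaluator`.
 * Returns 0 on success, non-zero if the evaluator could not be reached.
 * In an MPI program this would wrap MPI_Send/MPI_Recv and return their
 * error code (with MPI_ERRORS_RETURN set on the communicator). */
typedef int (*run_task_fn)(int evaluator, int task, void *ctx);

/* Hands tasks 0..n_tasks-1 to live evaluators in round-robin order.
 * When an evaluator fails, it is marked dead in `alive` and the same
 * task is retried on the next live evaluator.
 * Returns the number of tasks completed; this is less than n_tasks
 * only if every evaluator has died. */
int dispatch_with_failover(int n_tasks, int n_evaluators, bool *alive,
                           run_task_fn run, void *ctx)
{
    assert(alive != NULL && run != NULL);

    int n_alive = 0;
    for (int e = 0; e < n_evaluators; ++e) {
        if (alive[e]) {
            ++n_alive;
        }
    }

    int done = 0;
    int next = 0;
    int task = 0;
    while (task < n_tasks && n_alive > 0) {
        /* Skip dead evaluators; n_alive > 0 guarantees this terminates. */
        while (!alive[next]) {
            next = (next + 1) % n_evaluators;
        }
        if (run(next, task, ctx) == 0) {
            ++done;
            ++task;               /* move on to the next task */
        } else {
            alive[next] = false;  /* evaluator failed: retry the same task */
            --n_alive;
        }
        next = (next + 1) % n_evaluators;
    }
    return done;
}

/* Simulated transport for demonstration only: the evaluator whose id
 * equals *(int *)ctx is treated as crashed. */
int simulated_run(int evaluator, int task, void *ctx)
{
    (void)task;
    return evaluator == *(const int *)ctx ? 1 : 0;
}
```

With three evaluators and evaluator 1 simulated as crashed, all tasks still complete: the task that failed on evaluator 1 is retried on evaluator 2, and evaluator 1 is skipped from then on.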

Best regards,

Gianluca Arcidiacono

----- Original message -----
From: Jayesh Krishna <jayesh at mcs.anl.gov>
To: AGPX <agpxnet at yahoo.it>
Cc: mpich-discuss at mcs.anl.gov
Sent: Wednesday, 14 November 2007, 17:01:59
Subject: RE: [MPICH] Error handling issue



 



Hi,

 Killing an MPI process results in the process manager
aborting all the MPI processes associated with the current job. This error is
not an MPI error (the errhandler associated with a communicator can only handle
MPI errors). The error message is then printed by the process manager (SMPD in
the case of Windows).

 I believe what you need is an MPI library 
implementation where you could kill one of the MPI processes and still have the 
MPI job running (the remaining MPI processes running). You could modify the 
source code of SMPD to do that. 

 

(PS: Strictly speaking this level of fault tolerance should 
be handled, if possible, at the process manager/library level -- not in the user 
code.)

 

Regards,

Jayesh




From: AGPX [mailto:agpxnet at yahoo.it]
Sent: Wednesday, November 14, 2007 7:39 AM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [MPICH] Error handling issue






Hi,

I abort the process by killing it (from Task Manager). Basically, my application
(on the so-called 'main machine', ID = 0) distributes its calculation across various
machines (called the 'evaluators'). When an evaluator aborts for any reason
(it could even be a blackout), I need to handle the situation by delegating
its calculation to another evaluator, so that I avoid losing the calculations
already done by the other evaluators. Currently, when an evaluator aborts (I have
tried killing the process), the main process (ID = 0) aborts with the error
message described, and this is a serious problem for me.

Best regards,

Gianluca Arcidiacono



----- Original message -----
From: Jayesh Krishna <jayesh at mcs.anl.gov>
To: AGPX <agpxnet at yahoo.it>
Cc: mpich-discuss at mcs.anl.gov
Sent: Monday, 12 November 2007, 17:29:52
Subject: RE: [MPICH] Error handling issue




Hi,

 This could probably be an error message given by the process manager.

 How are you aborting the process?

 

Regards,

Jayesh




From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of AGPX
Sent: Sunday, November 11, 2007 6:37 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Error handling issue






Hi,

I have written the following code, hoping to keep my main process
from aborting on an MPI error:

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &MPIId);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

but when I try to terminate a job process on another machine (pcamd3000 is the main
machine, pcamd2600 the other; I use Windows XP Pro on both), the main
process aborts. Here is the error message:

job aborted:
rank: node: exit code[: error message]
0: pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=00B458B0, count=1, MPI_INT, dest=1, tag=0, comm=0x84000000) failed
MPIDI_CH3I_Progress(148)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(497):
MPIDU_Sock_wait(2603).....................: Il nome di rete specificato non è più disponibile. (errno 64)
1: pcamd2600: 1: process 1 exited without calling finalize
2: pcamd2600: 1

(Note that the message 'Il nome di rete specificato non è più disponibile.'
means 'The specified network name is no longer available.' in English.)

What am I missing? I have more than one communicator, but I have
also used MPI_Comm_set_errhandler to set their error handlers to
MPI_ERRORS_RETURN. The code is:

...
MPI_Group_incl(worldGroup, nRanks, ranks, &handle.group);
MPI_Comm_create(MPI_COMM_WORLD, handle.group, &handle.comm);
MPI_Comm_set_errhandler(handle.comm, MPI_ERRORS_RETURN);
...

I have also tried MPI_Errhandler_set, but this doesn't help:

MPI_Errhandler_set(..., MPI_ERRORS_RETURN);

Any suggestions?

Thanks,

- AGPX














      ___________________________________ 
The next generation of email? You can have it with the new Yahoo! Mail: http://it.docs.yahoo.com/nowyoucan.html