<html><head><style type="text/css"><!-- DIV {margin:0px;} --></style></head><body><div style="font-family:times new roman,new york,times,serif;font-size:12pt"><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;">Hi,<br><br>Thanks for the reply. I think a recovery strategy can only be implemented
at the user-code level. In my scenario, I simply want to
delegate the failed evaluator's computation to another one.
The main process must be aware of the failure, so it must be the one that
handles the error. In my opinion, SMPD shouldn't abort every process when one
of them fails. Any MPI call (Send/Recv, asynchronous/pending or not)
should fail when the corresponding process has died. That is, if process 0
is waiting for a response from process 1 (through a Recv) and process 1
dies, the Recv should fail and return an error code. If process 0 aborts,
then the whole job aborts too.<br><br>Best regards,<br><br>Gianluca Arcidiacono<br><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;">----- Original message -----<br>From: Jayesh Krishna &lt;jayesh@mcs.anl.gov&gt;<br>To: AGPX &lt;agpxnet@yahoo.it&gt;<br>Cc: mpich-discuss@mcs.anl.gov<br>Sent: Wednesday, 14 November 2007, 17:01:59<br>Subject: RE: [MPICH] Error handling issue<br><br>
<div dir="ltr" align="left"><span class="004544915-14112007"><font color="#0000ff" face="Arial" size="2">Hi,</font></span></div>
<div dir="ltr" align="left"><span class="004544915-14112007"><font color="#0000ff" face="Arial" size="2">Killing an MPI process results in the process manager
aborting all the MPI processes associated with the current job. This error is
not an MPI error (the errhandler associated with a communicator can only handle
MPI errors). The error message is then printed by the process manager (SMPD in
the case of Windows). </font></span></div>
<div dir="ltr" align="left"><span class="004544915-14112007"><font color="#0000ff" face="Arial" size="2"> I believe what you need is an MPI library
implementation where you could kill one of the MPI processes and still have the
MPI job running (the remaining MPI processes running). You could modify the
source code of SMPD to do that.</font> </span></div>
<div dir="ltr" align="left"><span class="004544915-14112007"><font color="#0000ff" face="Arial" size="2"></font></span> </div>
<div dir="ltr" align="left"><span class="004544915-14112007"><font color="#0000ff" face="Arial" size="2">(PS: Strictly speaking this level of fault tolerance should
be handled, if possible, at the process manager/library level -- not in the user
code.)</font></span></div>
<div><font color="#0000ff" face="Arial" size="2"></font> </div>
<div><span class="004544915-14112007"></span><font face="Arial"><font color="#0000ff"><font size="2">R<span class="004544915-14112007">egards,</span></font></font></font></div>
<div><font><font color="#0000ff"><font size="2"><span class="004544915-14112007"></span></font></font></font><span class="004544915-14112007"></span><font face="Arial"><font color="#0000ff"><font size="2">J<span class="004544915-14112007">ayesh</span></font></font></font><br></div>
<div class="OutlookMessageHeader" dir="ltr" align="left" lang="en-us">
<hr tabindex="-1">
<font face="Tahoma" size="2"><b>From:</b> AGPX [mailto:agpxnet@yahoo.it]
<br><b>Sent:</b> Wednesday, November 14, 2007 7:39 AM<br><b>To:</b> Jayesh
Krishna<br><b>Cc:</b> mpich-discuss@mcs.anl.gov<br><b>Subject:</b> Re: [MPICH]
Error handling issue<br></font><br></div>
<div></div>
<div style="font-size: 12pt; font-family: courier,monaco,monospace,sans-serif;">
<div style="font-size: 12pt; font-family: courier,monaco,monospace,sans-serif;"><span style="font-family: times new roman,new york,times,serif;">Hi,<br><br>I abort the
process by killing it from Task Manager. Basically, my application
(on the so-called 'main machine', ID = 0) distributes its calculations across various
machines (called the 'evaluators'). When an evaluator aborts for any reason
(it could even be a power outage), I need to handle the situation by delegating
its calculation to another evaluator, so that I don't lose the calculations
already done by the other evaluators. Currently, when an evaluator aborts (I have
tried killing the process), the main process (ID = 0) aborts with the error
message described, and this is a serious problem for me.<br></span><br style="font-family: times new roman,new york,times,serif;"><span style="font-family: times new roman,new york,times,serif;">Best
regards,</span><br style="font-family: times new roman,new york,times,serif;"><br style="font-family: times new roman,new york,times,serif;"><span style="font-family: times new roman,new york,times,serif;">Gianluca
Arcidiacono</span><br><br><br>
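<div style="font-family: times new roman,new york,times,serif;">The delegation step described above could be sketched roughly as follows. This is only a bookkeeping sketch in plain C under stated assumptions: the names next_alive_evaluator, reassign_tasks, owner, done and alive are hypothetical and not part of MPI or MPICH, rank 0 is assumed to be the main process, and how the main process learns that an evaluator has died is left out entirely.</div>
<pre>
```c
/* Sketch only: every name here is hypothetical, not part of MPI or MPICH. */

/* Pick the next live evaluator after `start`, scanning the ranks in a ring
   and skipping rank 0 (assumed to be the main process). `start` itself is
   tried last. Returns -1 if no evaluator is alive. */
int next_alive_evaluator(const int alive[], int nprocs, int start)
{
    int step;
    for (step = 1; step <= nprocs; ++step) {
        int rank = (start + step) % nprocs;
        if (rank != 0 && alive[rank]) {
            return rank;
        }
    }
    return -1;
}

/* Mark `failed` as dead and hand each of its unfinished tasks to a live
   evaluator, round-robin. owner[t] is the rank holding task t; done[t] is
   nonzero once the result of task t has been received.
   Returns the number of tasks moved, or -1 if no evaluator is left. */
int reassign_tasks(int owner[], const int done[], int ntasks,
                   int alive[], int nprocs, int failed)
{
    int moved = 0;
    int target = failed;
    int t;

    alive[failed] = 0;
    for (t = 0; t < ntasks; ++t) {
        if (owner[t] != failed || done[t]) {
            continue;   /* finished tasks and other evaluators are untouched */
        }
        target = next_alive_evaluator(alive, nprocs, target);
        if (target < 0) {
            return -1;  /* every evaluator has failed */
        }
        owner[t] = target;
        ++moved;
    }
    return moved;
}
```
</pre>
<div style="font-family: times new roman,new york,times,serif;">In this sketch the main process would call reassign_tasks once it knows an evaluator is gone, and then resend the moved tasks to their new owners. Results already received (done[t] nonzero) are kept, which matches the goal of not losing work already done. Whether a failure is ever reported to user code instead of aborting the whole job depends on the process manager, which is the open question in this thread.</div>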
<div style="font-size: 12pt; font-family: times new roman,new york,times,serif;">-----
Original message -----<br>From: Jayesh Krishna &lt;jayesh@mcs.anl.gov&gt;<br>To:
AGPX &lt;agpxnet@yahoo.it&gt;<br>Cc: mpich-discuss@mcs.anl.gov<br>Sent:
Monday, 12 November 2007, 17:29:52<br>Subject: RE: [MPICH] Error handling
issue<br><br>
<div dir="ltr" align="left"><font color="#0000ff" face="Arial" size="2"><span class="830242716-12112007">Hi,</span></font></div>
<div dir="ltr" align="left"><font color="#0000ff" face="Arial" size="2"><span class="830242716-12112007">This is probably an error message from
the process manager.</span></font></div>
<div><font color="#0000ff" face="Arial" size="2"><span class="830242716-12112007"> How are you aborting the
process?</span></font></div>
<div><font color="#0000ff" face="Arial" size="2"><span class="830242716-12112007"></span></font> </div>
<div><span class="830242716-12112007"><font color="#0000ff" face="Arial" size="2">Regards,</font></span></div>
<div><span class="830242716-12112007"><font color="#0000ff" face="Arial" size="2">Jayesh</font></span></div><font size="2"></font><br>
<div class="OutlookMessageHeader" dir="ltr" align="left" lang="en-us">
<hr tabindex="-1">
<font face="Tahoma" size="2"><b>From:</b> owner-mpich-discuss@mcs.anl.gov
[mailto:owner-mpich-discuss@mcs.anl.gov] <b>On Behalf Of
</b>AGPX<br><b>Sent:</b> Sunday, November 11, 2007 6:37 AM<br><b>To:</b>
mpich-discuss@mcs.anl.gov<br><b>Subject:</b> [MPICH] Error handling
issue<br></font><br></div>
<div></div>
<div style="font-size: 12pt; font-family: times new roman,new york,times,serif;">
<div>Hi,<br><br>I wrote the following code to keep my main process
from aborting on an MPI error:<br><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Init(&argc,
&argv);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Comm_rank(MPI_COMM_WORLD,
&MPIId);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Comm_size(MPI_COMM_WORLD,
&numprocs);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Comm_set_errhandler(MPI_COMM_WORLD,
<span style="font-weight: bold;">MPI_ERRORS_RETURN</span>);</span><br><br>but
when I terminate one of the job's processes on another machine (pcamd3000 is the main
machine, pcamd2600 the other; I use Windows XP Pro on both), the main
process aborts. Here is the error message:<br><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">job
aborted:</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">rank:
node: exit code[: error message]</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">0:
pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Send(173).............................:
MPI_Send(buf=00B458B0, count=1, MPI_INT,
dest=1, tag=0, comm=0x84000000) failed</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPIDI_CH3I_Progress(148)..................:
handle_sock_op failed</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPIDI_CH3I_Progress_handle_sock_event(497):</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPIDU_Sock_wait(2603).....................:
Il nome di rete specificato non è più</span><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">
disponibile. (errno 64)</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">1:
pcamd2600: 1: process 1 exited without calling finalize</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">2:
pcamd2600: 1</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><br>(Note
that the message '<span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">Il
nome di rete specificato non è più</span><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">
disponibile.' </span>means, in English, 'The specified network name is no longer
available'.)<br><br>What am I missing? I have more than one communicator, but I have
also used MPI_Comm_set_errhandler to set their error handlers to
MPI_ERRORS_RETURN. The code is:<br><br><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">...</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Group_incl(worldGroup,
nRanks, ranks, &handle.group);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Comm_create(MPI_COMM_WORLD,
handle.group, &handle.comm);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Comm_set_errhandler(handle.comm,
<span style="font-weight: bold;">MPI_ERRORS_RETURN</span>);</span><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">...</span><br><br>I
have also tried MPI_Errhandler_set, but it doesn't help:<br><br style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;"><span style="color: rgb(0, 0, 255); font-family: courier,monaco,monospace,sans-serif;">MPI_Errhandler_set(...,
MPI_ERRORS_RETURN);</span><br><br>Any suggestions?<br><br>Thanks,<br><br>-
AGPX<br><br><br></div></div><br>
</div><br></div></div><br>
</div><br></div></div><br>
</body></html>