[MPICH] Error handling issue
AGPX
agpxnet at yahoo.it
Wed Nov 14 07:39:27 CST 2007
Hi,
I abort the process by killing it from Task Manager. My application, running on the so-called 'main machine' (ID = 0), distributes its calculation across several machines (called the 'evaluators'). When an evaluator aborts for any reason (it could even be a power outage), I need to handle the situation by delegating its calculation to another evaluator, so that I do not lose the calculations already completed by the other evaluators. Currently, when an evaluator aborts (I tested this by killing the process), the main process (ID = 0) also aborts with the error message described below, and this is a serious problem for me.
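The recovery behaviour described above can be sketched in plain C as follows. This is only an illustration of the bookkeeping on the main process, under stated assumptions: it contains no MPI calls, it does not address how the failure is detected (which is the open question in this thread), and the names Task, NO_EVALUATOR, pick_live_evaluator, and reassign_tasks are hypothetical, not taken from the original application.

```c
#include <assert.h>
#include <stddef.h>

#define NO_EVALUATOR (-1)

/* Hypothetical bookkeeping kept by the main process (ID = 0). */
typedef struct {
    int owner;  /* evaluator ID the task is assigned to, or NO_EVALUATOR */
    int done;   /* nonzero once the result has been received */
} Task;

/* Return the first evaluator still marked alive, skipping `failed`,
   or NO_EVALUATOR if none remains. ID 0 is the main process, so the
   search starts at 1. */
int pick_live_evaluator(const int *alive, int n_procs, int failed)
{
    for (int id = 1; id < n_procs; ++id) {
        if (id != failed && alive[id]) {
            return id;
        }
    }
    return NO_EVALUATOR;
}

/* Mark `failed` as dead and hand its unfinished tasks to another live
   evaluator. Finished tasks are left untouched, so results already
   received are kept. Returns the number of tasks actually reassigned;
   if no evaluator is left, the tasks get owner NO_EVALUATOR. */
int reassign_tasks(Task *tasks, size_t n_tasks,
                   int *alive, int n_procs, int failed)
{
    assert(failed > 0 && failed < n_procs);

    alive[failed] = 0;
    int target = pick_live_evaluator(alive, n_procs, failed);
    int moved = 0;

    for (size_t i = 0; i < n_tasks; ++i) {
        if (tasks[i].owner == failed && !tasks[i].done) {
            tasks[i].owner = target;
            if (target != NO_EVALUATOR) {
                ++moved;
            }
        }
    }
    return moved;
}
```

In this sketch the main process would call reassign_tasks(tasks, n_tasks, alive, numprocs, failed_id) once it learns that an evaluator is gone, then resend the moved tasks to their new owner. Picking the first live evaluator is the simplest possible policy; a real application would probably balance the load instead.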
Best regards,
Gianluca Arcidiacono
----- Original message -----
From: Jayesh Krishna <jayesh at mcs.anl.gov>
To: AGPX <agpxnet at yahoo.it>
Cc: mpich-discuss at mcs.anl.gov
Sent: Monday, November 12, 2007, 17:29:52
Subject: RE: [MPICH] Error handling issue
Hi,
This could probably be an error message given by the process manager.
How are you aborting the process?
Regards,
Jayesh
From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of AGPX
Sent: Sunday, November 11, 2007 6:37 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Error handling issue
Hi,
I have written the following code to keep my main process from aborting on an MPI error:
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MPIId);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
but when I terminate a job process on another machine (pcamd3000 is the main machine, pcamd2600 the other; both run Windows XP Pro), the main process aborts. Here is the error message:
    job aborted:
    rank: node: exit code[: error message]
    0: pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:
    MPI_Send(173).............................: MPI_Send(buf=00B458B0, count=1, MPI_INT, dest=1, tag=0, comm=0x84000000) failed
    MPIDI_CH3I_Progress(148)..................: handle_sock_op failed
    MPIDI_CH3I_Progress_handle_sock_event(497):
    MPIDU_Sock_wait(2603).....................: Il nome di rete specificato non è più disponibile. (errno 64)
    1: pcamd2600: 1: process 1 exited without calling finalize
    2: pcamd2600: 1
(Note that the message 'Il nome di rete specificato non è più disponibile.' means 'The specified network name is no longer available.' in English.)
What am I missing? I have more than one communicator, but I have also used MPI_Comm_set_errhandler to set their error handlers to MPI_ERRORS_RETURN. The code is:
    ...
    MPI_Group_incl(worldGroup, nRanks, ranks, &handle.group);
    MPI_Comm_create(MPI_COMM_WORLD, handle.group, &handle.comm);
    MPI_Comm_set_errhandler(handle.comm, MPI_ERRORS_RETURN);
    ...
I have also tried MPI_Errhandler_set, but it doesn't help:
    MPI_Errhandler_set(..., MPI_ERRORS_RETURN);
Any suggestions?
Thanks,
-
AGPX