[MPICH] Error handling issue

AGPX agpxnet at yahoo.it
Wed Nov 14 07:39:27 CST 2007


Hi,

I abort the process by killing it from Task Manager. Basically, my application (on the so-called 'main machine', ID = 0) distributes its calculation across various machines (called the 'evaluators'). When an evaluator aborts for any reason (it could even be a blackout), I need to handle the situation by delegating its calculation to another evaluator, so that I do not lose the calculations already done by the other evaluators. Currently, when an evaluator aborts (I have tried killing the process), the main process (ID = 0) aborts with the error message described below, and this is a serious problem for me.
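
To illustrate the reassignment described above, here is a minimal sketch in plain C of the bookkeeping only. Every name in it (send_fn, assign_tasks, simulated_send) is hypothetical and not part of MPI or MPICH. It assumes that a failed send is reported to the caller as a nonzero return value instead of aborting the job, which is exactly the behaviour that is not happening in the case reported here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: tries to hand one task to one evaluator.
   Returns 0 on success, nonzero on failure. In an MPI program this could
   wrap MPI_Send and return nonzero when the result is not MPI_SUCCESS. */
typedef int (*send_fn)(int evaluator, int task, void *ctx);

/* Assign each task to an evaluator that is still alive, round-robin.
   alive[e] is cleared once a send to evaluator e fails, and the same task
   is then offered to the next live evaluator.
   owner[t] receives the evaluator that accepted task t, or -1 if none did.
   Returns the number of tasks that were successfully assigned. */
int assign_tasks(int n_tasks, int n_evaluators, int *alive, int *owner,
                 send_fn send, void *ctx)
{
    int assigned = 0;
    int next = 0; /* next evaluator to try */

    assert(n_tasks >= 0 && n_evaluators >= 0);

    for (int t = 0; t < n_tasks; ++t) {
        owner[t] = -1;
        for (int tries = 0; tries < n_evaluators; ++tries) {
            int e = next;
            next = (next + 1) % n_evaluators;
            if (!alive[e])
                continue; /* skip evaluators already known to be dead */
            if (send(e, t, ctx) == 0) {
                owner[t] = e;
                ++assigned;
                break;
            }
            alive[e] = 0; /* send failed: stop using this evaluator */
        }
    }
    return assigned;
}

/* Example callback for experiments: ctx points to an int array in which
   broken[e] != 0 means that every send to evaluator e fails. */
int simulated_send(int evaluator, int task, void *ctx)
{
    const int *broken = ctx;
    (void)task; /* the task id is not needed by the simulation */
    return broken[evaluator] ? 1 : 0;
}
```

With three evaluators of which the second is marked broken, the task first offered to it is handed to the third one instead, and later tasks skip the broken evaluator. The sketch says nothing about whether the process manager lets the main process survive the loss of a peer.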

Best regards,

Gianluca Arcidiacono


----- Original Message -----
From: Jayesh Krishna <jayesh at mcs.anl.gov>
To: AGPX <agpxnet at yahoo.it>
Cc: mpich-discuss at mcs.anl.gov
Sent: Monday, November 12, 2007, 17:29:52
Subject: RE: [MPICH] Error handling issue



 



Hi,

This could probably be an error message given by the process manager.

How are you aborting the process?

 

Regards,

Jayesh




From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of AGPX
Sent: Sunday, November 11, 2007 6:37 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Error handling issue






Hi,

I have written the following code, hoping to prevent my main process from aborting on an MPI error:

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &MPIId);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

but when I try to terminate a job process on another machine (pcamd3000 is the main machine, pcamd2600 is the other; both run Windows XP Pro), the main process aborts. Here is the error message:

job aborted:
rank: node: exit code[: error message]
0: pcamd3000: 1: Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=00B458B0, count=1, MPI_INT, dest=1, tag=0, comm=0x84000000) failed
MPIDI_CH3I_Progress(148)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(497):
MPIDU_Sock_wait(2603).....................: Il nome di rete specificato non è più disponibile. (errno 64)
1: pcamd2600: 1: process 1 exited without calling finalize
2: pcamd2600: 1

(Note that the message 'Il nome di rete specificato non è più disponibile.' means 'The specified network name is no longer available.' in English.)

What am I missing? I have more than one communicator, but I have also used MPI_Comm_set_errhandler to set their error handlers to MPI_ERRORS_RETURN. The code is:

...
MPI_Group_incl(worldGroup, nRanks, ranks, &handle.group);
MPI_Comm_create(MPI_COMM_WORLD, handle.group, &handle.comm);
MPI_Comm_set_errhandler(handle.comm, MPI_ERRORS_RETURN);
...

I have also tried MPI_Errhandler_set, but this doesn't help:

MPI_Errhandler_set(..., MPI_ERRORS_RETURN);

Any suggestions?

Thanks,

- AGPX













