[mpich-discuss] MPI error, error stack

Jayesh Krishna jayesh at mcs.anl.gov
Wed Jul 2 11:07:12 CDT 2008


Hi,
 Can you send us a test program that fails?
 You might also want to look more closely at this part of the error message,

===================
rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
  exit status of rank 18: killed by signal 8
rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
  exit status of rank 15: return code 1
===================

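 and check what signal 8 refers to on your system (possibly a floating
point exception). If rank 18 was killed by that signal, the "Connection
reset by peer" failures reported by the other ranks are probably a side
effect of losing that process rather than a network problem.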
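
One quick way to check both numbers is to ask the operating system for their
names. The following is only a minimal sketch in Python, run on one of the
compute nodes; the helper names describe_signal and describe_errno are made up
for this illustration and are not part of MPICH2. On Linux, signal 8 is
normally SIGFPE and errno 104 is ECONNRESET, but the mapping is platform
dependent, so the script looks the values up instead of assuming them.

```python
import errno
import os
import signal


def describe_signal(signum: int) -> str:
    """Return 'NAME: description' for a signal number on this platform."""
    try:
        sig = signal.Signals(signum)
    except ValueError:
        # The number does not correspond to a signal known on this system.
        return f"signal {signum}: not defined on this platform"
    # strsignal() may return None when the system has no description.
    text = signal.strsignal(signum) or "no description available"
    return f"{sig.name}: {text}"


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "unknown errno")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 8 and 104 are the values that appear in the error output of this thread.
    print(describe_signal(8))
    print(describe_errno(104))
```

If the signal does turn out to be SIGFPE, the place to look is the arithmetic
in the application code running on rank 18 (for example a division by zero),
not the MPI_Waitall calls that failed on the other ranks.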

Regards,
Jayesh
-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of wzlu
Sent: Wednesday, July 02, 2008 3:58 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] MPI error, error stack

Hi all,

I used MPICH2 to run my job and got the following error message.
I have tested cpi, and it runs without any error message.
Is the error caused by the network, or by something else? Thanks a lot.

Best Regards,
Lu

[cli_15]: aborting job:
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..........................: MPI_Waitall(count=10, req_array=0x11e9a90, status_array=0x11e9990) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=14,errno=104:Connection reset by peer)

cpu     real   user   sys    ratio   node
0*      0.40   0.01   0.01   6%      in04035.pcf.sinica.edu.tw
total   0.40   0.01   0.01   0.06x

memory   local   global   res size   pag flts   pag flts   voluntary   involunt
         heap    heap     (pages)    minor      major      switches    switches
0*       3MB     1KB      0          2135       18         854         5
total    3MB     1KB      0          2135       18         854         5

messages   send   send    send   recv   recv    recv   copy   copy    copy
           cnt    total   avg    cnt    total   avg    cnt    total   avg
0*         0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
total      0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
  exit status of rank 18: killed by signal 8
rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
  exit status of rank 15: return code 1
[cli_13]: aborting job:
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..........................: MPI_Waitall(count=6, req_array=0x11e9a40, status_array=0x11e9990) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=7,errno=104:Connection reset by peer)

cpu     real   user   sys    ratio   node
0*      0.40   0.01   0.03   9%      in04037.pcf.sinica.edu.tw
total   0.40   0.01   0.03   0.09x

memory   local   global   res size   pag flts   pag flts   voluntary   involunt
         heap    heap     (pages)    minor      major      switches    switches
0*       3MB     1KB      0          2021       19         846         6
total    3MB     1KB      0          2021       19         846         6

messages   send   send    send   recv   recv    recv   copy   copy    copy
           cnt    total   avg    cnt    total   avg    cnt    total   avg
0*         0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
total      0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
