[mpich-discuss] MPI error, error stack

wzlu wzlu at gate.sinica.edu.tw
Wed Jul 2 20:28:25 CDT 2008


Hi,

Thanks for your response.

I will ask the program owner first; if he agrees, I will send the program 
to you.

I have also read the man page on my system (RHEL 5). Signal 8 is 
"Floating point exception".
Does that mean some code in the program caused this error, and that I 
have to trace through the code to fix it?

I know the program runs well on AIX 5.3, so maybe it will run well on 
Linux too.

Best Regards.
Lu

Jayesh Krishna wrote:
>
> Hi,
> Can you send us a test program that fails?
> You might also want to look more into the error message,
>
> ===================
> rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
> exit status of rank 18: killed by signal 8
> rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
> exit status of rank 15: return code 1
> ===================
>
> and see what the signal 8 refers to in your system (possibly a 
> floating point exception).
>
> Regards,
> Jayesh
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of wzlu
> Sent: Wednesday, July 02, 2008 3:58 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] MPI error, error stack
>
> Hi, all
>
> I used MPICH2 to run my job and got the following error message.
> I have tested cpi without any error message.
> Is the error caused by the network, or by something else? Thanks a lot.
>
> Best Regards,
> Lu
>
> [cli_15]: aborting job:
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..........................: MPI_Waitall(count=10, 
> req_array=0x11e9a90, status_array=0x11e9990) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while 
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(413):
> MPIDU_Socki_handle_read(633)..............: connection failure 
> (set=0,sock=14,errno=104:Connection reset by peer)
>
> cpu       real  user  sys   ratio  node
> 0*        0.40  0.01  0.01  6%     in04035.pcf.sinica.edu.tw
> total     0.40  0.01  0.01  0.06x
>
> memory    local heap  global heap  res size (pages)  pag flts minor  pag flts major  voluntary switches  involunt switches
> 0*        3MB         1KB          0                 2135            18              854                 5
> total     3MB         1KB          0                 2135            18              854                 5
>
> messages  send cnt  send total  send avg  recv cnt  recv total  recv avg  copy cnt  copy total  copy avg
> 0*        0         0 B         0 B       0         0 B         0 B       0         0 B         0 B
> total     0         0 B         0 B       0         0 B         0 B       0         0 B         0 B
> rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
> exit status of rank 18: killed by signal 8
> rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
> exit status of rank 15: return code 1
> [cli_13]: aborting job:
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..........................: MPI_Waitall(count=6, 
> req_array=0x11e9a40, status_array=0x11e9990) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while 
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(413):
> MPIDU_Socki_handle_read(633)..............: connection failure 
> (set=0,sock=7,errno=104:Connection reset by peer)
>
> cpu       real  user  sys   ratio  node
> 0*        0.40  0.01  0.03  9%     in04037.pcf.sinica.edu.tw
> total     0.40  0.01  0.03  0.09x
>
> memory    local heap  global heap  res size (pages)  pag flts minor  pag flts major  voluntary switches  involunt switches
> 0*        3MB         1KB          0                 2021            19              846                 6
> total     3MB         1KB          0                 2021            19              846                 6
>
> messages  send cnt  send total  send avg  recv cnt  recv total  recv avg  copy cnt  copy total  copy avg
> 0*        0         0 B         0 B       0         0 B         0 B       0         0 B         0 B
> total     0         0 B         0 B       0         0 B         0 B       0         0 B         0 B
>



