[mpich-discuss] MPI error, error stack
wzlu
wzlu at gate.sinica.edu.tw
Wed Jul 2 20:28:25 CDT 2008
Hi,
Thanks for your response.
I will ask the program's owner first; if he agrees, I will send the program
to you.
I have also read the man page on my system (RHEL 5). Signal 8 is
"Floating point exception".
Does that mean some code in the program causes this error, and that I have to
trace through the code to solve the problem?
I know the program runs well on AIX 5.3, so maybe it should run well on Linux too.
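
For reference, here is a minimal sketch of how the signal number can be checked on a Linux system, and of what "killed by signal 8" looks like from the parent's side. It is not taken from the failing program; the helper names describe_signal and run_child_that_dies_with are made up for illustration, and the child simply sends the signal to itself rather than performing a faulting operation. On Linux, SIGFPE is most commonly raised by an integer division by zero; a floating-point division by zero normally yields inf or NaN unless traps have been enabled.

```python
import signal
import subprocess
import sys


def describe_signal(signum):
    """Return (symbolic name, system description) for a signal number."""
    return signal.Signals(signum).name, signal.strsignal(signum)


def run_child_that_dies_with(signum):
    """Start a child Python process that sends itself `signum`.

    Hypothetical stand-in for an MPI rank that crashes: the return code
    is negative when the child was terminated by a signal.
    """
    code = "import os; os.kill(os.getpid(), %d)" % signum
    return subprocess.run([sys.executable, "-c", code]).returncode


# Look up what signal 8 means on this system.
name, text = describe_signal(8)
print(name, "-", text)

# A child killed by signal 8 is reported to the parent as return code -8.
status = run_child_that_dies_with(8)
print("child exit status:", status)
```

A process manager sees the same kind of status for a crashed rank, which is why it reports "killed by signal 8" for rank 18. The other ranks then lose their socket to that rank, which would be consistent with the "Connection reset by peer" messages being a consequence of the crash rather than a separate network fault.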
Best Regards.
Lu
Jayesh Krishna wrote:
>
> Hi,
> Can you send us a test program that fails?
> You might also want to look more into the error message,
>
> ===================
> rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
>   exit status of rank 18: killed by signal 8
> rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
>   exit status of rank 15: return code 1
> ===================
>
> and see what the signal 8 refers to in your system (possibly a
> floating point exception).
>
> Regards,
> Jayesh
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of wzlu
> Sent: Wednesday, July 02, 2008 3:58 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] MPI error, error stack
>
> Hi, all
>
> I used MPICH2 to run my job and got the following error message.
> I have tested cpi, and it runs without any error message.
> Is the error caused by the network, or by something else? Thanks a lot.
>
> Best Regards,
> Lu
>
> [cli_15]: aborting job:
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..........................: MPI_Waitall(count=10, req_array=0x11e9a90, status_array=0x11e9990) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(413):
> MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=14,errno=104:Connection reset by peer)
>
> cpu    real  user  sys   ratio  node
> 0*     0.40  0.01  0.01  6%     in04035.pcf.sinica.edu.tw
> total  0.40  0.01  0.01  0.06x
>
> memory  local  global  res size  pag flts  pag flts  voluntary  involunt
>         heap   heap    (pages)   minor     major     switches   switches
> 0*      3MB    1KB     0         2135      18        854        5
> total   3MB    1KB     0         2135      18        854        5
>
> messages  send  send   send  recv  recv   recv  copy  copy   copy
>           cnt   total  avg   cnt   total  avg   cnt   total  avg
> 0*        0     0 B    0 B   0     0 B    0 B   0     0 B    0 B
> total     0     0 B    0 B   0     0 B    0 B   0     0 B    0 B
> rank 18 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
>   exit status of rank 18: killed by signal 8
> rank 15 in job 1 in04033.pcf.sinica.edu.tw_53415 caused collective abort of all ranks
>   exit status of rank 15: return code 1
> [cli_13]: aborting job:
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..........................: MPI_Waitall(count=6, req_array=0x11e9a40, status_array=0x11e9990) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(413):
> MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=7,errno=104:Connection reset by peer)
>
> cpu    real  user  sys   ratio  node
> 0*     0.40  0.01  0.03  9%     in04037.pcf.sinica.edu.tw
> total  0.40  0.01  0.03  0.09x
>
> memory  local  global  res size  pag flts  pag flts  voluntary  involunt
>         heap   heap    (pages)   minor     major     switches   switches
> 0*      3MB    1KB     0         2021      19        846        6
> total   3MB    1KB     0         2021      19        846        6
>
> messages  send  send   send  recv  recv   recv  copy  copy   copy
>           cnt   total  avg   cnt   total  avg   cnt   total  avg
> 0*        0     0 B    0 B   0     0 B    0 B   0     0 B    0 B
> total     0     0 B    0 B   0     0 B    0 B   0     0 B    0 B
>