[mpich-discuss] MPI error, error stack

Anthony Chan chan at mcs.anl.gov
Wed Jul 2 20:57:51 CDT 2008


MPICH2 itself does little (if any) floating point arithmetic.
Most likely the MPI program has some invalid arithmetic, e.g.
division by zero or the square root of a negative number.  Just
recompile your code with -g and use a debugger to trace what causes
the error.
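
If the fault is in floating-point code, it can help to make the FPU raise
the exception right at the offending operation.  A minimal sketch, assuming
glibc on Linux (feenableexcept() is a GNU extension, and trap_fpes() is just
a hypothetical helper name, not part of MPICH2 or your program):

    /* Call trap_fpes() early in main(), e.g. right after MPI_Init(), so
     * that invalid operations, divide-by-zero and overflow deliver SIGFPE
     * at the faulting instruction instead of silently producing NaN/Inf. */
    #define _GNU_SOURCE
    #include <fenv.h>            /* feenableexcept(), FE_* flags */

    void trap_fpes(void)
    {
        feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
    }

Rebuild with "mpicc -g" (ideally without optimization), run the failing rank
under a debugger or examine its core file, and the backtrace will point at
the line doing the bad arithmetic.  Note that on x86 an integer division by
zero also delivers signal 8 (SIGFPE), so the culprit is not necessarily
floating-point code.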

A.Chan
----- "wzlu" <wzlu at gate.sinica.edu.tw> wrote:

> Hi,
> 
> Thanks for your response.
> 
> I will ask the program owner first; if he agrees, I will send the
> program to you.
> 
> I have also read the man page on my system (RHEL 5).  Signal 8 is
> "Floating point exception".
> Does that mean some code in the program causes this error, and that I
> have to trace the code to solve the problem?
> 
> I know the program runs well on AIX 5.3, so maybe it will run well on
> Linux too.
> 
> Best Regards.
> Lu
> 
> Jayesh Krishna wrote:
> >
> > Hi,
> > Can you send us a test program that fails?
> > You might also want to look more into the error message,
> >
> > ===================
> > rank 18 in job 1  in04033.pcf.sinica.edu.tw_53415  caused collective abort of all ranks
> >   exit status of rank 18: killed by signal 8
> > rank 15 in job 1  in04033.pcf.sinica.edu.tw_53415  caused collective abort of all ranks
> >   exit status of rank 15: return code 1
> > ===================
> >
> > and see what the signal 8 refers to in your system (possibly a 
> > floating point exception).
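
A quick way to check what a signal number refers to on a given system is to
print its name with strsignal() (a minimal standalone sketch; strsignal() is
provided by glibc):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>          /* strsignal() */

    int main(void)
    {
        /* On RHEL 5 / Linux this prints "Floating point exception". */
        printf("signal 8: %s\n", strsignal(8));
        return 0;
    }

The signal(7) man page lists the same mapping.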
> >
> > Regards,
> > Jayesh
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of wzlu
> > Sent: Wednesday, July 02, 2008 3:58 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] MPI error, error stack
> >
> > Hi, all
> >
> > I used MPICH2 to run my job, and I got the following error message.
> > I have tested cpi without any error message.
> > Is the error caused by the network, or by something else?  Thanks a lot.
> >
> > Best Regards,
> > Lu
> >
> > [cli_15]: aborting job:
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(242)..........................: MPI_Waitall(count=10, 
> > req_array=0x11e9a90, status_array=0x11e9990) failed
> > MPIDI_CH3_Progress_wait(212)..............: an error occurred while
> > handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(413):
> > MPIDU_Socki_handle_read(633)..............: connection failure 
> > (set=0,sock=14,errno=104:Connection reset by peer)
> >
> > cpu      real   user   sys    ratio   node
> > 0*       0.40   0.01   0.01   6%      in04035.pcf.sinica.edu.tw
> > total    0.40   0.01   0.01   0.06x
> >
> > memory   local   global   res size   pag flts   pag flts   voluntary   involunt
> >          heap    heap     (pages)    minor      major      switches    switches
> > 0*       3MB     1KB      0          2135       18         854         5
> > total    3MB     1KB      0          2135       18         854         5
> >
> > messages   send   send    send   recv   recv    recv   copy   copy    copy
> >            cnt    total   avg    cnt    total   avg    cnt    total   avg
> > 0*         0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
> > total      0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
> > rank 18 in job 1  in04033.pcf.sinica.edu.tw_53415  caused collective abort of all ranks
> >   exit status of rank 18: killed by signal 8
> > rank 15 in job 1  in04033.pcf.sinica.edu.tw_53415  caused collective abort of all ranks
> >   exit status of rank 15: return code 1
> > [cli_13]: aborting job:
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(242)..........................: MPI_Waitall(count=6, 
> > req_array=0x11e9a40, status_array=0x11e9990) failed
> > MPIDI_CH3_Progress_wait(212)..............: an error occurred while
> > handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(413):
> > MPIDU_Socki_handle_read(633)..............: connection failure 
> > (set=0,sock=7,errno=104:Connection reset by peer)
> >
> > cpu      real   user   sys    ratio   node
> > 0*       0.40   0.01   0.03   9%      in04037.pcf.sinica.edu.tw
> > total    0.40   0.01   0.03   0.09x
> >
> > memory   local   global   res size   pag flts   pag flts   voluntary   involunt
> >          heap    heap     (pages)    minor      major      switches    switches
> > 0*       3MB     1KB      0          2021       19         846         6
> > total    3MB     1KB      0          2021       19         846         6
> >
> > messages   send   send    send   recv   recv    recv   copy   copy    copy
> >            cnt    total   avg    cnt    total   avg    cnt    total   avg
> > 0*         0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
> > total      0      0 B     0 B    0      0 B     0 B    0      0 B     0 B
> >



