[mpich-discuss] rank 0 in job 5 -- caused collective abort of all ranks

Anthony Chan chan at mcs.anl.gov
Tue May 13 13:26:26 CDT 2008


It does not appear to be an MPI Fortran binding issue.
It seems the Fortran program gets NaN because of some
unexpected error, such as roundoff or an invalid memory access.
Since it is a Fortran code, the compiler usually provides
an array bounds checker.  You may want to ask the user to enable
array bounds checking to see if there is an obvious invalid access,
or to use a debugger to trace back where the NaN comes from.
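Since the user's mpich2 was built with the Portland Group compilers, something along these lines should enable bounds checking and floating-point trapping (the flags are PGI's; the source file name here is only a placeholder, so adjust both for the actual build):

```shell
# Recompile with debug symbols, array bounds checking, and trapping of
# floating-point exceptions, so the run stops at the first bad operation
# instead of silently propagating NaN.
#   -g        debug symbols for the debugger
#   -Mbounds  runtime array bounds checking
#   -Ktrap=fp trap invalid operation, divide-by-zero, and overflow
pgf95 -g -Mbounds -Ktrap=fp -o ccp_meam ccp_meam.f90
```

Running the instrumented binary under a debugger (or just letting it abort with a traceback) should point at the statement that first produces the NaN.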

A.Chan  
----- "Margaret Doll" <Margaret_Doll at brown.edu> wrote:

> I am a system manager of a computer cluster running  RedHat,  
> 2.6.9-42.0.2.ELsmp.  Rocks 4.3 is the cluster software.
> mpich2 has been compiled using Portland Group compilers 7-1.1.  F95  
> was enabled in  the mpich2 build.
> 
> I have a user running a program which ends quickly with the following error:
> 
> $ more output
>   **********warning**********
>   for atom           597  rhobar=                        NaN
>                         NaN
>   **********warning**********
>   for atom             1  rhobar=                        NaN
>                         NaN
> rank 0 in job 1  compute-0-2.local_33364   caused collective abort of all ranks
>    exit status of rank 0: killed by signal 9
> 
> The log file only contains:
> more PtNi04000PT.log
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Starting execution at Tue May 13 09:25:07 EDT 2008
> Finished at Tue May 13 09:25:08 EDT 2008
> 
> The script used contains:
> 
> /share/apps/mpich/bin/mpiexec -np 16 $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/ccp_meam >
> $HOME/ccv/meam/work/isobaric/runs/PtNi04000PT/output
> 
> 
> The job is submitted using
> 
> qsub ./script
> 
> The default queue contains compute-0-1, compute-0-2, compute-0-3.   
> Each compute node has eight cores.
> 
> $ mpdtrace -l
> ted.mmm.nnn.edu_49144 (128.148.nnn.nnn)
> compute-0-3.local_33222 (10.255.255.249)
> compute-0-2.local_33364 (10.255.255.253)
> compute-0-1.local_42643 (10.255.255.251)
> compute-0-0.local_58959 (10.255.255.250)
> 
> Is there a problem in the way that mpich2 was built, in the way mpd is running, or
> with the Fortran 95 code?
> How do I debug the problem?
> 
> Thank you for your help.
