[mpich-discuss] mpich-error stack

Rajeev Thakur thakur at mcs.anl.gov
Sat Jan 8 00:36:34 CST 2011


This could be a firewall or networking issue between the two machines.
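
If you want to rule out basic connectivity first, a quick check along these lines can help (the hostname below is a placeholder for your other node, and the iptables call assumes a Linux firewall):

    # verify that each node can reach and log in to the other without a password prompt
    ping -c 3 node1
    ssh node1 hostname

    # list active firewall rules; MPICH2's TCP channel opens connections on
    # dynamically chosen high ports, so those must not be blocked between the nodes
    iptables -L -n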

Try using the latest version of MPICH2, which uses the Hydra process manager by default and does not need mpdboot.
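
With Hydra you skip mpdboot entirely and pass the hosts straight to mpiexec. A minimal sketch, assuming your executable is ./your_app and the two nodes are named node0 and node1 (the 24 processes match the rank numbers in your error output; adjust as needed):

    # host file: one machine per line, optionally with a process count per node
    cat > hosts <<EOF
    node0:12
    node1:12
    EOF

    # launch across both nodes; -f names the host file, -n the total number of processes
    mpiexec -f hosts -n 24 ./your_app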

Rajeev

On Jan 6, 2011, at 1:38 AM, chenjie gu wrote:

> Hi all, I have a cluster with two nodes. When I boot the mpd on a single node, the software runs well,
> but when I try to link the two nodes together for the calculation, the following errors appear. I guess it is
> a stack problem, though I have already set the stack size to unlimited. Any suggestion is welcome.
>  
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
> MPIDI_CH3I_Progress(150)..........: 
> MPID_nem_mpich2_blocking_recv(948): 
> MPID_nem_tcp_connpoll(1709).......: Communication error
> rank 23 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 23: killed by signal 9 
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffed7f23a0, status_array=0x7fffed7f2460) failed
> MPIDI_CH3I_Progress(150)..........: 
> MPID_nem_mpich2_blocking_recv(948): 
> MPID_nem_tcp_connpoll(1709).......: Communication error
> rank 21 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 21: killed by signal 9 
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffce96e120, status_array=0x7fffce96e1e0) failed
> MPIDI_CH3I_Progress(150)..........: 
> MPID_nem_mpich2_blocking_recv(948): 
> MPID_nem_tcp_connpoll(1709).......: Communication error
> rank 19 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 19: killed by signal 9 
> rank 17 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 17: killed by signal 9 
> rank 15 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 15: killed by signal 9 
> rank 13 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 13: killed by signal 9 
> rank 11 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 11: killed by signal 9 
> rank 9 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 9: killed by signal 11 
> rank 7 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 7: killed by signal 9 
> rank 5 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 5: killed by signal 9 
> rank 1 in job 1  node0_55860   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> 
> -- 
> Yours Regards,
> chenjie GU
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


