[mpich-discuss] Re: mpich-error stack

ejoywx ejoywx at 163.com
Thu Jan 6 05:13:28 CST 2011


I think you may need to try rewriting your code to use nonblocking communication.


At 2011-01-06 15:38:24,"chenjie gu" <archygu at gmail.com> wrote:

Hi all, I have a cluster with two nodes. When I boot mpd on a single node, the software runs well,
but when I try to link the two nodes together to do the calculation, the problems below occur. I guess it is
a stack problem, though I have already set the stack size to unlimited. Any suggestions are welcome.

 
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 23 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 23: killed by signal 9
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffed7f23a0, status_array=0x7fffed7f2460) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 21 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 21: killed by signal 9
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffce96e120, status_array=0x7fffce96e1e0) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 19 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 19: killed by signal 9
rank 17 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 17: killed by signal 9
rank 15 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 15: killed by signal 9
rank 13 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 13: killed by signal 9
rank 11 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 11: killed by signal 9
rank 9 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 9: killed by signal 11
rank 7 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9
rank 5 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 5: killed by signal 9
rank 1 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
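
The exit statuses above are not all the same: most ranks report signal 9, but rank 9 reports signal 11. On Linux, signal 9 is SIGKILL and signal 11 is SIGSEGV, so the rank that segfaulted may be the one worth checking first, with the others possibly killed as part of the collective abort. A minimal sketch for grouping ranks by exit signal, assuming the job output has been saved as text; the function name summarize_exit_signals and the shortened sample are illustrative only, not part of MPICH:

```python
import re
import signal
from collections import defaultdict

# Matches lines such as "  exit status of rank 9: killed by signal 11".
EXIT_LINE = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")


def summarize_exit_signals(log_text):
    """Group rank numbers by the signal that terminated them."""
    by_signal = defaultdict(list)
    for match in EXIT_LINE.finditer(log_text):
        rank, signum = int(match.group(1)), int(match.group(2))
        by_signal[signum].append(rank)
    return {signum: sorted(ranks) for signum, ranks in by_signal.items()}


def signal_name(signum):
    """Return the platform's name for a signal number, if it has one."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "signal %d" % signum


# Shortened sample (hypothetical excerpt) in the same format as the output above.
sample = """\
rank 23 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 23: killed by signal 9
rank 9 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 9: killed by signal 11
"""

for signum, ranks in sorted(summarize_exit_signals(sample).items()):
    print(signal_name(signum), ranks)
```

Running this over the full output would list which ranks received which signal, making any rank that differs from the rest easy to spot.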

--
Yours Regards,
chenjie GU