[mpich-discuss] mpich-error stack

chenjie gu archygu at gmail.com
Thu Jan 6 01:38:24 CST 2011


Hi all, I have a cluster with two nodes. When I boot mpd on a single
node, the software runs well, but when I try to link the two nodes
together to do the calculation, the errors below appear. I guess it is
a stack problem, though I have already set the stack size to unlimited.
Any suggestions are welcome.

Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46,
req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 23 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 23: killed by signal 9
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46,
req_array=0x7fffed7f23a0, status_array=0x7fffed7f2460) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 21 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 21: killed by signal 9
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46,
req_array=0x7fffce96e120, status_array=0x7fffce96e1e0) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 19 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 19: killed by signal 9
rank 17 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 17: killed by signal 9
rank 15 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 15: killed by signal 9
rank 13 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 13: killed by signal 9
rank 11 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 11: killed by signal 9
rank 9 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 9: killed by signal 11
rank 7 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9
rank 5 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 5: killed by signal 9
rank 1 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

-- 
Best regards,
chenjie GU