[mpich-discuss] mpich-error stack

chenjie gu archygu at gmail.com
Sat Jan 8 20:24:44 CST 2011


Hi Rajeev, I think so. Since each node has 24 cores, maybe the communication
rate is not sufficient. Thanks a lot;
let me try the latest version first.
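For reference, a minimal sketch of what launching with Hydra (the default process manager in recent MPICH2 releases) could look like. The node names, core counts, file name, and program name below are placeholders, not taken from the original thread:

```shell
# Hypothetical host file listing both nodes with 24 slots each
# (node names and counts are assumptions for illustration):
cat > hosts <<'EOF'
node0:24
node1:24
EOF

# With Hydra as the process manager there is no mpdboot step;
# mpiexec reads the host file directly and launches the ranks:
mpiexec -f hosts -n 48 ./my_mpi_program
```

If the communication errors persist under Hydra, checking that the TCP ports used by MPICH are not blocked by a firewall between the nodes would be the next step, per the advice below.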

2011/1/8 Rajeev Thakur <thakur at mcs.anl.gov>

> Could be firewall or networking issues between the two machines.
>
> Try using the latest version of MPICH2 that uses the Hydra process manager
> by default and does not need mpdboot.
>
> Rajeev
>
> On Jan 6, 2011, at 1:38 AM, chenjie gu wrote:
>
> > Hi all, I have a cluster with two nodes. When I boot the mpd on a
> single node, the software runs well,
> > but when I try to link the two nodes together for the calculation, the
> errors below appear. I guess it is
> > a stack problem, though I already set the stack size to unlimited. Any
> suggestions are welcome.
> >
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(261)..................: MPI_Waitall(count=46,
> req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
> > MPIDI_CH3I_Progress(150)..........:
> > MPID_nem_mpich2_blocking_recv(948):
> > MPID_nem_tcp_connpoll(1709).......: Communication error
> > rank 23 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 23: killed by signal 9
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(261)..................: MPI_Waitall(count=46,
> req_array=0x7fffed7f23a0, status_array=0x7fffed7f2460) failed
> > MPIDI_CH3I_Progress(150)..........:
> > MPID_nem_mpich2_blocking_recv(948):
> > MPID_nem_tcp_connpoll(1709).......: Communication error
> > rank 21 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 21: killed by signal 9
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(261)..................: MPI_Waitall(count=46,
> req_array=0x7fffce96e120, status_array=0x7fffce96e1e0) failed
> > MPIDI_CH3I_Progress(150)..........:
> > MPID_nem_mpich2_blocking_recv(948):
> > MPID_nem_tcp_connpoll(1709).......: Communication error
> > rank 19 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 19: killed by signal 9
> > rank 17 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 17: killed by signal 9
> > rank 15 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 15: killed by signal 9
> > rank 13 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 13: killed by signal 9
> > rank 11 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 11: killed by signal 9
> > rank 9 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 9: killed by signal 11
> > rank 7 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 7: killed by signal 9
> > rank 5 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 5: killed by signal 9
> > rank 1 in job 1  node0_55860   caused collective abort of all ranks
> >   exit status of rank 1: killed by signal 9
> >
> > --
> > Yours Regards,
> > chenjie GU
> >
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>



-- 
Yours Regards,
chenjie GU
EEE, Nanyang Technological University