[mpich-discuss] Re: Fatal error in PMPI_Bcast: Other MPI error, errorstack:

游手好闲 66152764 at qq.com
Thu Jul 28 19:07:17 CDT 2011


I debugged the error using mpich2-1.2.1p1 with mpd.
Oddly, the cause was simply that some of the nodes had a DNS server address configured.

After I removed the DNS setting on those nodes, it works. Thank you for your reply.
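
For anyone who hits the same symptom, here is a rough sketch of how one might check name resolution on each node. It is only an illustration, not part of MPICH: the function name check_hosts and its resolve parameter are made up for this example, and it assumes the failure comes from a host name resolving differently (or not at all) once a DNS server is consulted, which I have not confirmed beyond the observation above.

```python
import socket


def check_hosts(hostnames, resolve=socket.gethostbyname):
    """Map each host name to the IPv4 address it resolves to on this node.

    `check_hosts` and `resolve` are hypothetical names for this sketch.
    `resolve` defaults to the system resolver and can be swapped out
    for testing without touching the network.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError as exc:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = "FAILED: %s" % exc
    return results
```

Calling check_hosts(["hksbs-s13.com", "hksbs-s11.com"]) on every node in the hosts file and comparing the answers should show whether the nodes agree on each other's addresses. A name that fails, or that maps to an address the target node does not actually have, would be worth looking at first.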
  
   
  
------------------ Original Message ------------------
From: "Pavan Balaji"<balaji at mcs.anl.gov>;
Sent: Tuesday, July 26, 2011, 9:35 PM
To: "mpich-discuss"<mpich-discuss at mcs.anl.gov>;
Cc: "游手好闲"<66152764 at qq.com>;
Subject: Re: [mpich-discuss] Fatal error in PMPI_Bcast: Other MPI error, errorstack:

  

Did you do the checks listed on this FAQ entry?

http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes

  -- Pavan

On 07/26/2011 01:55 AM, 游手好闲 wrote:
> Hi,
> My hosts file:
> hksbs-s13.com:8
> hksbs-s11.com:8
> When I run on one node, it works:
> [root at hksbs-s13 examples_collchk]# mpiexec -f hosts -n 8 ./time_bcast_nochk
> time taken by 1X1 MPI_Bcast() at rank 0 = 0.000005
> time taken by 1X1 MPI_Bcast() at rank 1 = 0.000002
> time taken by 1X1 MPI_Bcast() at rank 2 = 0.000003
> time taken by 1X1 MPI_Bcast() at rank 3 = 0.000002
> time taken by 1X1 MPI_Bcast() at rank 4 = 0.000004
> time taken by 1X1 MPI_Bcast() at rank 5 = 0.000002
> time taken by 1X1 MPI_Bcast() at rank 6 = 0.000003
> time taken by 1X1 MPI_Bcast() at rank 7 = 0.000002
> But when the job extends to the other node, it fails:
> [root at hksbs-s13 examples_logging]# mpiexec -f hosts -n 9 ./srtest
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1478)......................: MPI_Bcast(buf=0x16fc2aa8,
> count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1321).................:
> MPIR_Bcast_intra(1119)................:
> MPIR_Bcast_scatter_ring_allgather(961):
> MPIR_Bcast_binomial(213)..............: Failure during collective
> MPIR_Bcast_scatter_ring_allgather(952):
> MPIR_Bcast_binomial(189)..............:
> MPIC_Send(63).........................:
> MPIDI_EagerContigShortSend(262).......: failure occurred while
> attempting to send an eager message
> MPIDI_CH3_iStartMsg(36)...............: Communication error with rank 8
> When I ssh to the other node, for example:
>
> [root at hksbs-s13 examples_logging]# ssh hksbs-s11.com
> Last login: Tue Jul 26 15:45:22 2011 from 10.33.15.233
> [root at hksbs-s11 ~]#
> it works.
> How can I find the cause?
>
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji