<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi,<div><br><div><div>On 14 Dec 2011, at 15:19, Bernard Chambon wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; font-family: Courier; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; "><div style="font-size: 16px; "><br class="Apple-interchange-newline">I encountered two kinds of error:</div><div style="font-size: 16px; "><br></div><div style="font-size: 16px; "><span class="Apple-style-span" style="font-size: 12px; ">[mpiexec@ccwpge0034] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed</span></div><div style="font-size: 12px; ">[mpiexec@ccwpge0034] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status</div><div style="font-size: 12px; ">[mpiexec@ccwpge0034] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event</div><div style="font-size: 12px; ">[mpiexec@ccwpge0034] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion</div><div style="font-size: 12px; "><br></div><div style="font-size: 12px; ">OR</div><div style="font-size: 12px; "><br></div><div style="font-size: 12px; "><div>Fatal error in MPI_Send: Other MPI error, error stack:</div><span></span><div>MPI_Send(173)..........................: MPI_Send(buf=0x2b1283a75010, count=5242880, MPI_CHAR, dest=36, tag=0, MPI_COMM_WORLD) failed</div><div>MPID_nem_lmt_RndvSend(81)..............: 
</div><div>MPIDI_CH3_RndvSend(63).................: failure occurred while attempting to send RTS packet</div><div>MPID_nem_tcp_iStartContigMsg(298)......: </div><div>MPID_nem_tcp_connect(839)..............: </div><div>MPID_nem_tcp_get_addr_port_from_bc(515): Missing port or invalid host/port description in business card</div><div><br></div></div><div style="font-size: 17px; "><br></div><div style="font-size: 17px; ">I have no idea how to investigate those failures.</div></span><br class="Apple-interchange-newline"></blockquote></div><div><br></div><div>From my own experiments, I can add that these errors occur when increasing the number of tasks (-np num_tasks).</div><div>With a small number of tasks (&lt;= 32), I got no errors in more than ~10 tries (= 10 jobs).</div><div>With 64 tasks, I got the error messages above 3 times out of 4.</div><div><br></div><div>Whether the tasks are spread across or grouped on the worker nodes makes no difference (I mean round_robin vs fill_up in the allocation_rule of the SGE PE object).</div><div>So, in my opinion, the determining factor is the number of tasks, not the number of machines.</div><div><br></div><div>Best regards</div><div>
---------------<br>Bernard CHAMBON<br>IN2P3 / CNRS<br>04 72 69 42 18<br>
</div>
<br></div></body></html>