[mpich-discuss] mpich2-1.4.1p1 stability (thru Grid Engine)
Bernard Chambon
chambon at cc.in2p3.fr
Wed Dec 14 14:38:36 CST 2011
Hi,
On 14 Dec 2011, at 15:19, Bernard Chambon wrote:
>
> I encountered two kinds of errors:
>
> [mpiexec at ccwpge0034] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
> [mpiexec at ccwpge0034] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ccwpge0034] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> [mpiexec at ccwpge0034] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>
> OR
>
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173)..........................: MPI_Send(buf=0x2b1283a75010, count=5242880, MPI_CHAR, dest=36, tag=0, MPI_COMM_WORLD) failed
> MPID_nem_lmt_RndvSend(81)..............:
> MPIDI_CH3_RndvSend(63).................: failure occurred while attempting to send RTS packet
> MPID_nem_tcp_iStartContigMsg(298)......:
> MPID_nem_tcp_connect(839)..............:
> MPID_nem_tcp_get_addr_port_from_bc(515): Missing port or invalid host/port description in business card
>
>
> I have no idea how to investigate these failures.
>
From my own experiments, I can add that these errors occur when increasing the number of tasks (-np num_tasks):
With a small number of tasks (<= 32 tasks), I see no errors over more than ~10 tries (= 10 jobs).
With 64 tasks, I got the previous error messages 3 times out of 4.
Whether the tasks are spread out or grouped on the worker nodes makes no difference (I mean round_robin vs fill_up in the allocation_rule of the SGE PE object).
So, in my opinion, the determining factor is the number of tasks, not the number of machines.
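
For anyone who wants to repeat this kind of scaling test, a minimal sketch of the loop I describe above could look like the following. This assumes it runs inside an SGE job that already holds a parallel-environment allocation, and the test program name (./ring_test) is a placeholder, not part of my actual setup:

    #!/bin/sh
    # Run the same MPI job repeatedly at a fixed task count
    # to estimate the failure rate (-np is the suspected factor).
    NP=64
    for i in 1 2 3 4 5 6 7 8 9 10; do
        if mpiexec -np $NP ./ring_test; then
            echo "try $i: OK"
        else
            echo "try $i: FAILED"
        fi
    done

Comparing the OK/FAILED counts at -np 32 vs -np 64 should show the same pattern I report above.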
Best regards
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18