[mpich-discuss] mpich2-1.4.1p1 stability (thru Grid Engine)

Bernard Chambon bernard.chambon at cc.in2p3.fr
Wed Dec 14 08:19:46 CST 2011


Hello

I have been using mpich2-1.4.1p1 for a few days, and I appreciate its integration with Grid Engine.
I have run many jobs, but from time to time (about 1 run in 10) I get failures.

My basic test code is an infinite loop in which a master task sends a message (10 MB) to the slave tasks.

10 MB is an arbitrary value. What is the message size limit with MPI_Send(message, …)?
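
For reference, here is a small sketch of the size arithmetic behind that question. The only fact it relies on is that the count argument of MPI_Send is a C int, so a single call can describe at most INT_MAX elements. The helper names fits_in_single_send and sends_needed are hypothetical (they are not part of MPI or MPICH), and the sketch assumes MPI_CHAR elements, i.e. one byte each. It covers only the limit implied by the C signature and says nothing about the cause of the failures reported in this message.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not an MPI call): returns 1 if nbytes can be
 * passed as the int `count` of one MPI_Send of MPI_CHAR elements,
 * and 0 otherwise. */
int fits_in_single_send(size_t nbytes)
{
    return nbytes <= (size_t)INT_MAX;
}

/* Hypothetical helper (not an MPI call): number of sends needed when
 * each send carries at most `chunk` bytes. Rounds up so that a partial
 * last chunk is counted. Returns 0 for a zero chunk size, which is
 * treated here as invalid input. */
size_t sends_needed(size_t nbytes, size_t chunk)
{
    if (chunk == 0) {
        return 0;
    }
    return (nbytes + chunk - 1) / chunk;
}
```

With these helpers, a 10 MB buffer (10 * 1024 * 1024 = 10485760 bytes) fits in one call by a wide margin, so the int count is not the constraint at this message size.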

I have encountered two kinds of errors:

[mpiexec at ccwpge0034] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec at ccwpge0034] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at ccwpge0034] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec at ccwpge0034] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

OR

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..........................: MPI_Send(buf=0x2b1283a75010, count=5242880, MPI_CHAR, dest=36, tag=0, MPI_COMM_WORLD) failed
MPID_nem_lmt_RndvSend(81)..............: 
MPIDI_CH3_RndvSend(63).................: failure occurred while attempting to send RTS packet
MPID_nem_tcp_iStartContigMsg(298)......: 
MPID_nem_tcp_connect(839)..............: 
MPID_nem_tcp_get_addr_port_from_bc(515): Missing port or invalid host/port description in business card


I have no idea how to investigate these failures.
	
Best regards

---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
