[mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Hill, Patrick phill at radonc.wustl.edu
Wed Aug 24 09:34:31 CDT 2011


Hello,

 

We are having a problem using MPICH2 to run MCNPX in an MPI
environment, but only when we increase the number of processes beyond a
certain point. We have about 10 workstations in our "mini-cloud", each
running MPICH2 and SMPD version 1.3.2p1, 32-bit.

 

The command line is simple, something like the following:

 

mpiexec -hosts 3 10.39.16.37 2 10.39.16.65 8 10.39.16.54 8 -env DATAPATH c:\mcnpx\data -dir c:\mcnpx\phill\test mcnpx i=test1.in

 

We see the normal output from MCNPX, as well as the report that it is
initializing MPI processes. The problem is that MPIEXEC throws the
following error when trying to initialize the MPI calculation:

 

***

master starting      17 by       1 subtasks   08/19/11 17:02:54
master sending static commons...
master sending dynamic commons...
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)....................................: MPI_Send(buf=23110000, count=9486420, MPI_PACKED, dest=12, tag=4, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(353).........................:
MPID_nem_mpich2_blocking_recv(905)...............:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2655)............:
MPID_nem_newtcp_module_recv_success_handler(2322):
MPID_nem_handle_pkt(587).........................:
MPIDI_CH3_PktHandler_RndvClrToSend(253)..........: failure occurred while attempting to send message data
MPID_nem_newtcp_iSendContig(409).................:
MPID_nem_newtcp_iSendContig(408).................: Unable to write to a socket, An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (errno 10055)

***

 

That error is then followed by the one below, which is repeated several
times; the number of repetitions increases with the number of processes
requested:

 

***

Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).......................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=012242D0) failed
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.

***

 

A few notes on the situation and debugging we have attempted:

 

1. We use the precompiled MPI version of MCNPX, and the error occurs
with both v2.6 and v2.7e.

2. This does NOT happen when overloading the local CPU, e.g. sending 30
processes to a local dual-core machine.

3. This happens both when using the -hosts option and when using the
wmpiconfig utility to specify hosts together with the -n option on the
MPIEXEC command line.

4. The number of hosts by itself does not seem to be the cause: sending
5 processes to 2 hosts, 2 processes to 5 hosts, or 1 process to 10 hosts
all work fine (see the cpi command after this list).

5. This seems to be MCNPX input-file dependent, even between input files
that differ only by a few numbers in certain locations.

 
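Regarding note 4, one check we have not yet run is a plain MPICH2 job
with the failing layout, for example the cpi example program that ships
with MPICH2 (the path below assumes the default install location on our
32-bit machines):

mpiexec -hosts 3 10.39.16.37 2 10.39.16.65 8 10.39.16.54 8 "C:\Program Files\MPICH2\examples\cpi.exe"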

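As a further isolation test, a small standalone program that mimics the
failing ~9.5 MB MPI_Send would tell us whether the errno 10055 behavior
is specific to MCNPX or reproducible with any large message. The
following is only a sketch (the message size and tag are copied from the
error stack above; MPI_BYTE is used in place of MPI_PACKED for
simplicity):

/* sendtest.c - illustrative sketch, not MCNPX code:
 * rank 0 sends a ~9.5 MB message to every other rank, roughly mirroring
 * the MPI_Send(count=9486420, tag=4) that fails above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES 9486420   /* size taken from the failing MPI_Send */
#define TAG       4         /* tag taken from the failing MPI_Send  */

int main(int argc, char *argv[])
{
    int rank, size, dest;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = (char *) calloc(MSG_BYTES, 1);   /* zero-filled payload */
    if (buf == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        /* "master" sends the large buffer to every subtask */
        for (dest = 1; dest < size; dest++)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, dest, TAG, MPI_COMM_WORLD);
        printf("rank 0: sent %d bytes to %d ranks\n", MSG_BYTES, size - 1);
    } else {
        MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank %d: received %d bytes\n", rank, MSG_BYTES);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Built against MPICH2 and run with the same -hosts layout as the failing
MCNPX job, this should show whether a large send alone triggers the
socket buffer error.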
Could this be a communication or Windows firewall issue? We are truly
stumped and have had difficulty finding hints or answers in the forum.

 

Best regards and many thanks in advance,

 

patrick

 

 

Patrick M. Hill, Ph.D.

Washington University in St. Louis

Department of Radiation Oncology

 

