[mpich-discuss] problems when increasing the number of processes using MPIEXEC and MCNPX

Jayesh Krishna jayesh at mcs.anl.gov
Wed Aug 24 10:48:46 CDT 2011


Hi,
 Can you try running simple MPI programs

1) like cpi (https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/examples/cpi.c)
2) and one that sends large messages (e.g.: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/pt2pt/large_message.c, https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/include/mpitest.h, https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/test/mpi/util/mtest.c)

 with the same configuration and see if they work? This will help us debug the issue further (from the error message it looks like you have issues with large messages).
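A minimal sketch of such a large-message test is below. It is not the MCNPX code or the linked test suite, just an illustrative stand-in that mimics the failing pattern (rank 0 sending a ~10 MB buffer to every other rank, as in the "master sending dynamic commons" step); it assumes a working mpicc/mpiexec on each node.

```c
/* large_send_test.c - minimal point-to-point large-message check.
 * Build: mpicc large_send_test.c -o large_send_test
 * Run:   mpiexec -n 4 large_send_test   (or with the same -hosts list)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    /* roughly the size seen in the failing MPI_Send (~9.5 MB) */
    const int count = 10 * 1024 * 1024;
    char *buf = malloc(count);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        memset(buf, 0xAB, count);
        /* rank 0 sends the large buffer to every other rank in turn */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(buf, count, MPI_BYTE, dest, 4, MPI_COMM_WORLD);
        printf("rank 0: sent %d bytes to %d ranks\n", count, size - 1);
    } else {
        MPI_Status st;
        MPI_Recv(buf, count, MPI_BYTE, 0, 4, MPI_COMM_WORLD, &st);
        printf("rank %d: received ok\n", rank);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

If this reproduces the errno 10055 failure with the same host layout, the problem is in the TCP/socket layer rather than in MCNPX itself.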

Regards,
Jayesh

----- Original Message -----
From: "Patrick Hill" <phill at radonc.wustl.edu>
To: mpich-discuss at mcs.anl.gov
Sent: Wednesday, August 24, 2011 9:34:31 AM
Subject: [mpich-discuss] problems when increasing the number of processes	using MPIEXEC and MCNPX

Hello,

We are having a problem using MPICH2 to execute MCNPX in an MPI environment, but only when we increase our number of processes beyond a certain point. We have about 10 workstations in our "mini-cloud", each running MPICH2 and SMPD version 1.3.2p1, 32-bit.

The command line is simple, something like the following:

mpiexec -hosts 3 10.39.16.37 2 10.39.16.65 8 10.39.16.54 8 -env DATAPATH c:\mcnpx\data -dir c:\mcnpx\phill\test mcnpx i=test1.in

We see the normal output from MCNPX, as well as the report that it is initializing MPI processes. The problem we are having is that MPIEXEC throws the following error when trying to initialize the MPI calculation:

***
master starting 17 by 1 subtasks 08/19/11 17:02:54
master sending static commons...
master sending dynamic commons...
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)....................................: MPI_Send(buf=23110000, count=9486420, MPI_PACKED, dest=12, tag=4, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(353).........................:
MPID_nem_mpich2_blocking_recv(905)...............:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2655)............:
MPID_nem_newtcp_module_recv_success_handler(2322):
MPID_nem_handle_pkt(587).........................:
MPIDI_CH3_PktHandler_RndvClrToSend(253)..........: failure occurred while attempting to send message data
MPID_nem_newtcp_iSendContig(409).................:
MPID_nem_newtcp_iSendContig(408).................: Unable to write to a socket, An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (errno 10055)
***

Then this error is repeated several times, increasing with the number of processes requested:

***
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).......................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=012242D0) failed
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.
***

A few notes on the situation and debugging we have attempted:

1. We use the precompiled MPI version of MCNPX, and the error happens with both v2.6 and v2.7e.

2. This DOES NOT happen when overloading the local CPU, i.e. sending 30 processes to a local dual-core CPU.

3. This happens both when using the -hosts option and when using the wmpiconfig utility to specify hosts with the -n option on the MPIEXEC command line.

4. The number of hosts seems not to be the cause, i.e. sending 5 processes to 2 hosts, 2 processes to 5 hosts, or 1 process to 10 hosts all work fine.

5. This seems to be MCNPX input-file dependent, even between input files which differ only by a few numbers in certain locations.

Could this be a communication or Windows Firewall issue? We are truly stumped and have had difficulty finding hints or answers in the forum.

Best regards and many thanks in advance, 

patrick

Patrick M. Hill, Ph.D. 

Washington University in St. Louis 

Department of Radiation Oncology 

The materials in this message are private and may contain Protected Healthcare Information. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

