[mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.

Jayesh Krishna jayesh at mcs.anl.gov
Thu Jun 21 15:29:54 CDT 2012


Hi,

# Did you run cpi across all the nodes (the nodes used for running MCNPX)?
# Have you tried running other MPI programs on the cluster? (Is it a new cluster?)
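
As a quick cross-check, you can launch cpi with the same host list the MCNPX run uses, so the same rank-to-rank connections get exercised (the path to cpi.exe below is a placeholder; adjust it for your install):

    mpiexec -hosts 4 Mercury-1 Mercury-2 Mercury-3 Mercury-4 \path\to\cpi.exe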

Regards,
Jayesh

----- Original Message -----
From: "Matthew Riblett" <riblem at rpi.edu>
To: mpich-discuss at mcs.anl.gov
Sent: Wednesday, June 20, 2012 2:18:52 PM
Subject: [mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.


Hello, 


I am attempting to run MCNPX in an MPI environment on a small cluster of computers (Dell PowerEdge servers running 64-bit Windows Server 2008 Standard). 
I am using the precompiled 64-bit MPI executables from RSICC. 
I've had success running the program on each of the four test servers when it is configured to run on a single host, and I can scale up to multiple processes on a single host. 
When I attempt to run the program across multiple hosts (e.g., -hosts 4 Mercury-1 Mercury-2 Mercury-3 Mercury-4), it returns a fatal error: 


master starting 3 by 1 subtasks 06/20/12 15:06:29 
master sending static commons... 
Fatal error in MPI_Send: Other MPI error, error stack: 
MPI_Send(173)................: MPI_Send(buf=0000000020E00000, count=236236, MPI_PACKED, dest=1, tag=4, MPI_COMM_WORLD) failed 
MPIDI_CH3I_Progress(402)........: 
MPID_nem_mpich2_blocking_recv(905)...: 
MPID_nem_newtcp_module_poll(37)......: 
MPID_nem_newtcp_module_connpoll(2656): 
gen_cnting_fail_handler(1739)........: connect failed - the semaphore timeout period has expired (errno 121) 


job aborted: 
rank: node: exit code[: error message] 
0: Mercury-1: 1: process 0 exited without calling finalize 
1: Mercury-2: 123 
2: Mercury-3: 123 
3: Mercury-4: 123 
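
If I'm reading the stack correctly, the underlying failure is a plain TCP connect timing out between ranks: errno 121 appears to be the Windows "semaphore timeout" error that Winsock reports when a connect gives up. That makes basic reachability checks between the nodes seem worth re-running, e.g.:

ping Mercury-2
nslookup Mercury-2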


I've looked at several archived posts that seemed to describe similar problems, such as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-August/010696.html. 
In each of those cases, however, the run made it past sending the static commons and reached the point of sending the dynamic commons. 


This is a rather large simulation (~600 MB), and I am curious whether its size may be playing a role in this error. 
Running the cpi.exe example, the hosts communicate with one another and execution completes without problems. 
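
To try to separate message size from MCNPX itself, a minimal two-rank test along these lines might help (this is only a sketch: the 256 MB buffer is a guess at a comparable scale, and the tag just mirrors the tag=4 in the error stack):

/* large_send.c -- minimal sketch: ship one large buffer from rank 0
 * to rank 1 to see whether sheer message size reproduces the
 * cross-host connect failure. The 256 MB size is an assumption,
 * not MCNPX's real payload. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 256 * 1024 * 1024;   /* bytes to send */
    char *buf = malloc(count);
    if (buf == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        /* one big point-to-point send, like "master sending static commons" */
        MPI_Send(buf, count, MPI_BYTE, 1, 4, MPI_COMM_WORLD);
        printf("rank 0: sent %d bytes\n", count);
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_BYTE, 0, 4, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: received %d bytes\n", count);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Run across two of the hosts (e.g., mpiexec -hosts 2 Mercury-1 Mercury-2 large_send.exe), this should show whether a large transfer alone triggers the timeout.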


I don't think this is a firewall issue, as both smpd.exe and mpiexec.exe are granted exceptions in the Windows Firewall. 
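
One thing I'm less sure about: the error stack shows the newtcp module opening direct rank-to-rank connections, so the MCNPX executable itself may also need an exception, not just the launcher processes. If so, a rule along these lines (the program path is a placeholder) should cover it:

netsh advfirewall firewall add rule name="MCNPX MPI" dir=in action=allow program="C:\path\to\mcnpx.exe" enable=yes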


Thanks in advance, 


-- Matt 


___ 
Matthew J. Riblett 
Nuclear Engineering Class '12 
Rensselaer Polytechnic Institute 
Rensselaer Radiation Measurement and Dosimetry Group 
American Nuclear Society, Section President 
MANE Department Student Advisory Council 

Email: riblem at rpi.edu 
Main: +1.646.843.9596 
Mobile: +1.804.245.0578 
Web: http://riblem.rpians.org