[mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.

Matthew Riblett riblem at rpi.edu
Fri Jun 22 09:33:50 CDT 2012


Jayesh,

First of all, thank you for the quick response.  To answer your questions, this is a new cluster and it is still going through its paces before being put online.  
I did run the cpi.exe and another MPI program from one of my colleagues across all the nodes without any incident.
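
For reference, the cpi test across the same four nodes was run with a command along these lines (the exact path to cpi.exe is from memory and assumes the executable is reachable at the same location on every node):

    mpiexec -hosts 4 Mercury-1 Mercury-2 Mercury-3 Mercury-4 cpi.exe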

When I came into work today, I restarted all of the nodes, ensured that their firewalls were not blocking the MPICH2 executables (smpd.exe and mpiexec.exe),
and tried executing the MCNPX program again.  As if by magic, that did the trick and the program executed without error.
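
Exceptions equivalent to the following netsh rules are in place on each node (the install path shown is the default MPICH2 location and may not match our machines exactly):

    netsh advfirewall firewall add rule name="MPICH2 smpd" dir=in action=allow program="C:\Program Files\MPICH2\bin\smpd.exe" enable=yes
    netsh advfirewall firewall add rule name="MPICH2 mpiexec" dir=in action=allow program="C:\Program Files\MPICH2\bin\mpiexec.exe" enable=yes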

However, a new issue has cropped up.  Running on a comparable single core, the MCNPX simulation in question takes 632 CPU minutes to complete.
In the first run through MPI, the simulation took 2875 CPU minutes over 12 cores and 16 processes (hyperthreading on four of the cores).
After disabling hyperthreading on all of the cores, I was able to bring the total CPU time down to 1420 minutes.
I'm trying to understand why this is occurring -- why would running the MPI version take over double the computational time?
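
(For a rough sense of scale, assuming the work were split evenly across processes: 2875 CPU minutes over 16 processes corresponds to roughly 180 wall-clock minutes (2875 / 16), and 1420 CPU minutes over 12 processes to roughly 118 wall-clock minutes (1420 / 12), versus 632 minutes for the single-core run.  So the wall-clock time does improve, even though the total CPU time summed over all processes more than doubles.)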

Thanks,

-- Matt

___
Matthew J. Riblett
Nuclear Engineering Class '12
Rensselaer Polytechnic Institute
Rensselaer Radiation Measurement and Dosimetry Group
American Nuclear Society, Section President
MANE Department Student Advisory Council

Email:    riblem at rpi.edu
Main:     +1.646.843.9596
Mobile:  +1.804.245.0578
Web:      http://riblem.rpians.org





On Jun 21, 2012, at 4:29 PM, Jayesh Krishna wrote:

> Hi,
> 
> # Did you run cpi across all the nodes (the nodes used for running MCNPX)?
> # Have you tried running other MPI programs on the cluster (Is it a new cluster?)?
> 
> Regards,
> Jayesh
> 
> ----- Original Message -----
> From: "Matthew Riblett" <riblem at rpi.edu>
> To: mpich-discuss at mcs.anl.gov
> Sent: Wednesday, June 20, 2012 2:18:52 PM
> Subject: [mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.
> 
> 
> Hello, 
> 
> 
> I am attempting to run MCNPX in an MPI environment on a small cluster of computers (Dell PowerEdge servers running 64-bit Windows Server 2008 Standard). 
> I am using the precompiled 64-bit MPI executables from RSICC. 
> I've had success running the process on each of the four test servers when configured to run on only one host, and I can scale up to run multiple processes on a single host. 
> When I attempt to run the program across multiple hosts (ex: -hosts 4 Mercury-1 Mercury-2 Mercury-3 Mercury-4) it returns a fatal error: 
> 
> 
> master starting 3 by 1 subtasks 06/20/12 15:06:29 
> master sending static commons... 
> Fatal error in MPI_Send: Other MPI error, error stack 
> MPI_Send(173)................: MPI_Send(buf=0000000020E00000, count=236236, MPI_PACKED, dest=1, tag=4 MPI_COMM_WORLD) failed 
> MPIDI_CH3I_Progress(402)........: 
> MPID_nem_mpich2_blocking_recv(905)...: 
> MPID_nem_newtcp_module_poll(37)......: 
> MPID_nem_newtcp_module_connpoll(2656): 
> gen_cnting_fail_handler(1739)........: connect failed - the semaphore timeout period has expired (errno 121) 
> 
> 
> job aborted: 
> rank: node: exit code[: error message] 
> 0: Mercury-1: 1: process 0 exited without calling finalize 
> 1: Mercury-2: 123 
> 2: Mercury-3: 123 
> 3: Mercury-4: 123 
> 
> 
> I've looked at several of the archived posts that seemed to have similar problems, such as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-August/010696.html . 
> In each case they passed the static commons sending point and got to the point where the program was sending dynamic commons. 
> 
> 
> This is a rather large simulation (~600 MB), and I was curious whether its size might be playing a role in this error. 
> Running the cpi.exe example, the hosts communicate with one another and there is no problem in execution. 
> 
> 
> I don't think this is a firewall issue as both smpd.exe and mpiexec.exe are granted exceptions in the Windows Firewall. 
> 
> 
> Thanks in advance, 
> 
> 
> -- Matt 
> 
> 
> 
> 
> ___ 
> Matthew J. Riblett 
> Nuclear Engineering Class '12 
> Rensselaer Polytechnic Institute 
> Rensselaer Radiation Measurement and Dosimetry Group 
> American Nuclear Society, Section President 
> MANE Department Student Advisory Council 
> 
> Email: riblem at rpi.edu 
> Main: +1.646.843.9596 
> Mobile: +1.804.245.0578 
> Web: http://riblem.rpians.org 
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
