[mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.
Jayesh Krishna
jayesh at mcs.anl.gov
Fri Jun 22 09:50:52 CDT 2012
Hi,
Good to know you are able to run your jobs now.
>> why would running the MPI version take over double the computational time?
I think this is a question for the MCNPX folks (it all depends on the algorithms used). Did the wall-clock time decrease?
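As a rough way to frame that comparison, the sketch below turns the CPU-minute figures quoted in this thread into per-process averages. It rests on two assumptions that the thread does not confirm: that the MPI figures are CPU time summed over all processes, and that the run without hyperthreading used 12 processes. The "wall-clock estimate" is only meaningful if the processes ran concurrently and were busy for most of the run; the function name `summarize` is made up for illustration.

```python
# Sketch: compare aggregate CPU time of the MPI runs against the serial run.
# ASSUMPTION: the MPI figures are CPU minutes summed over all processes.
# ASSUMPTION: the run with hyperthreading disabled used 12 processes.

SERIAL_CPU_MIN = 632.0  # single-core run, as reported in the thread

# label -> (total CPU minutes, number of processes)
runs = {
    "hyperthreading on": (2875.0, 16),
    "hyperthreading off": (1420.0, 12),
}


def summarize(total_cpu_min, n_procs, serial_cpu_min=SERIAL_CPU_MIN):
    """Return (CPU minutes per process, CPU overhead ratio, speedup estimate)."""
    per_proc = total_cpu_min / n_procs          # average CPU minutes per process
    overhead = total_cpu_min / serial_cpu_min   # total CPU relative to serial run
    est_speedup = serial_cpu_min / per_proc     # only valid if processes overlap fully
    return per_proc, overhead, est_speedup


for label, (total, n) in runs.items():
    per_proc, overhead, speedup = summarize(total, n)
    print(f"{label}: {per_proc:.1f} CPU min/process, "
          f"{overhead:.2f}x serial CPU, ~{speedup:.1f}x wall-clock estimate")
```

If the measured wall-clock time is close to the per-process figure, the higher total CPU time reflects parallel overhead, not a slower job. Only a timed run can confirm that.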
Regards,
Jayesh
----- Original Message -----
From: "Matthew Riblett" <riblem at rpi.edu>
To: mpich-discuss at mcs.anl.gov
Cc: "Jayesh Krishna" <jayesh at mcs.anl.gov>
Sent: Friday, June 22, 2012 9:33:50 AM
Subject: Re: [mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.
Jayesh,
First of all, thank you for the quick response. To answer your questions, this is a new cluster and it is still going through its paces before being put online.
I did run the cpi.exe and another MPI program from one of my colleagues across all the nodes without any incident.
When I came into work today, I restarted all of the nodes, ensured that their firewalls were down for MPICH2 (smpd.exe and mpiexec.exe)
and tried executing the MCNPX program again. As if by magic, that did the trick and the program executed without error.
However, a new issue has cropped up. Running on a comparable single core, the MCNPX simulation in question takes 632 CPU minutes to complete.
In the first run through MPI, the simulation took 2875 CPU minutes over 12 cores and 16 processes (hyperthreading on four of the cores).
After disabling hyperthreading on all of the cores, I was able to bring the total CPU time down to 1420 minutes.
I'm trying to understand why this is occurring -- why would running the MPI version take over double the computational time?
Thanks,
-- Matt
___
Matthew J. Riblett
Nuclear Engineering Class '12
Rensselaer Polytechnic Institute
Rensselaer Radiation Measurement and Dosimetry Group
American Nuclear Society, Section President
MANE Department Student Advisory Council
Email: riblem at rpi.edu
Main: +1.646.843.9596
Mobile: +1.804.245.0578
Web: http://riblem.rpians.org
On Jun 21, 2012, at 4:29 PM, Jayesh Krishna wrote:
Hi,
# Did you run cpi across all the nodes (the nodes used for running MCNPX)?
# Have you tried running other MPI programs on the cluster? Is it a new cluster?
Regards,
Jayesh
----- Original Message -----
From: "Matthew Riblett" <riblem at rpi.edu>
To: mpich-discuss at mcs.anl.gov
Sent: Wednesday, June 20, 2012 2:18:52 PM
Subject: [mpich-discuss] Issue running MCNPX on small cluster: Error sending static commons.
Hello,
I am attempting to run MCNPX in an MPI environment on a small cluster of computers (Dell PowerEdge servers running 64-bit Windows Server 2008 Standard).
I am using the precompiled 64-bit MPI executables from RSICC.
I've had success running the program on each of four test servers when it is configured to run on only one host, and I can scale up to multiple processes on a single host.
When I attempt to run the program across multiple hosts (e.g., -hosts 4 Mercury-1 Mercury-2 Mercury-3 Mercury-4), it returns a fatal error:
master starting 3 by 1 subtasks 06/20/12 15:06:29
master sending static commons...
Fatal error in MPI_Send: Other MPI error, error stack
MPI_Send(173)................: MPI_Send(buf=0000000020E00000, count=236236, MPI_PACKED, dest=1, tag=4 MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(402)........:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2656):
gen_cnting_fail_handler(1739)........: connect failed - the semaphore timeout period has expired (errno 121)
job aborted:
rank: node: exit code[: error message]
0: Mercury-1: 1: process 0 exited without calling finalize
1: Mercury-2: 123
2: Mercury-3: 123
3: Mercury-4: 123
I've looked at several of the archived posts that seemed to have similar problems, such as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-August/010696.html .
In each case they passed the static commons sending point and got to the point where the program was sending dynamic commons.
This is a rather large simulation (~600 MB), and I was curious whether its size may be playing a role in this error.
Running the cpi.exe example, the hosts communicate with one another and there is no problem in execution.
I don't think this is a firewall issue as both smpd.exe and mpiexec.exe are granted exceptions in the Windows Firewall.
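One way to test the firewall assumption independently of MPI is a plain TCP connection check from the master node to each worker. The sketch below uses only the Python standard library. The helper names `can_connect` and `check_hosts` are made up for illustration, and the port to probe is an assumption: the error stack above shows the failing connect coming from the MPI processes themselves (MPID_nem_newtcp_module_connpoll), which may use ports other than the one smpd listens on.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port, timeout=3.0):
    """Map each host name to whether a TCP connection to the port succeeded."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Example call, run from Mercury-1 (host names taken from this thread).
# ASSUMPTION: PORT is a port on which something is listening on each worker;
# choose it to match the local setup.
#
#   PORT = ...  # fill in for the local configuration
#   for host, ok in check_hosts(["Mercury-2", "Mercury-3", "Mercury-4"], PORT).items():
#       print(host, "reachable" if ok else "NOT reachable")
```

A host that shows as not reachable here while cpi.exe still runs would suggest that the firewall exceptions cover smpd.exe and mpiexec.exe but not the traffic between the application processes.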
Thanks in advance,
-- Matt