[mpich-discuss] MPICH and NWCHEM

Anthony Chan chan at mcs.anl.gov
Sat May 21 19:41:59 CDT 2011


Chris,

Neither MPICH-1 nor MPICH2 has a version 1.4.6, so please clarify
which version of MPICH* you are actually using.  If you are using the
old MPICH-1, please upgrade to the latest MPICH2.

A.Chan



----- Original Message -----
> I am very concerned about the strange distribution of processes
> spawned while running NWCHEM. NWCHEM developers think the problem is
> related to mpich.
> 
> I am using a cluster with 2 Xeons per node and 2GB of total memory. I
> used GCC 4.6 and GNU's MPICH 1.4.6 to compile NWCHEM-6.0. NWCHEM
> passes its tests and is compiled correctly. It even obtains
> reasonable speedup in CPU time, but quickly becomes bogged down in
> communication time as the number of processes increases. However,
> NWCHEM is unbelievably slow compared to benchmarks I have seen, even
> considering that my cluster uses standard Ethernet interconnects. In
> one case, I find:
> PROCS   CPU Time (s)   Wall Time (s)
>     1          405.8           410.4
>     2          196.1           230.2
>     4          153.9           304.1
>     8           85.6           403.8
>    16           62.2           595.2
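[Editor's note: the scaling in the table above can be summarized as parallel speedup S(n) = T(1)/T(n) and efficiency E(n) = S(n)/n. A minimal sketch, using only the figures quoted in the table, makes the wall-clock slowdown explicit:]

```python
# Speedup and efficiency for the NWCHEM timings quoted above.
# S(n) = T(1) / T(n); E(n) = S(n) / n.
procs = [1, 2, 4, 8, 16]
cpu   = [405.8, 196.1, 153.9, 85.6, 62.2]   # CPU time (s)
wall  = [410.4, 230.2, 304.1, 403.8, 595.2]  # wall time (s)

for n, c, w in zip(procs, cpu, wall):
    s_cpu, s_wall = cpu[0] / c, wall[0] / w
    print(f"{n:>2} procs: CPU speedup {s_cpu:4.2f} (eff {s_cpu/n:4.2f}), "
          f"wall speedup {s_wall:4.2f} (eff {s_wall/n:4.2f})")
```

At 16 processes the CPU-time speedup is about 6.5x, but the wall-clock "speedup" is about 0.69x, i.e. the run is actually slower than on a single process once communication dominates.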
> I'm not too surprised that I don't need many processors; the job is
> really small. But the large communication time made me suspicious, and
> further investigation revealed a strange distribution of processes.
> 
> For example, I have posted two screen captures on my site (see below),
> one for each node. In short, when I use my LSF job submission system
> (the bsub command), I find that the processes are not evenly distributed
> across the nodes. For example, if I request 4 processors and have two
> Xeons per node, I should be using 2 nodes. NWCHEM does in fact use 2
> nodes and reports:
> > ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP
> > Sockets'.
> However, inspecting the actual number of running processes on each of
> the two nodes shows:
> master node -------- 2 processes
> slave node --------- 7 processes
> I contacted the NWCHEM developers who replied "As to the extra
> processes, these should be some extra thread processes internal to
> NWChem that indeed do not do much (they can be woken up in certain
> communication operations)." But why are there different numbers of
> auxiliary processes on each node?
> 
> To avoid filling up everyone's inbox, I posted screen shots of the
> running processes to:
> https://sites.google.com/a/ncsu.edu/cjobrien/file-exchange
> 
> Regards,
> Chris O'Brien
> 
> ===================================================================
> Christopher J. O'Brien
> cjobrien at ncsu.edu
> https://sites.google.com/a/ncsu.edu/cjobrien/
> 
> Ph.D. Candidate
> Computational Materials Group
> Department of Materials Science & Engineering
> North Carolina State University
> __________________________________________________________________
> Please send all documents in PDF.
> For Word documents: Please use the 'Save as PDF' option before
> sending.
> ===================================================================
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
