[mpich-discuss] MPICH and NWCHEM

Gus Correa gus at ldeo.columbia.edu
Fri May 27 15:04:45 CDT 2011


Hi Christopher

Glad that you could get MPICH2 to work, despite all the
hurdles in your cluster.

I suppose now you can run cpi.c across any number of nodes,
with the clever mix of LSF #BSUB directives
and the MPICH2 ssh launcher you concocted, right?
This would show that your MPICH2 installation is sane and functional.
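
If you haven't tried that yet, a minimal test along these lines should do it
(the -n count, paths, and file names below are just placeholders, adapt them
to your site):

#BSUB -n 4
#BSUB -o cpi.%J.out
#BSUB -e cpi.%J.err
# compile the MPICH2 example with the MPICH2 compiler wrapper
mpicc -o cpi /path/to/mpich2-source/examples/cpi.c
# turn the LSF host list into a machinefile for the Hydra ssh launcher
for host in $LSB_HOSTS; do echo $host; done > hosts.$LSB_JOBID
mpiexec -launcher ssh -n 4 -machinefile hosts.$LSB_JOBID ./cpi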

Your cluster may be some derivative of the basic Rocks setup.
In vanilla-flavor Rocks you can choose between the
SGE and Torque (free) queuing systems, not LSF (proprietary).
I think some companies 'add value' to Rocks with proprietary queuing
systems (such as LSF), additional packages, etc.,
and this may be the case with your cluster.

I am afraid I cannot help with NWCHEM.
I don't know anything about computational Chemistry.
We're a climate, atmosphere, oceans, Earth science shop here.

You may have better luck if you ask the specific question about
non-convergence with 2 processes on the NWCHEM list,
if they have a mailing list.

Indeed, a properly written MPI program should give the same
results (to machine precision or close to it),
regardless of the number of processes used.
That it converges with 1 and with >2 processes,
but fails with 2 processes, is really weird.

One thing that comes to mind is whether there is some parameter in
your input file (ni2.nw) that needs to be adjusted according to the
number of processes that you use.
Did you look into the NWCHEM documentation for that file and parameters?

Another possibility is whether you rebuilt NWCHEM from scratch
after you reinstalled MPICH2.  There is probably a 'make cleanall' or
'make distclean' type of command to get rid of any residual object files
and libraries in the NWCHEM tree that may still be hanging around
from the old build.
If not, just fetch the tarball again and start fresh.
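
Either way, a rebuild against the new MPICH2 would look roughly like this
(I am guessing at the clean target and directory layout from memory,
so check the NWCHEM build notes for your version):

cd /path/to/nwchem/src
make clean                # or 'make realclean', if their makefiles provide it
# make sure the freshly installed MPICH2 is the one the build picks up
export PATH=/path/to/new/mpich2/bin:$PATH
make >& make.log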

Anyway, there are a number of computational chemists who do
read and post on *this* list,
who may have a better idea of what is going on with NWCHEM,
and hopefully one of them is reading your messages ...

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Christopher O'Brien wrote:
> Thank you Pavan and Gus for your suggestions. I have implemented some and commented below. 
> 
>> Is yours a Rocks cluster?
>> Node names such as compute-2-28 are typical from Rocks, although LSF is 
>> not a Rocks thing.
> Yes it uses Platform Rocks. I was under the impression that LSF was the queuing system used by Rocks, but I was wrong.
>> If it is Rocks, I'd suggest to install everything you need (MPICH2, 
>> NWCHEM, etc) in subdirectories of /share/apps, if you have permission 
>> to write there, or in subdirectories of your home directory.
>> These directories exist physically on the head node (frontend node in 
>> Rocks parlance).
>> Both are exported from the head node and NFS-mounted on all compute
>> nodes in a Rocks cluster.
>> Therefore, libraries and other software installed on those locations
>> are reachable by the compute nodes.
> True.
>> The error message seems to say that blaunch is missing on that 
>> particular compute node.
>> Can you launch serial processes (say a script with 'hostname;pwd')
>> via blaunch on any node?
> No, blaunch does not exist on my system, either on the head node or on the local drive of any node. I can run in serial by directly executing the program built with MPICH2, but not through the 'bsub' command.
>> To test MPICH2, try the
>> very simple cpi.c program in the MPICH2 'examples' source directory.
>> Compile it with mpicc from MPICH2.
>> This will tell you if your MPICH2 is functional.
>> This may clear the way before you try NWCHEM.
> This works on one node (2 processors). It will not work if you ask for more nodes as it gives the "can't find blaunch" error.
>> If the cluster is not being used by others, you can try to bypass
>> LSF and launch your MPICH2 jobs directly through mpiexec,
>> by providing a list of hosts (-hosts), or a config file (-configfile),
>> and using ssh as a launcher (-launcher), maybe setting also the working 
>> directory (-wdir).
>> See mpiexec.hydra -help for details.
> I did more research using your suggestion about changing launchers,
> and I actually had success. First, I compiled MPICH2 with
> "--with-pm=hydra" and added "-launcher ssh" to the mpirun
> command line (on my machine, MPICH2 links mpirun and mpiexec
> to mpiexec.hydra). The job submission script I used is included below.
> This setup also fixed the strange uneven distribution of processes
> across nodes. However, it did not solve another problem.
> 
> It might be NWCHEM's fault, but my calculations will not converge when
> requesting only 2 processors. It is odd, though, that running in serial
> or with a larger number of processors works fine. Does anyone have any
> further suggestions?
> 
> Best Regards,
> Chris
> 
> 
> #BSUB -o out.%J
> #BSUB -e err.%J
> #BSUB -n 4
> #BSUB -J ni2-6-31
> #BSUB -m rack2
> NPROC=4
> RUNDIR=`pwd`
> PROG=nwchem
> INP=ni2.nw
> rm -f nodes nodes_lsf   # -f: do not fail if the files are not there yet
> for host in $LSB_HOSTS
> do
> echo $host >> $RUNDIR/nodes_lsf
> done
> echo "----------------------------------------------------------------"
> echo " Invoking localnew for $PROG on $LSB_HOSTS"
> echo "----------------------------------------------------------------"
> mpirun -launcher ssh -n $NPROC -machinefile $RUNDIR/nodes_lsf $RUNDIR/$PROG $INP >out
> 
> 
> ===================================================================
> Christopher J. O'Brien
> cjobrien at ncsu.edu
> https://sites.google.com/a/ncsu.edu/cjobrien/
> 
> Ph.D. Candidate
> Computational Materials Group
> Department of Materials Science & Engineering
> North Carolina State University
> __________________________________________________________________
> Please send all documents in PDF. 
> For Word documents: Please use the 'Save as PDF' option before sending.
> ===================================================================
> 


