[mpich-discuss] MPICH and NWCHEM

Christopher O'Brien cjobrien at ncsu.edu
Fri May 27 14:28:39 CDT 2011


Thank you, Pavan and Gus, for your suggestions. I have implemented some of them and commented below.

> Is yours a Rocks cluster?
> Node names such as compute-2-28 are typical from Rocks, although LSF is 
> not a Rocks thing.
Yes, it uses Platform Rocks. I was under the impression that LSF was the queuing system used by Rocks, but I was wrong.
> 
> If it is Rocks, I'd suggest to install everything you need (MPICH2, 
> NWCHEM, etc.) in subdirectories of /share/apps, if you have permission 
> to write there, or in subdirectories of your home directory.
> These directories exist physically on the head node (frontend node in 
> Rocks parlance).
> Both are exported from the head node and NFS-mounted on all compute
> nodes in a Rocks cluster.
> Therefore, libraries and other software installed on those locations
> are reachable by the compute nodes.
True.
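For reference, installing into /share/apps looks roughly like this on my end (the tarball name, version, and prefix below are placeholders, not the exact ones I used):

# Unpack and build MPICH2 under the NFS-exported /share/apps
tar xzf mpich2-x.y.z.tar.gz
cd mpich2-x.y.z
./configure --prefix=/share/apps/mpich2
make && make install
# Then add /share/apps/mpich2/bin to PATH so the compute nodes pick it up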
> The error message seems to say that blaunch is missing on that 
> particular compute node.
> Can you launch serial processes (say a script with 'hostname;pwd')
> via blaunch on any node?
No, blaunch does not exist on my system, either on the head node or on the local drive of any node. I can run in serial by directly executing the program built with MPICH2, but not through the 'bsub' command.
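For what it's worth, my understanding of the test you describe is something like the following (the host name is just an example from this cluster; the commands fail here simply because the blaunch binary is not present):

# A trivial script that reports where it ran
cat > testjob.sh << 'EOF'
#!/bin/sh
hostname
pwd
EOF
chmod +x testjob.sh
# Launch it on one compute node through LSF's blaunch
blaunch compute-2-28 ./testjob.sh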
> To test MPICH2, try the
> very simple cpi.c program in the MPICH2 'examples' source directory.
> Compile it with mpicc from MPICH2.
> This will tell you if your MPICH2 is functional.
> This may clear the way before you try NWCHEM.
This works on one node (2 processors). It does not work if I ask for more nodes; it gives the "can't find blaunch" error.
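Concretely, this is roughly what I ran (paths are relative to my MPICH2 source tree and approximate):

# Build the cpi example with the MPICH2 compiler wrapper
cd mpich2-x.y.z/examples
mpicc -o cpi cpi.c
# Two processes on a single node work fine:
mpiexec -n 2 ./cpi
# Requesting more nodes through LSF is where the "can't find blaunch" error shows up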
> If the cluster is not being used by others, you can try to bypass
> LSF and launch your MPICH2 jobs directly through mpiexec,
> by providing a lists of hosts (-hosts), or a config file (-configfile),
> and using ssh as a launcher (-launcher), maybe setting also the working 
> directory (-wdir).
> See mpiexec.hydra -help for details.
I did more research based on your suggestion about changing launchers, and I had some success. First, I compiled MPICH2 with "--with-pm=hydra" and added "-launcher ssh" to the mpirun command line (on my machine, MPICH2 links mpirun and mpiexec to mpiexec.hydra). I was then able to use the job submission script included below. This setup also fixed the strange behavior of processes being distributed unevenly across the nodes. However, it did not solve another problem.
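In case it helps anyone else, the relevant pieces were roughly as follows (the install prefix, machine file, and test binary are placeholders matching the script and the cpi example above):

# Build MPICH2 with the Hydra process manager
./configure --with-pm=hydra --prefix=/share/apps/mpich2
make && make install

# Launch over ssh instead of LSF's (missing) blaunch
mpiexec -launcher ssh -n 4 -machinefile nodes_lsf ./cpi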

It might be NWCHEM's fault, but my calculations will not converge when requesting only 2 processors. It is odd that running in serial or with a larger number of processors works fine. Does anyone have any further suggestions?

Best Regards,
Chris


#!/bin/bash
#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 4
#BSUB -J ni2-6-31
#BSUB -m rack2

NPROC=4
RUNDIR=`pwd`
PROG=nwchem
INP=ni2.nw

# Rebuild the machine file from the hosts LSF allocated to this job
rm -f $RUNDIR/nodes $RUNDIR/nodes_lsf
for host in $LSB_HOSTS
do
    echo $host >> $RUNDIR/nodes_lsf
done

echo "----------------------------------------------------------------"
echo " Invoking $PROG via mpirun on: $LSB_HOSTS"
echo "----------------------------------------------------------------"

# Use the ssh launcher so Hydra does not look for LSF's blaunch
mpirun -launcher ssh -n $NPROC -machinefile $RUNDIR/nodes_lsf $RUNDIR/$PROG $INP > out
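For completeness, the script above is submitted to LSF in the usual way (the file name is just whatever I saved it as):

bsub < run_ni2.sh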


===================================================================
Christopher J. O'Brien
cjobrien at ncsu.edu
https://sites.google.com/a/ncsu.edu/cjobrien/

Ph.D. Candidate
Computational Materials Group
Department of Materials Science & Engineering
North Carolina State University
__________________________________________________________________
Please send all documents in PDF. 
For Word documents: Please use the 'Save as PDF' option before sending.
===================================================================


