[mpich-discuss] sge and mpich2

c cook csecook at gmail.com
Mon Jun 20 16:25:04 CDT 2011


Hello,

I have been using sge6.01u4 to run serial jobs for some time.

The cluster I am using has 8+1 nodes with Opteron processors.

I want to take advantage of this, as the software I am using has a parallel
version.
So I've installed mpich2 as the parallel environment and activated the mpd
daemon. Running mpdtrace -l shows all 8 slave nodes plus the head node.
Now I am submitting the job using this script:

#!/bin/bash
#$ -S /bin/bash
#$ -o test.log
#$ -e test.err
#$ -N TEST_Parallel
#$ -pe mpich 2
#$ -cwd

mpiexec -n $NSLOTS siesta <input> output

The job is submitted, and qstat shows it as running, but
no output is produced. This goes on for days: nothing happens, and the
job stays in the queue with status "r" forever.
The only information I get is in the test.log file:

-catch_rsh
/home/sge6.01u4/default/spool/cn105/active_jobs/4049.1/pe_hostfile
cn105
cn102

So it seems that the scheduler did its job.
There is nothing in test.err; the output file is created, but it is empty.
The nodes are named cn101 to cn108.
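
For reference, the pe_hostfile named in the log is a plain text file with one line per granted host. A minimal sketch of how it could be turned into host:slots lines, assuming the usual SGE layout of host, slot count, queue and processor range on each line; the function name, queue names and slot counts below are made up for illustration:

```shell
#!/bin/bash
# Sketch: turn an SGE pe_hostfile into "host:slots" lines.
# Assumption: each line is "<host> <slots> <queue> <processor range>".
pe_hostfile_to_machines() {
    # $1 = path to the pe_hostfile (inside a job, SGE exports it as $PE_HOSTFILE)
    awk '{ print $1 ":" $2 }' "$1"
}

# Demo on a hypothetical sample so the sketch can be tried outside a job.
sample=$(mktemp)
printf '%s\n' 'cn105 1 all.q@cn105 UNDEFINED' 'cn102 1 all.q@cn102 UNDEFINED' > "$sample"
pe_hostfile_to_machines "$sample"
rm -f "$sample"
```

Inside a job script the same function could be called on $PE_HOSTFILE and its output saved to a file, to compare the hosts granted by the scheduler with the hosts listed by mpdtrace.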


The serial version works fine, this is the script I am using

#!/bin/bash
#$ -S /bin/bash
#$ -o test.log
#$ -e test.err
#$ -N TEST
#$ -cwd

 siesta <input> output


I may have missed something during the installation of mpich2.

Maybe some of you have encountered similar problems; any ideas are welcome.

Thanks,
Eli