[mpich-discuss] mpi runs for 15 hours, using 7 mins cpu

Rajeev Thakur thakur at mcs.anl.gov
Wed Feb 4 12:32:05 CST 2009


Can you try without the --enable-threads=multiple option? It is not
needed. The default, --enable-threads=runtime, is more efficient. I am
not sure it will make any difference, but it is worth a try.
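
As a rough sketch of what that rebuild might look like: the flags are copied
from the configure line quoted below, with only --enable-threads=multiple
dropped so the default applies. Keeping --with-thread-package=posix and the
same install prefix is an assumption, not a recommendation.

```shell
# Same configure line as in the quoted message, minus --enable-threads=multiple.
# The prefix and the remaining flags are taken unchanged from that message.
./configure --prefix=/usr/local/mpich2.nemesis --enable-cxx \
    --with-thread-package=posix --enable-shared \
    --enable-sharedlibs=gcc --with-device=ch3:nemesis \
    --with-python=/usr/bin/python
```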

It is also possible that Torque isn't reporting the CPU time correctly. 
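
One way to put a number on the discrepancy is to convert the cput and
walltime values from the Resources line to seconds and compare them. The
following is a minimal sketch, assuming a POSIX shell with awk available; the
function names to_seconds and cpu_percent are made up for illustration, and
the 4 is the ncpus value from the Limits line.

```shell
# Convert an HH:MM:SS string (as printed by Torque) to seconds.
# awk treats fields such as "08" as decimal numbers, so no octal surprises.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Percentage of the available CPU time (walltime * ncpus) that was charged.
# Arguments: cput walltime ncpus
cpu_percent() {
    cput=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v c="$cput" -v w="$wall" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * c / (w * n) }'
}

cpu_percent 00:06:55 15:12:46 4   # the 15-hour run
cpu_percent 00:00:08 00:17:32 4   # the 17-minute run built with nemesis
```

Both runs come out well under 1% of the CPU time that four fully busy cores
would accumulate, which is what one would expect to see if the CPU time were
being under-reported rather than the cores actually sitting idle.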

Rajeev

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
> Ellen Fitzpatrick
> Sent: Wednesday, February 04, 2009 12:13 PM
> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> Subject: Re: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
> 
> Thanks, I recompiled with nemesis with additional configs below. 
> $ ./configure --prefix=/usr/local/mpich2.nemesis --enable-cxx 
> --enable-threads=multiple --with-thread-package=posix --enable-shared 
> --enable-sharedlibs=gcc --with-device=ch3:nemesis 
> --with-python=/usr/bin/python
> 
> Ran my mpi jobs on a smaller dataset.  Run time ~17 minutes with 8 
> seconds of cpu usage....
> 
> The job runtime/cpu usage with the nemesis configured:
> Session:        13944
> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
> Resources:      cput=00:00:08,mem=9960kb,vmem=279864kb,walltime=00:17:32
> 
> 
> Basically, the same issue, long run times, with minimal cpu usage.
> 
> 
> Rajeev Thakur wrote:
> > Hmm... Not sure what is going on here. Is your job expected to take 15
> > hours? You may also want to try using the Nemesis communication channel in
> > MPICH2, which will use shared memory for communication within a node and TCP
> > (or other network) across nodes. Configure with --with-device=ch3:nemesis.
> >
> > Rajeev
> >
> >
> >   
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov 
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
> >> Ellen Fitzpatrick
> >> Sent: Wednesday, February 04, 2009 10:46 AM
> >> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
> >> Subject: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
> >>
> >> I have a dual dual-core Opteron cluster running CentOS 5, torque-2.3.6,
> >> maui-3.2.6p21, mpich2-1.0.8 (64-bit) and a docking program, parallel
> >> dock6.  I installed dock6 serial as 32-bit, then installed dock6
> >> parallel as 32-bit.
> >> I have configured my queues and scripts to run the dock mpi jobs and
> >> they do run to completion without errors.
> >>
> >> The problem I am seeing is that my mpi job is running for a total of
> >> 15 hours, but is using only ~7 minutes of cputime.
> >> outfile
> >> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
> >> Resources:      cput=00:06:55,mem=9964kb,vmem=279836kb,walltime=15:12:46
> >>
> >> When the job is running, I log into the node and can see the CPUs at
> >> 100%, so it is not sitting idle, and there is no NFS traffic to speak of.
> >>
> >> Anyone run into this issue before?  Is this an mpi issue?
> >>
> >> -- 
> >> Thanks
> >> Mary Ellen
> >>
> >>
> >>     
> >
> >
> >   
> 
> -- 
> Thanks
> Mary Ellen
> 
> 


