[mpich-discuss] mpi runs for 15 hours, using 7 mins cpu

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Wed Feb 4 13:45:09 CST 2009


Recompiled without --enable-threads=multiple.
Same result. Still checking whether Torque is reporting the CPU time correctly.
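To put "same result" in numbers: cput / (walltime x ncpus), taken from Torque's Resources and Limits lines, is the fraction of the four allocated cores' time that Torque actually charged to the job. Below is a minimal Python sketch. It assumes the comma-separated key=value format and the [HH:]MM:SS times shown in the outputs quoted below. The helper names are hypothetical and are not part of Torque.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a Torque-style [HH:]MM:SS string (hours may exceed 24) to seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_resources(line: str) -> dict:
    """Split 'cput=...,mem=...,walltime=...' into a dict of strings."""
    fields = {}
    for item in line.split(","):
        key, _, value = item.strip().partition("=")
        fields[key] = value
    return fields


def cpu_utilization(resources: str, ncpus: int) -> float:
    """Fraction of allocated CPU time charged to the job: cput / (walltime * ncpus)."""
    r = parse_resources(resources)
    return hms_to_seconds(r["cput"]) / (hms_to_seconds(r["walltime"]) * ncpus)


if __name__ == "__main__":
    # Resources lines copied from the two runs in this thread (ncpus=4 in both).
    runs = {
        "original run": "cput=00:06:55,mem=9964kb,vmem=279836kb,walltime=15:12:46",
        "nemesis run": "cput=00:00:08,mem=9960kb,vmem=279864kb,walltime=00:17:32",
    }
    for name, res in runs.items():
        print(f"{name}: {cpu_utilization(res, ncpus=4):.2%}")
```

For both runs in this thread, the ratio comes out to about 0.19%. If all four cores were busy the whole time, it would be close to 100%. That gap is consistent with the busy processes not being counted in cput.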




Rajeev Thakur wrote:
> Can you just try without the --enable-threads=multiple option? It is not
> needed. The default option is --enable-threads=runtime, which is more
> efficient. I am not sure if it will make any difference, but worth a try.
>
> It is also possible that Torque isn't reporting the CPU time correctly. 
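One way to check this independently of Torque is to time the launch command from inside the job script. Below is a minimal Python sketch. The script name measure_cpu.py and the run_and_measure helper are hypothetical. The script runs the given command and measures wall-clock time. It then reads the user and system CPU time of finished child processes with resource.getrusage(RUSAGE_CHILDREN). As the comment notes, that figure only covers the wrapper's own process tree. MPI ranks started by a separate daemon would not appear in it either.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd to completion and return (wall_seconds, child_cpu_seconds).

    RUSAGE_CHILDREN only includes children that have terminated and been
    waited for (plus their descendants, if those were waited for in turn).
    Processes launched outside this process tree are not counted.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on nonzero exit
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python measure_cpu.py mpiexec -n 4 ./program [args...]")
    else:
        wall, cpu = run_and_measure(sys.argv[1:])
        print(f"wall={wall:.1f}s cpu={cpu:.1f}s")
```

You can compare this CPU figure with top on the node and with Torque's cput to see which side is undercounting.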
>
> Rajeev
>
>   
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
>> Ellen Fitzpatrick
>> Sent: Wednesday, February 04, 2009 12:13 PM
>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>> Subject: Re: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
>>
>> Thanks. I recompiled with Nemesis, using the additional configure options below.
>> $ ./configure --prefix=/usr/local/mpich2.nemesis --enable-cxx \
>>     --enable-threads=multiple --with-thread-package=posix --enable-shared \
>>     --enable-sharedlibs=gcc --with-device=ch3:nemesis \
>>     --with-python=/usr/bin/python
>>
>> Ran my MPI jobs on a smaller dataset: run time ~17 minutes, with 8
>> seconds of CPU usage.
>>
>> Job run time/CPU usage with the Nemesis build:
>> Session:        13944
>> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
>> Resources:      cput=00:00:08,mem=9960kb,vmem=279864kb,walltime=00:17:32
>>
>>
>> Basically the same issue: long run times with minimal CPU usage.
>>
>>
>> Rajeev Thakur wrote:
>>> Hmm... Not sure what is going on here. Is your job expected to take 15
>>> hours? You may also want to try using the Nemesis communication channel in
>>> MPICH2, which will use shared memory for communication within a node and TCP
>>> (or other network) across nodes. Configure with --with-device=ch3:nemesis.
>>>
>>> Rajeev
>>>
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary 
>>>> Ellen Fitzpatrick
>>>> Sent: Wednesday, February 04, 2009 10:46 AM
>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>> Subject: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
>>>>
>>>> I have a dual-socket, dual-core Opteron cluster running CentOS 5,
>>>> torque-2.3.6, maui-3.2.6p21, mpich2-1.0.8 (64-bit), and a docking
>>>> program, parallel dock6. I installed dock6 serial as 32-bit, then
>>>> installed dock6 parallel as 32-bit.
>>>> I have configured my queues and scripts to run the dock MPI jobs, and
>>>> they run to completion without errors.
>>>>
>>>> The problem I am seeing is that my MPI job runs for a total of
>>>> 15 hours but uses only ~7 minutes of CPU time.
>>>> From the job's output file:
>>>> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
>>>> Resources:      cput=00:06:55,mem=9964kb,vmem=279836kb,walltime=15:12:46
>>>>
>>>> When the job is running, I log into the node and can see the CPUs at
>>>> 100%, so it is not sitting idle, and there is no NFS traffic to speak of.
>>>>
>>>> Has anyone run into this issue before? Is this an MPI issue?
>>>>
>>>> -- 
>>>> Thanks
>>>> Mary Ellen
>>>>
>>>>
>> -- 
>> Thanks
>> Mary Ellen
>>
>

-- 
Thanks
Mary Ellen


