[mpich-discuss] mpi runs for 15 hours, using 7 mins cpu

Gus Correa gus at ldeo.columbia.edu
Thu Feb 5 10:29:03 CST 2009


Hello Mary Ellen and list

My experience (admittedly with old versions of Torque,
actually PBS) is that the CPU time it reports is not
reliable, and is often very close to zero.
Monitoring the cpu with top, as you did, may give a
better indication of actual use; a quick sketch of what
I usually check is below.
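
Something along these lines, run on the compute node while the job is
active, usually tells you more than the accounting line does.  The
"dock6.mpi" name and the "node01" host are just placeholders for whatever
your executable and nodes are actually called:

  $ ssh node01                      # whichever node the job landed on
  $ ps -o pid,ppid,pcpu,cputime,etime,comm -C dock6.mpi
  $ top -b -n 1 | head -20          # one batch-mode snapshot of the busiest processes

If ps shows the ranks steadily accumulating cpu time while Torque's cput
stays near zero, it is the accounting that is off, not the job.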

On that particular cluster of ours,
the MPICH mpirun that ships with Myricom's Myrinet software
is not tightly integrated with Torque.
So I would guess that the mpiexec flavor you use,
whether the one from MPICH2 or the one from Torque/OSC,
also matters for the CPU time that Torque reports.

See:
http://www.osc.edu/~pw/mpiexec/index.php
(under "Description - Reasons to use ..." )

My two cents.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Mary Ellen Fitzpatrick wrote:
> Recompiled without the --enable-threads=multiple option.
> Same result.  I am checking whether Torque is reporting the cpu time correctly.
> 
> Rajeev Thakur wrote:
>> Can you just try without the --enable-threads=multiple option? It is not
>> needed. The default option is --enable-threads=runtime, which is more
>> efficient. I am not sure whether it will make any difference, but it is worth a try.
>>
>> It is also possible that Torque isn't reporting the CPU time correctly.
>> Rajeev
>>
>>  
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary Ellen 
>>> Fitzpatrick
>>> Sent: Wednesday, February 04, 2009 12:13 PM
>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>> Subject: Re: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
>>>
>>> Thanks, I recompiled with nemesis and the additional configure options below:
>>>
>>> $ ./configure --prefix=/usr/local/mpich2.nemesis --enable-cxx \
>>>       --enable-threads=multiple --with-thread-package=posix --enable-shared \
>>>       --enable-sharedlibs=gcc --with-device=ch3:nemesis \
>>>       --with-python=/usr/bin/python
>>>
>>> Ran my mpi jobs on a smaller dataset.  Run time was ~17 minutes, with 8
>>> seconds of cpu usage.
>>>
>>> The job runtime/cpu usage with the nemesis channel configured:
>>> Session:        13944
>>> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
>>> Resources:      cput=00:00:08,mem=9960kb,vmem=279864kb,walltime=00:17:32
>>>
>>>
>>> Basically, the same issue: long run times with minimal cpu usage.
>>>
>>>
>>> Rajeev Thakur wrote:
>>>>
>>>> Hmm... Not sure what is going on here. Is your job expected to take 15
>>>> hours? You may also want to try using the Nemesis communication channel in
>>>> MPICH2, which will use shared memory for communication within a node and TCP
>>>> (or other network) across nodes. Configure with --with-device=ch3:nemesis.
>>>> Rajeev
>>>>
>>>>
>>>>        
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mary Ellen 
>>>>> Fitzpatrick
>>>>> Sent: Wednesday, February 04, 2009 10:46 AM
>>>>> To: mpich-discuss at mcs.anl.gov; Mary Ellen Fitzpatrick
>>>>> Subject: [mpich-discuss] mpi runs for 15 hours, using 7 mins cpu
>>>>>
>>>>> I have a dual dual-core Opteron cluster running CentOS 5, torque-2.3.6,
>>>>> maui-3.2.6p21, mpich2-1.0.8 (64-bit), and a docking program, parallel
>>>>> dock6.  I installed dock6 serial as 32-bit, then installed dock6
>>>>> parallel as 32-bit.
>>>>> I have configured my queues and scripts to run the dock mpi jobs, and
>>>>> they do run to completion without errors.
>>>>>
>>>>> The problem I am seeing is that my mpi job is running for a total of
>>>>> 15 hours, but is using only ~7 minutes of cpu time.
>>>>> From the outfile:
>>>>> Limits:         ncpus=4,neednodes=1,nodes=1,walltime=48:00:00
>>>>> Resources:      cput=00:06:55,mem=9964kb,vmem=279836kb,walltime=15:12:46
>>>>>
>>>>> When the job is running, I log into the node and can see the cpu's
>>>>> at 100%, so it is not sitting idle and there is no nfs traffic
>>>>> to speak of.
>>>>>
>>>>> Anyone run into this issue before?  Is this an mpi issue?
>>>>>
>>>>> -- 
>>>>> Thanks
>>>>> Mary Ellen
>>>>>
>>>>>
>>>>>             
>>>>         
>>> -- 
>>> Thanks
>>> Mary Ellen
>>>
>>>
>>>     
>>
>>
>>   
> 



