[mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period

chong tan chong_guan_tan at yahoo.com
Mon Jul 13 14:24:10 CDT 2009


The application does use threads, 2 to be exact: the main thread and a recording thread.  The recording thread is not 
activated in this test.

Based on my knowledge of the application, MPI is the only code that can become 'blocked' besides the recording thread.
No other mutex exists in the code.  Since the recording thread is not used, MPI is the only suspect.  Besides, the same code
does not show any issue like this when linked with 1.0.6, so the combination of threading and MPI is the most likely cause.

I will try the MPICH_NO_LOCAL setting when I am done with my current run.  That is likely to be tomorrow.

BTW, the executable minus the test data will not reproduce the problem.  I will contact the appropriate department 
for a free license of our application.

thanks
tan

 



________________________________
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Monday, July 13, 2009 11:47:38 AM
Subject: Re: [mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period


Is there a simpler example of this that you can send us?  If nothing
else, a binary would be ok.

Does the program that takes the 1 minute "nap" use threads?  If so, how
many threads does each process create?

Can you find out what the processes (or threads if it's multithreaded)
are doing during this time?  E.g., are they in an mpi call?  Are they
blocking on a mutex?  If so, can you tell us what line number it's
blocked on?

Can you try this without shared memory by setting the environment
variable MPICH_NO_LOCAL to 1 and see if you get the same problem?
  MPICH_NO_LOCAL=1 mpiexec -n 4 ...

Thanks,
-d



On 07/13/2009 01:35 PM, chong tan wrote:
> Sorry can't do that.  The benchmark involves 2 things.  One from my
> customer which
> I am not allowed to distribute.    I may be able to get a limited
> license of my product
> for you to try, but I definitely can not send source code.
>  
> tan
>  
> 
> ------------------------------------------------------------------------
> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Monday, July 13, 2009 10:54:50 AM
> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
> processes become idle for extensive period
> 
> 
> Can you send us the benchmark you're using?  This will help us figure
> out what's going on.
> 
> Thanks,
> -d
> 
> On 07/13/2009 12:36 PM, chong tan wrote:
>>
>> thanks darius,
>> 
>> When I did the comparison (or benchmarking), I had 2 identical source
>> trees.  Everything
>> was recompiled from the ground up and compiled/linked according to the version
>> of MPICH2
>> to be used.
>> 
>> I have many tests; this is the only one showing this behavior, and it is
>> predictably repeatable.
>> Most of my tests show comparable performance, and many do better
>> with 1.1.
>> 
>> The 'weirdest' thing is the ~1 minute span where there is no activity on
>> the box at all, zero
>> activity except 'top', with machine load at around 0.12.  I don't know
>> how to explain this
>> 'behavior', and I am extremely curious if anyone can explain this.
>> 
>> I can't repeat this on AMD boxes as I don't have one that has only 32G
>> of memory.  I can't
>> repeat this on a Niagara box as thread multiple won't build.
>> 
>> I will try to rebuild 1.1 without thread-multiple.  Will keep you posted.
>> 
>> Meanwhile, if anyone has any speculations on this, please bring them up.
>> 
>> thanks
>> tan
>> 
>> ------------------------------------------------------------------------
>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>> *To:* mpich-discuss at mcs.anl.gov
>> *Sent:* Monday, July 13, 2009 8:30:19 AM
>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>> processes become idle for extensive period
>>
>> Tan,
>>
>> Did you just re-link the applications, or did you recompile them?
>> Version 1.1 is most likely not binary compatible with 1.0.6, so you
>> really need to recompile the application.
>>
>> Next, don't use the --enable-threads=multiple flag when configuring
>> mpich2.  By default, mpich2 supports all thread levels and will select
>> the thread level at run time (depending on the parameters passed to
>> MPI_Init_thread).  By allowing the thread level to be selected
>> automatically at run time, you'll avoid the overhead of thread safety
>> when it's not needed, allowing your non-threaded applications to run
>> faster.
>>
>> Let us know if either of these fixes the problem, especially if just
>> removing the --enable-threads option fixes this.
>>
>> Thanks,
>> -d
>>
>> On 07/10/2009 06:19 PM, chong tan wrote:
>>> I am seeing this funny situation which I did not see on 1.0.6 and
>>> 1.0.8.  Some background:
>>>
>>> machine : INTEL 4Xcore 2
>>>
>>> running mpiexec -n 4
>>>
>>> machine has 32G of mem.
>>>
>>> when my application runs,  almost all memory is used.  However, there
>>> is no swapping.
>>> I have exclusive use of the machine, so contention is not an issue.
>>>
>>> issue #1 :  processes take extra long to be initialized, compared to
>>> 1.0.6
>>> issue #2 : during the run, at times all of them will become idle at the
>>> same time, for almost a
>>>                minute.  We never observed this with 1.0.6
>>>
>>>
>>> The codes are the same, only linked with different versions of MPICH2.
>>>
>>> MPICH2 was built with --enable-threads=multiple for 1.1.  without for
>>> 1.0.6 or 1.0.8
>>>
>>> MPI calls are all in the main application thread.  I used only 4 MPI
>>> functions :
>>> init(), Send(), Recv() and Barrier().
>>>
>>>
>>>
>>> any suggestion ?
>>>
>>> thanks
>>> tan
>>>
>>>
>>>
>>>  
>>>
>>>
>>
> 



      