[mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period

chong tan chong_guan_tan at yahoo.com
Wed Jul 15 16:13:28 CDT 2009



I will do that when I find the time and can get the machine for my exclusive use.
Meanwhile, I reran the same test on a different machine, an AMD 4x dual with 64 GB,
and I did not see the 'nap'.  Given that AMD is more performance-sensitive to the
use of global shared memory, I think there is something peculiar about the
'memory almost all consumed' situation.

tan



________________________________
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Tuesday, July 14, 2009 8:05:00 AM
Subject: Re: [mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period


Can you attach a debugger to process 0 and see what it's doing during
this nap?  Once process 0 finishes the nap, does it send/receive the
messages to/from the other processes, and do things continue normally?

Thanks,
-d

On 07/13/2009 08:57 PM, chong tan wrote:
> this is the sequence of MPI calls that leads to the 'nap' (all numbers
> represent proc IDs per MPICH2):
>  
> 0 send to 1, received by 1
> 0 send to 2, received by 2
> 0 send to 3, received by 3
> <application activities, shm called>
> 1 blocking send to 0, send buffered
> 1 calls blocking receive from 0
> 3 blocking send to 0, send buffered
> 3 calls blocking receive from 0
> 2 blocking send to 0, send buffered
> 2 calls blocking receive from 0
> <proc 0 executes some application activities>
> proc 0 becomes idle
> <nap time>
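A minimal C/MPI sketch of the exchange pattern described above, assuming 4 ranks; the tags, one-int payloads, and placement of the interleaved work are made up for illustration, and the real application's buffers and activities differ:

```c
/* Sketch of the reported message pattern (hypothetical payloads/tags).
 * Build with mpicc, run with: mpiexec -n 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 4) {
        if (rank == 0) fprintf(stderr, "run with mpiexec -n 4\n");
        MPI_Finalize();
        return 1;
    }
    if (rank == 0) {
        /* 0 sends to 1, 2, 3; each is received on the other side */
        for (int dst = 1; dst <= 3; dst++)
            MPI_Send(&buf, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        /* <application activities would happen here> */
        /* pick up each worker's blocking send ... */
        for (int i = 1; i <= 3; i++)
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then answer each worker's blocking receive */
        for (int dst = 1; dst <= 3; dst++)
            MPI_Send(&buf, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
    } else if (rank <= 3) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* blocking send to 0 (small message, so typically buffered) */
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        /* blocking receive from 0 -- the span where the 'nap' shows up */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

This requires an MPI installation (e.g. MPICH2) to compile and run; it only reproduces the call ordering, not the application's memory pressure.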
>  
>  
>  
> This is rather strange; it only happens on this particular test.  Hope
> this info helps.
>  
> tan
> 
> ------------------------------------------------------------------------
> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Monday, July 13, 2009 11:47:38 AM
> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
> processes become idle for extensive period
> 
> 
> Is there a simpler example of this that you can send us?  If nothing
> else, a binary would be ok.
> 
> Does the program that takes the 1 minute "nap" use threads?  If so, how
> many threads does each process create?
> 
> Can you find out what the processes (or threads if it's multithreaded)
> are doing during this time?  E.g., are they in an mpi call?  Are they
> blocking on a mutex?  If so, can you tell us what line number it's
> blocked on?
> 
> Can you try this without shared memory by setting the environment
> variable MPICH_NO_LOCAL to 1 and see if you get the same problem?
>   MPICH_NO_LOCAL=1 mpiexec -n 4 ...
> 
> Thanks,
> -d
> 
> 
> 
> On 07/13/2009 01:35 PM, chong tan wrote:
>> Sorry, I can't do that.  The benchmark involves two things.  One comes
>> from my customer, which I am not allowed to distribute.  I may be able
>> to get a limited license of my product for you to try, but I
>> definitely cannot send source code.
>> 
>> tan
>> 
>>
>> ------------------------------------------------------------------------
>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>> *To:* mpich-discuss at mcs.anl.gov
>> *Sent:* Monday, July 13, 2009 10:54:50 AM
>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>> processes become idle for extensive period
>>
>>
>> Can you send us the benchmark you're using?  This will help us figure
>> out what's going on.
>>
>> Thanks,
>> -d
>>
>> On 07/13/2009 12:36 PM, chong tan wrote:
>>>
>>> thanks darius,
>>>
>>> When I did the comparison (or benchmarking), I had two identical
>>> source trees.  Everything was recompiled from the ground up and
>>> compiled/linked according to the version of MPICH2 to be used.
>>>
>>> I have many tests; this is the only one showing this behavior, and it
>>> is predictably repeatable.  Most of my tests show comparable
>>> performance, and many do better with 1.1.
>>>
>>> The weirdest thing is the ~1 minute span where there is no activity
>>> on the box at all, zero activity except 'top', with machine load at
>>> around 0.12.  I don't know how to explain this behavior, and I am
>>> extremely curious whether anyone can.
>>>
>>> I can't repeat this on AMD boxes, as I don't have one with only 32 GB
>>> of memory.  I can't repeat it on a Niagara box, as thread-multiple
>>> won't build there.
>>>
>>> I will try to rebuild 1.1 without thread-multiple.  Will keep you posted.
>>>
>>> Meanwhile, if anyone has any speculations on this, please bring them up.
>>>
>>> thanks
>>> tan
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>>> *To:* mpich-discuss at mcs.anl.gov
>>> *Sent:* Monday, July 13, 2009 8:30:19 AM
>>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>>> processes become idle for extensive period
>>>
>>> Tan,
>>>
>>> Did you just re-link the applications, or did you recompile them?
>>> Version 1.1 is most likely not binary compatible with 1.0.6, so you
>>> really need to recompile the application.
>>>
>>> Next, don't use the --enable-threads=multiple flag when configuring
>>> mpich2.  By default, mpich2 supports all thread levels and will select
>>> the thread level at run time (depending on the parameters passed to
>>> MPI_Init_thread).  By allowing the thread level to be selected
>>> automatically at run time, you'll avoid the overhead of thread safety
>>> when it's not needed, allowing your non-threaded applications to run
>>> faster.
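A sketch of the run-time selection described above, assuming a single-threaded application; in that case the app can request only MPI_THREAD_SINGLE instead of relying on a thread-multiple build:

```c
/* Sketch: request only the thread level the application needs.
 * A single-threaded app asks for MPI_THREAD_SINGLE, letting the
 * library skip the locking overhead of MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
    if (provided < MPI_THREAD_SINGLE)
        fprintf(stderr, "requested thread level not available\n");

    /* ... application: Send(), Recv(), Barrier() ... */

    MPI_Finalize();
    return 0;
}
```

This requires an MPI installation to build; a multithreaded application would instead pass MPI_THREAD_MULTIPLE and check the `provided` level before proceeding.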
>>>
>>> Let us know if either of these fixes the problem, especially if just
>>> removing the --enable-threads option fixes this.
>>>
>>> Thanks,
>>> -d
>>>
>>> On 07/10/2009 06:19 PM, chong tan wrote:
>>>> I am seeing this funny situation which I did not see on 1.0.6 and
>>>> 1.0.8.  Some background:
>>>>
>>>> machine: Intel 4x Core 2
>>>>
>>>> running mpiexec -n 4
>>>>
>>>> machine has 32 GB of memory.
>>>>
>>>> When my application runs, almost all memory is used; however, there
>>>> is no swapping.  I have exclusive use of the machine, so contention
>>>> is not an issue.
>>>>
>>>> issue #1: processes take extra long to initialize, compared to 1.0.6
>>>> issue #2: during the run, at times all of them become idle at the
>>>> same time, for almost a minute.  We never observed this with 1.0.6.
>>>>
>>>>
>>>> The code is the same, only linked with different versions of MPICH2.
>>>>
>>>> MPICH2 was built with --enable-threads=multiple for 1.1, and without
>>>> it for 1.0.6 and 1.0.8.
>>>>
>>>> MPI calls are all in the main application thread.  I used only four
>>>> MPI functions: Init(), Send(), Recv(), and Barrier().
>>>>
>>>>
>>>>
>>>> any suggestion ?
>>>>
>>>> thanks
>>>> tan
>>>>
>>>
>>
> 



      