[mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period
Darius Buntinas
buntinas at mcs.anl.gov
Tue Jul 14 10:05:00 CDT 2009
Can you attach a debugger to process 0 and see what it's doing during
this nap? Once process 0 finishes the nap, does it send/receive the
messages to/from the other processes, and do things continue normally?
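
If it helps, something along these lines should capture what every thread
of process 0 is doing without an interactive session. This is only a sketch:
it assumes gdb is installed and that you are allowed to attach to the
process, and dump_stacks (along with its GDB override variable) is a name
made up here for illustration, not part of gdb or MPICH2.

```shell
#!/bin/sh
# dump_stacks is a hypothetical helper name, not part of gdb or MPICH2.
# Usage: dump_stacks PID
# Set GDB to override the debugger command (e.g. GDB="echo gdb" for a dry run).
dump_stacks() {
    pid=$1
    # -p attaches to the running process; -batch exits once the commands
    # finish, which detaches and lets the process continue; each -ex runs
    # one gdb command. "thread apply all bt" prints a backtrace for every
    # thread in the process.
    ${GDB:-gdb} -p "$pid" -batch -ex "thread apply all bt"
}
```

Running dump_stacks with the pid of rank 0 while everything is idle should
show whether it is sitting inside an MPI call, blocked on a mutex, or
somewhere in the application.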
Thanks,
-d
On 07/13/2009 08:57 PM, chong tan wrote:
> This is the sequence of MPI calls that leads to the 'nap' (all numbers
> represent proc IDs per MPICH2):
>
> 0 sends to 1, received by 1
> 0 sends to 2, received by 2
> 0 sends to 3, received by 3
> <application activities, shm called>
> 1 blocking send to 0, send buffered
> 1 calls blocking receive from 0
> 3 blocking send to 0, send buffered
> 3 calls blocking receive from 0
> 2 blocking send to 0, send buffered
> 2 calls blocking receive from 0
> <proc 0 executes some application activities>
> proc 0 becomes idle
> <nap time>
>
>
>
> This is rather strange; it only happens on this particular test. Hope
> this info helps.
>
> tan
>
> ------------------------------------------------------------------------
> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Monday, July 13, 2009 11:47:38 AM
> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
> processes become idle for extensive period
>
>
> Is there a simpler example of this that you can send us? If nothing
> else, a binary would be ok.
>
> Does the program that takes the 1 minute "nap" use threads? If so, how
> many threads does each process create?
>
> Can you find out what the processes (or threads if it's multithreaded)
> are doing during this time? E.g., are they in an mpi call? Are they
> blocking on a mutex? If so, can you tell us what line number it's
> blocked on?
>
> Can you try this without shared memory by setting the environment
> variable MPICH_NO_LOCAL to 1 and see if you get the same problem?
> MPICH_NO_LOCAL=1 mpiexec -n 4 ...
>
> Thanks,
> -d
>
>
>
> On 07/13/2009 01:35 PM, chong tan wrote:
>> Sorry, I can't do that. The benchmark involves two things. One is from my
>> customer, and I am not allowed to distribute it. I may be able to get a
>> limited license of my product for you to try, but I definitely cannot
>> send source code.
>>
>> tan
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>> *To:* mpich-discuss at mcs.anl.gov
>> *Sent:* Monday, July 13, 2009 10:54:50 AM
>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>> processes become idle for extensive period
>>
>>
>> Can you send us the benchmark you're using? This will help us figure
>> out what's going on.
>>
>> Thanks,
>> -d
>>
>> On 07/13/2009 12:36 PM, chong tan wrote:
>>>
>>> thanks darius,
>>>
>>> When I did the comparison (or benchmarking), I had 2 identical source
>>> trees. Everything was recompiled from the ground up and compiled/linked
>>> against the version of MPICH2 to be used.
>>>
>>> I have many tests; this is the only one showing this behavior, and it is
>>> predictably repeatable.
>>> Most of my tests show comparable performance, and many do better
>>> with 1.1.
>>>
>>> The 'weirdest' thing is the ~1 minute span where there is no activity on
>>> the box at all: zero activity except 'top', with machine load at around
>>> 0.12. I don't know how to explain this 'behavior', and I am extremely
>>> curious whether anyone can explain it.
>>>
>>> I can't repeat this on AMD boxes as I don't have one that has only 32G
>>> of memory. I can't repeat this on a Niagara box as thread multiple
>>> won't build.
>>>
>>> I will try to rebuild 1.1 without thread-multiple. Will keep you posted.
>>>
>>> Meanwhile, if anyone has any speculations on this, please bring them up.
>>>
>>> thanks
>>> tan
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>>> *To:* mpich-discuss at mcs.anl.gov
>>> *Sent:* Monday, July 13, 2009 8:30:19 AM
>>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>>> processes become idle for extensive period
>>>
>>> Tan,
>>>
>>> Did you just re-link the applications, or did you recompile them?
>>> Version 1.1 is most likely not binary compatible with 1.0.6, so you
>>> really need to recompile the application.
>>>
>>> Next, don't use the --enable-threads=multiple flag when configuring
>>> mpich2. By default, mpich2 supports all thread levels and will select
>>> the thread level at run time (depending on the parameters passed to
>>> MPI_Init_thread). By allowing the thread level to be selected
>>> automatically at run time, you'll avoid the overhead of thread safety
>>> when it's not needed, allowing your non-threaded applications to run
>>> faster.
>>>
>>> Let us know if either of these fixes the problem, especially if just
>>> removing the --enable-threads option fixes this.
>>>
>>> Thanks,
>>> -d
>>>
>>> On 07/10/2009 06:19 PM, chong tan wrote:
>>>> I am seeing this funny situation which I did not see on 1.0.6 and
>>>> 1.0.8. Some background:
>>>>
>>>> machine : INTEL 4Xcore 2
>>>>
>>>> running mpiexec -n 4
>>>>
>>>> machine has 32G of mem.
>>>>
>>>> When my application runs, almost all memory is used. However, there
>>>> is no swapping.
>>>> I have exclusive use of the machine, so contention is not an issue.
>>>>
>>>> issue #1 : processes take extra long to be initialized, compared to 1.0.6
>>>> issue #2 : during the run, at times all of them will become idle at the
>>>> same time, for almost a
>>>> minute. We never observed this with 1.0.6
>>>>
>>>>
>>>> The code is the same, only linked with different versions of MPICH2.
>>>>
>>>> MPICH2 was built with --enable-threads=multiple for 1.1, and without it
>>>> for 1.0.6 and 1.0.8.
>>>>
>>>> MPI calls are all in the main application thread. I used only 4 MPI
>>>> functions:
>>>> init(), Send(), Recv() and Barrier().
>>>>
>>>>
>>>>
>>>> Any suggestions?
>>>>
>>>> thanks
>>>> tan
>>>>