[mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period
chong tan
chong_guan_tan at yahoo.com
Thu Jul 16 13:17:03 CDT 2009
Oversubscription is an issue, but a rather tiny one in my application. For -n 4,
there will be 20 processes running at startup time, and 16 of those take
less than 0.5 sec total to go idle. That had not been a problem until
I ran into this combination of thread multiple, machine, and the particular test case.
BTW, I had not used that machine in a long time; it is at the
lowest end of the HW spectrum for the problem my application is trying to solve.
Finding this combo was pure luck.
I hope these 2 issues provide a good research opportunity for the MPICH2
team.
thanks
tan
________________________________
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Thursday, July 16, 2009 7:52:21 AM
Subject: Re: [mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period
Yeah, oversubscribing the processors will have this effect (because of a
broken implementation of sched_yield() in the Linux kernel since
2.6.23). I'm not exactly sure why thread multiple would make it worse,
though. This is something to look into.
Thanks for letting us know about this.
-d
On 07/15/2009 05:42 PM, chong tan wrote:
> I just completed building without --enable-threads=multiple, and the
> slow startup and nap problems went away.
>
> Regarding the slow startup, it may have something to do with my
> application. When my application runs, it actually starts 4 other
> processes: licensing, recording, etc. I can see each of these processes
> being run one after another. BTW, I am using processor affinity, and
> that may be making the situation worse.
>
> tan
>
>
> ------------------------------------------------------------------------
> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
> *To:* mpich-discuss at mcs.anl.gov
> *Sent:* Tuesday, July 14, 2009 8:05:00 AM
> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
> processes become idle for extensive period
>
>
> Can you attach a debugger to process 0 and see what it's doing during
> this nap? Once the process 0 finishes the nap, does it send/receive the
> messages to/from the other processes and things continue normally?
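For anyone following the thread, attaching a debugger to a rank during the nap can be done roughly like this (the binary name "myapp" and the PID 12345 are placeholders, not taken from the discussion):

```shell
# Find the PID of rank 0 (assuming the application binary is named "myapp").
pgrep -fl myapp

# Attach gdb non-interactively and dump a backtrace of every thread,
# which shows whether rank 0 is inside an MPI call, a mutex, etc.
gdb -p 12345 -batch -ex "thread apply all bt"
```

Running this a couple of times during the idle minute would show whether the process is stuck in the same spot each time.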
>
> Thanks,
> -d
>
> On 07/13/2009 08:57 PM, chong tan wrote:
>> this is the sequence of MPI calls that leads to the 'nap' (all numbers
>> represent proc ids per MPICH2):
>>
>> 0 sends to 1, received by 1
>> 0 sends to 2, received by 2
>> 0 sends to 3, received by 3
>> <application activities, shm called>
>> 1 blocking send to 0, send buffered
>> 1 calls blocking receive from 0
>> 3 blocking send to 0, send buffered
>> 3 calls blocking receive from 0
>> 2 blocking send to 0, send buffered
>> 2 calls blocking receive from 0
>> <proc 0 executes some application activities>
>> proc 0 becomes idle
>> <nap time>
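For anyone trying to reproduce this, the sequence above corresponds roughly to the sketch below. This is a guess at the pattern from the description, not the actual application code; tags, payloads, and buffer contents are invented for illustration.

```c
#include <mpi.h>

/* Sketch of the described pattern: rank 0 sends to ranks 1..3, then each
 * of ranks 1..3 does a blocking send to 0 (small, so it is buffered)
 * followed by a blocking receive from 0. Rank 0 does some work, drains
 * the incoming sends, and replies. */
int main(int argc, char **argv)
{
    int rank, size, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int peer = 1; peer < size; peer++)
            MPI_Send(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        /* <application activities> happen here */
        for (int peer = 1; peer < size; peer++)
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        for (int peer = 1; peer < size; peer++)
            MPI_Send(&buf, 1, MPI_INT, peer, 1, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); /* buffered, small */
        MPI_Recv(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and run with mpiexec -n 4, this would exercise the same send/receive ordering, which might help narrow down whether the nap is reproducible outside the full application.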
>>
>>
>>
>> This is rather strange; it only happens on this particular test. Hope
>> this info helps.
>>
>> tan
>>
>> ------------------------------------------------------------------------
>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>> *To:* mpich-discuss at mcs.anl.gov
>> *Sent:* Monday, July 13, 2009 11:47:38 AM
>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>> processes become idle for extensive period
>>
>>
>> Is there a simpler example of this that you can send us? If nothing
>> else, a binary would be ok.
>>
>> Does the program that takes the 1 minute "nap" use threads? If so, how
>> many threads does each process create?
>>
>> Can you find out what the processes (or threads, if it's multithreaded)
>> are doing during this time? E.g., are they in an MPI call? Are they
>> blocking on a mutex? If so, can you tell us what line number they're
>> blocked on?
>>
>> Can you try this without shared memory by setting the environment
>> variable MPICH_NO_LOCAL to 1 and see if you get the same problem?
>> MPICH_NO_LOCAL=1 mpiexec -n 4 ...
>>
>> Thanks,
>> -d
>>
>>
>>
>> On 07/13/2009 01:35 PM, chong tan wrote:
>>> Sorry, can't do that. The benchmark involves 2 things, one from my
>>> customer which I am not allowed to distribute. I may be able to get a
>>> limited license of my product for you to try, but I definitely cannot
>>> send source code.
>>>
>>> tan
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>>> *To:* mpich-discuss at mcs.anl.gov
>>> *Sent:* Monday, July 13, 2009 10:54:50 AM
>>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>>> processes become idle for extensive period
>>>
>>>
>>> Can you send us the benchmark you're using? This will help us figure
>>> out what's going on.
>>>
>>> Thanks,
>>> -d
>>>
>>> On 07/13/2009 12:36 PM, chong tan wrote:
>>>>
>>>> thanks darius,
>>>>
>>>> When I did the comparison (or benchmarking), I had 2 identical source
>>>> trees. Everything was recompiled from the ground up and compiled/linked
>>>> according to the version of MPICH2 to be used.
>>>>
>>>> I have many tests; this is the only one showing this behavior, and it
>>>> is predictably repeatable. Most of my tests show comparable
>>>> performance, and many do better with 1.1.
>>>>
>>>> The 'weirdest' thing is the ~1 minute span where there is no activity
>>>> on the box at all, zero activity except 'top', with the machine load
>>>> at around 0.12. I don't know how to explain this 'behavior', and I am
>>>> extremely curious whether anyone can explain it.
>>>>
>>>> I can't repeat this on AMD boxes as I don't have one that has only 32G
>>>> of memory. I can't repeat this on a Niagara box as thread multiple
>>>> won't build.
>>>>
>>>> I will try to rebuild 1.1 without thread-multiple. Will keep you posted.
>>>>
>>>> Meanwhile, if anyone has any speculations on this, please bring them up.
>>>>
>>>> thanks
>>>> tan
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Darius Buntinas <buntinas at mcs.anl.gov>
>>>> *To:* mpich-discuss at mcs.anl.gov
>>>> *Sent:* Monday, July 13, 2009 8:30:19 AM
>>>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>>>> processes become idle for extensive period
>>>>
>>>> Tan,
>>>>
>>>> Did you just re-link the applications, or did you recompile them?
>>>> Version 1.1 is most likely not binary compatible with 1.0.6, so you
>>>> really need to recompile the application.
>>>>
>>>> Next, don't use the --enable-threads=multiple flag when configuring
>>>> mpich2. By default, mpich2 supports all thread levels and will select
>>>> the thread level at run time (depending on the parameters passed to
>>>> MPI_Init_thread). By allowing the thread level to be selected
>>>> automatically at run time, you'll avoid the overhead of thread safety
>>>> when it's not needed, allowing your non-threaded applications to run
>>> faster.
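For reference, the run-time selection Darius describes happens through MPI_Init_thread: the application requests a thread level and the library reports what it actually granted. A minimal sketch (not from the thread, just the standard MPI_Init_thread pattern):

```c
#include <mpi.h>
#include <stdio.h>

/* Request only the thread level the application needs. A program that
 * makes all MPI calls from one thread can ask for MPI_THREAD_SINGLE and
 * avoid the locking overhead that MPI_THREAD_MULTIPLE implies. */
int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);

    /* 'provided' is what the library actually supports at run time;
     * it may be higher than requested, but must be checked if the
     * application needs a stronger level than it asked for. */
    printf("provided thread level: %d\n", provided);

    MPI_Finalize();
    return 0;
}
```

Since tan's MPI calls are all in the main application thread, MPI_THREAD_SINGLE (or plain MPI_Init) should suffice, which is exactly why dropping --enable-threads=multiple helps.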
>>>>
>>>> Let us know if either of these fixes the problem, especially if just
>>>> removing the --enable-threads option fixes this.
>>>>
>>>> Thanks,
>>>> -d
>>>>
>>>> On 07/10/2009 06:19 PM, chong tan wrote:
>>>>> I am seeing this funny situation which I did not see on 1.0.6 and
>>>>> 1.0.8. Some background:
>>>>>
>>>>> machine: Intel 4x Core 2
>>>>>
>>>>> running mpiexec -n 4
>>>>>
>>>>> machine has 32G of mem.
>>>>>
>>>>> When my application runs, almost all memory is used. However, there
>>>>> is no swapping.
>>>>> I have exclusive use of the machine, so contention is not an issue.
>>>>>
>>>>> issue #1: processes take extra long to initialize, compared to 1.0.6
>>>>> issue #2: during the run, at times all of them become idle at the
>>>>> same time for almost a minute. We never observed this with 1.0.6.
>>>>>
>>>>>
>>>>> The codes are the same, only linked with different versions of MPICH2.
>>>>>
>>>>> MPICH2 was built with --enable-threads=multiple for 1.1, and without
>>>>> it for 1.0.6 and 1.0.8.
>>>>>
>>>>> MPI calls are all in the main application thread. I used only 4 MPI
>>>>> functions:
>>>>> Init(), Send(), Recv() and Barrier().
>>>>>
>>>>>
>>>>>
>>>>> any suggestion ?
>>>>>
>>>>> thanks
>>>>> tan