I will do that when I find the time and get the machine for my exclusive use.
Meanwhile, I reran the same test on a different machine, an AMD 4x dual-core
box with 64 GB, and I did not see the 'nap'. Given that AMD is more
performance-sensitive to the use of global shared memory, I think there is
something peculiar about the 'memory almost all consumed' situation.

tan

________________________________
From: Darius Buntinas <buntinas@mcs.anl.gov>
To: mpich-discuss@mcs.anl.gov
Sent: Tuesday, July 14, 2009 8:05:00 AM
Subject: Re: [mpich-discuss] version 1.1 strange behavior : all processes become idle for extensive period
Can you attach a debugger to process 0 and see what it's doing during
this nap? Once process 0 finishes the nap, does it send/receive the
messages to/from the other processes, and do things continue normally?

Thanks,
-d

On 07/13/2009 08:57 PM, chong tan wrote:
> This is the sequence of MPI calls that leads to the 'nap' (all numbers
> represent proc ids per MPICH2):
>
> 0 send to 1, received by 1
> 0 send to 2, received by 2
> 0 send to 3, received by 3
> <application activities, shm called>
> 1 blocking send to 0, send buffered
> 1 calls blocking receive from 0
> 3 blocking send to 0, send buffered
> 3 calls blocking receive from 0
> 2 blocking send to 0, send buffered
> 2 calls blocking receive from 0
> <proc 0 executes some application activities>
> proc 0 becomes idle
> <nap time>
>
> This is rather strange; it only happens on this particular test. Hope
> this info helps.
>
> tan
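A minimal sketch of the exchange pattern quoted above, assuming 4 ranks and
blocking MPI_Send/MPI_Recv; the 1-int payload, tag 0, and the placement of
the "application activities" comments are placeholders, not the actual
application code:

    /* Rank 0 sends one message to each worker; each worker replies with a
     * blocking send and then posts a blocking receive for 0's next message.
     * The spot where rank 0 waits for the replies is where the reported
     * idle period would appear. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, dst, src, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* 0 send to 1, 2, 3 */
            for (dst = 1; dst < size; dst++)
                MPI_Send(&buf, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
            /* <application activities> */
            /* collect the workers' replies; proc 0 goes idle here */
            for (src = 1; src < size; src++)
                MPI_Recv(&buf, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            /* answer each worker so its blocking receive completes */
            for (dst = 1; dst < size; dst++)
                MPI_Send(&buf, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* <application activities, shm called> */
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* small, buffered */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }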
> ------------------------------------------------------------------------
> *From:* Darius Buntinas <buntinas@mcs.anl.gov>
> *To:* mpich-discuss@mcs.anl.gov
> *Sent:* Monday, July 13, 2009 11:47:38 AM
> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
> processes become idle for extensive period
>
> Is there a simpler example of this that you can send us? If nothing
> else, a binary would be OK.
>
> Does the program that takes the 1-minute "nap" use threads? If so, how
> many threads does each process create?
>
> Can you find out what the processes (or threads, if it's multithreaded)
> are doing during this time? E.g., are they in an MPI call? Are they
> blocking on a mutex? If so, can you tell us what line number it's
> blocked on?
>
> Can you try this without shared memory by setting the environment
> variable MPICH_NO_LOCAL to 1 and see if you get the same problem?
> MPICH_NO_LOCAL=1 mpiexec -n 4 ...
>
> Thanks,
> -d
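One common way to see what each process is doing is to have every rank print
its host name and PID right after MPI_Init and, optionally, pause until a
debugger is attached; the sketch below is illustrative, and the DEBUG_WAIT
variable name and the hold loop are my own choices, not anything the
application already does:

    /* Print rank/host/pid so gdb can be attached ("gdb -p <pid>", then
     * "thread apply all bt" to see where the process is stuck).  If
     * DEBUG_WAIT is set, spin until the debugger clears 'hold'. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        char host[256];
        volatile int hold;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        gethostname(host, sizeof(host));
        printf("rank %d: host %s, pid %d\n", rank, host, (int)getpid());
        fflush(stdout);

        hold = (getenv("DEBUG_WAIT") != NULL);
        while (hold)          /* in gdb: set var hold = 0, then continue */
            sleep(1);

        /* ... rest of the application ... */

        MPI_Finalize();
        return 0;
    }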
> On 07/13/2009 01:35 PM, chong tan wrote:
>> Sorry, I can't do that. The benchmark involves two things, one of which
>> comes from my customer, and I am not allowed to distribute it. I may be
>> able to get a limited license of my product for you to try, but I
>> definitely cannot send source code.
>>
>> tan
>>
>> ------------------------------------------------------------------------
>> *From:* Darius Buntinas <buntinas@mcs.anl.gov>
>> *To:* mpich-discuss@mcs.anl.gov
>> *Sent:* Monday, July 13, 2009 10:54:50 AM
>> *Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all
>> processes become idle for extensive period
>>
>> Can you send us the benchmark you're using? This will help us figure
>> out what's going on.
>>
>> Thanks,
>> -d
>>
>> On 07/13/2009 12:36 PM, chong tan wrote:
>>> Thanks, Darius.
>>>
>>> When I did the comparison (or benchmarking), I had two identical source
>>> trees. Everything was recompiled from the ground up and compiled/linked
>>> against the version of MPICH2 to be used.
>>>
>>> I have many tests; this is the only one showing this behavior, and it
>>> is predictably repeatable. Most of my tests show comparable performance,
>>> and many do better with 1.1.
>>>
>>> The 'weirdest' thing is the ~1 minute span where there is no activity
>>> on the box at all, zero activity except 'top', with the machine load at
>>> around 0.12. I don't know how to explain this 'behavior', and I am
>>> extremely curious whether anyone can explain it.
>>>
>>> I can't repeat this on AMD boxes, as I don't have one that has only
>>> 32 GB of memory. I can't repeat it on a Niagara box, as thread-multiple
>>> won't build there.
>>>
>>> I will try to rebuild 1.1 without thread-multiple. Will keep you posted.
>>>
>>> Meanwhile, if anyone has any speculations on this, please bring them up.
>>>
>>> thanks
>>> tan
ymailto="mailto:buntinas@mcs.anl.gov" href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>><br>>> <mailto:<a ymailto="mailto:buntinas@mcs.anl.gov" href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a> <mailto:<a ymailto="mailto:buntinas@mcs.anl.gov" href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>>>><br>>>> *To:* <a ymailto="mailto:mpich-discuss@mcs.anl.gov" href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a> <mailto:<a ymailto="mailto:mpich-discuss@mcs.anl.gov" href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a>><br>> <mailto:<a ymailto="mailto:mpich-discuss@mcs.anl.gov" href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a> <mailto:<a ymailto="mailto:mpich-discuss@mcs.anl.gov" href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a>>><br>>>> *Sent:* Monday, July 13, 2009 8:30:19 AM<br>>>>
*Subject:* Re: [mpich-discuss] version 1.1 strange behavior : all<br>>>> processes become idle for extensive period<br>>>><br>>>> Tan,<br>>>><br>>>> Did you just re-link the applications, or did you recompile them?<br>>>> Version 1.1 is most likely not binary compatible with 1.0.6, so you<br>>>> really need to recompile the application.<br>>>><br>>>> Next, don't use the --enable-threads=multiple flag when configuring<br>>>> mpich2. By default, mpich2 supports all thread levels and will select<br>>>> the thread level at run time (depending on the parameters passed to<br>>>> MPI_Init_thread). By allowing the thread level to be selected<br>>>> automatically at run time, you'll avoid the overhead of thread safety<br>>>> when it's not needed, allowing your non-threaded applications to run<br>>>
faster.<br>>>><br>>>> Let us know if either of these fixes the problem, especially if just<br>>>> removing the --enable-threads option fixes this.<br>>>><br>>>> Thanks,<br>>>> -d<br>>>><br>>>> On 07/10/2009 06:19 PM, chong tan wrote:<br>>>>> I am seeing this funny situation which I did not see on 1.0.6 and<br>>>>> 1.0.8. Some background:<br>>>>><br>>>>> machine : INTEL 4Xcore 2<br>>>>><br>>>>> running mpiexec -n 4<br>>>>><br>>>>> machine has 32G of mem.<br>>>>><br>>>>> when my application runs, almost all memory are used. However, there<br>>>>> is no swapping.<br>>>>> I have exclusive use of the machine, so contention is not an issue.<br>>>>><br>>>>> issue #1 : processes take extra long to be
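To illustrate the run-time selection described above (an example, not code
from this thread): a single-threaded application can request just
MPI_THREAD_SINGLE, and an MPICH2 built with full thread support will then
avoid the locking it would need for MPI_THREAD_MULTIPLE.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request only the thread level the application actually needs;
         * the library picks the effective level at run time from this
         * argument, so a single-threaded run skips thread-safety overhead. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);

        printf("requested MPI_THREAD_SINGLE, provided level = %d\n", provided);

        MPI_Finalize();
        return 0;
    }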
>>> On 07/10/2009 06:19 PM, chong tan wrote:
>>>> I am seeing this funny situation, which I did not see with 1.0.6 and
>>>> 1.0.8. Some background:
>>>>
>>>> machine: Intel 4x Core 2
>>>> running mpiexec -n 4
>>>> the machine has 32 GB of memory
>>>>
>>>> When my application runs, almost all memory is used. However, there
>>>> is no swapping. I have exclusive use of the machine, so contention is
>>>> not an issue.
>>>>
>>>> issue #1: processes take extra long to initialize, compared to 1.0.6
>>>> issue #2: during the run, at times all of them become idle at the
>>>> same time, for almost a minute. We never observed this with 1.0.6.
>>>>
>>>> The codes are the same, only linked with different versions of MPICH2.
>>>>
>>>> MPICH2 was built with --enable-threads=multiple for 1.1, and without it
>>>> for 1.0.6 and 1.0.8.
>>>>
>>>> MPI calls are all in the main application thread. I used only 4 MPI
>>>> functions: Init(), Send(), Recv() and Barrier().
>>>>
>>>> Any suggestion?
>>>>
>>>> thanks
>>>> tan
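Since the application is described as using only Init(), Send(), Recv() and
Barrier() under mpiexec -n 4, a stripped-down skeleton along these lines might
help narrow down where the long stall occurs; the iteration count, payload,
tag, and the 5-second threshold are arbitrary choices for illustration:

    /* Run with: mpiexec -n 4 ./a.out
     * Communication is limited to MPI_Send, MPI_Recv and MPI_Barrier;
     * MPI_Wtime is added only so that an unusually long receive on rank 0
     * shows up in the output. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000

    int main(int argc, char **argv)
    {
        int rank, size, src, it, token = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (it = 0; it < ITERS; it++) {
            if (rank == 0) {
                for (src = 1; src < size; src++) {
                    double t0 = MPI_Wtime();
                    MPI_Recv(&token, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    double dt = MPI_Wtime() - t0;
                    if (dt > 5.0)   /* arbitrary "nap" threshold */
                        printf("iter %d: recv from %d took %.1f s\n",
                               it, src, dt);
                }
            } else {
                MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }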