<div>Hi Pavan,</div><div><br></div><div>Thanks a lot for your quick answer.</div><div><br></div><div>Unfortunately, one of our main challenges right now is to identify a piece of code where the program always fails.</div><div>
We tried to isolate the code and create some test programs, but so far we haven't found one that reproduces the problem.</div><div><br></div><div>However, we tried a different approach by tracing every MPI call in our code, and we discovered a curious behavior I would like to share: we have an MPI_BCAST call from which one of the 32 processes never returned. The program then froze, and we could verify that all the other processes were stuck further ahead in the code, waiting for it.</div>
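<div><br></div><div>(For reference, this kind of per-call tracing can be done through the standard PMPI profiling interface; the sketch below is only an illustration of the idea, with made-up log messages, and not our actual code.)</div><div><br></div>
<pre>
/* Minimal PMPI-style tracing wrapper (illustrative sketch only).
 * The MPI profiling interface lets the application supply its own
 * MPI_Bcast that forwards to PMPI_Bcast; logging entry/exit per rank
 * makes a stuck collective easy to spot. */
#include &lt;mpi.h&gt;
#include &lt;stdio.h&gt;

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* global rank for the log */
    fprintf(stderr, "[rank %d] entering MPI_Bcast (count=%d, root=%d)\n",
            rank, count, root);
    int err = PMPI_Bcast(buf, count, datatype, root, comm);
    fprintf(stderr, "[rank %d] leaving MPI_Bcast\n", rank);
    return err;
}
</pre>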
<div><br></div><div>We did some batch runs and were able to reproduce this same behavior some of the time.</div><div><br></div><div>There are two questions I would like to ask:</div><div>1) I thought (I read it somewhere on the web, though I can't remember where) that all collective MPI operations, including MPI_BCAST, act as a barrier (but we could not confirm this in the case above). Is this right?</div>
<div>2) Is there a way to trace MPI in some fashion that could help us debug this problem? Perhaps an MPI "debug" argument. What would you recommend? (We are running on Linux, on Amazon EC2.)</div><div><br>
</div><div>Regards,</div><div>Luiz</div><div><br><br><div class="gmail_quote">
On 28 February 2012 20:08, Pavan Balaji <span dir="ltr"><<a href="mailto:balaji@mcs.anl.gov" target="_blank">balaji@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Can you send us a test program where it fails? Below are the answers to your questions.<div><br>
<br>
On 02/28/2012 05:03 PM, Luiz Carlos Costa Junior wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
1) Is there any limitation on the size of the buffer that is sent?<br>
</blockquote>
<br></div>
No. You might run into a limit with the "int" parameter for the datatype count if you send more than 2 billion elements, but you can easily work around that by building a derived datatype out of larger units (e.g., doubles or larger contiguous types). But with 12 MB of data, that's irrelevant.<div>
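A rough sketch of that workaround (illustrative names only, not a tested implementation) would look something like this:<br>
<pre>
/* Sketch: broadcast more than 2^31 doubles by grouping them into a
 * contiguous derived datatype so each MPI_Bcast count stays within "int". */
#include &lt;mpi.h&gt;

#define CHUNK 1048576  /* 2^20 doubles per derived-type element */

void bcast_large(double *buf, long long nelems, int root, MPI_Comm comm)
{
    MPI_Datatype chunk_type;
    MPI_Type_contiguous(CHUNK, MPI_DOUBLE, &chunk_type);
    MPI_Type_commit(&chunk_type);

    long long nchunks = nelems / CHUNK;   /* full chunks */
    long long rem     = nelems % CHUNK;   /* leftover elements */

    MPI_Bcast(buf, (int)nchunks, chunk_type, root, comm);
    if (rem)
        MPI_Bcast(buf + nchunks * CHUNK, (int)rem, MPI_DOUBLE, root, comm);

    MPI_Type_free(&chunk_type);
}
</pre>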
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2) If this limit exists, would it be related to the number of<br>
processes in the communicator? In this case, I am using 32 processes, but<br>
I have commonly had success with bigger clusters (over 200 processes).<br>
</blockquote>
<br></div>
The number of processes doesn't matter.<div><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
3) Is the content of the data being sent relevant? If I have some<br>
uninitialized data, would that be a concern? In other words, I understand<br>
that the only thing that matters is that the buffer size must be correct<br>
in all processes (any combination of datatype/array size) and there must<br>
be enough allocated space to receive the data, right?<br>
</blockquote>
<br></div>
No. MPI doesn't care what you are sending.<div><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
4) What is the best way to send this data? Might splitting it into<br>
smaller broadcasts be better/safer?<br>
</blockquote>
<br></div>
One broadcast is the best. If you see better performance by doing multiple smaller broadcasts, that'll be classified as a bug in our code.<div><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
5) How should I classify a 12 MB message? Small? Big? I believe it<br>
should be pretty small, because I also have other typical execution<br>
instances with messages over 100 MB that succeeded.<br>
</blockquote>
<br></div>
It's small enough that it should work fine.<span><font color="#888888"><br>
<br>
-- <br>
Pavan Balaji<br>
<a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
</font></span></blockquote></div><br></div>