<div class="gmail_quote">On Tue, Feb 7, 2012 at 7:56 PM, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">Same question: are you sure every process is there. I will bet $10 there is at least one missing.</div></div></blockquote><div><br></div><div>Well, of course you guys were right! I was able to write a short piece of Python (pasted at the bottom of this email in case others find it interesting / useful) to look through the stack trace of every one of my processes in a given job on the cluster and tell me which ones weren't where I was expecting them to be. Once I identified these processes I was able to ssh to the individual nodes, attach to those processes with gdb, and get stack traces out to figure out what in the heck they were doing.</div>
<div><br></div><div>Two things came up, both in a library our software depends on:</div><div><br></div><div>1. There was an inadvertent vector "localize" operation happening on the solution vector. This means that we were making a complete _local_ copy of the parallel solution vector on every processor! That would sometimes fail with 400 million DOFs and 8000+ MPI processes ;-)</div>
<div><br></div><div>2. A small optimization problem having to do with threading and allocation / deallocation of small vectors. This was just slowing some of the nodes down so much that it would look like the processes had hung.</div>
<div><br></div><div>After fixing up both of these things the jobs are now moving.</div><div><br></div><div>Thanks for the suggestion ;-)</div><div><br></div><div>Below is the Python script I used to figure this stuff out. Call it with the job number of the job you want to analyze. It is set up for PBS, so if your cluster doesn't use PBS you'll have to disregard the top part, where I'm just parsing out the list of nodes the job is running on. "fission-\d\d\d\d" is the regex pattern that matches my cluster's node names.</div>
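<div><br></div><div>The node-list parsing described above can be sketched on its own. This is a minimal Python 3 sketch, not the script from this email: the function name parse_hosts and the de-duplication step are assumptions added for illustration (the original script appends every match it finds), and the pattern only fits clusters whose nodes are named like "fission-0001".</div>

```python
import re

# Node-name pattern taken from the email; change it to match your cluster.
HOST_PATTERN = re.compile(r"fission-\d\d\d\d")


def parse_hosts(qstat_output):
    """Return the unique node names found in the text, in first-seen order.

    De-duplicating is an assumption of this sketch: it avoids visiting the
    same node twice if its name appears more than once in the listing.
    """
    hosts = []
    seen = set()
    for name in HOST_PATTERN.findall(qstat_output):
        if name not in seen:
            seen.add(name)
            hosts.append(name)
    return hosts
```

<div>The text passed in would be whatever "qstat -n" prints for the job; on a non-PBS cluster the same function could be fed the output of that scheduler's node-listing command instead.</div>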
<div><br></div><div>MatCreateMPIAIJ was what I was looking for in the stack trace (I wanted to see if every process had made it there). gastdr was my username; replace it with yours. "marmot" was the name of the executable I was running. "bad_hosts" gets filled with, for each node, the number of processes owned by you whose stack trace contains the string you are looking for. At the end I compare that count against the number of MPI processes per node I was running (in this case 4). Any host with fewer than 4 processes where I was expecting them to be gets spit out at the end. Then it was time to ssh to that node, attach to the processes, and figure out what was going on.</div>
<div><br></div><div>It's 3:30AM here right now, so you'll have to excuse some of the rough edges in the script. I really just hacked it together for myself but thought others might find some pieces of it useful. Oh, and yes, I did use os.popen()... after all these years I still find it more straightforward to use than any of the subprocess stuff in Python. It has been deprecated for a _long_ time now... but I hope they never remove it ;-)</div>
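<div><br></div><div>The per-node counting and the final check can be separated from the ssh and gdb plumbing. The following is a minimal Python 3 sketch under stated assumptions: the function names (count_matches, tally, suspect_hosts) are hypothetical, and the input is assumed to be a dictionary mapping each host name to the backtrace lines already collected from it. Unlike the script in this email, it records a count of zero for hosts with no matching process, so those hosts are reported as well.</div>

```python
def count_matches(lines, needle="MatCreateMPIAIJ"):
    """Count the lines that contain the function name being searched for."""
    return sum(1 for line in lines if needle in line)


def tally(traces_by_host, needle="MatCreateMPIAIJ"):
    """Map every host to its number of matching lines (zero included)."""
    return {
        host: count_matches(lines, needle)
        for host, lines in traces_by_host.items()
    }


def suspect_hosts(counts, expected=4):
    """Return, sorted, the hosts whose count differs from `expected`.

    `expected` is the number of MPI processes per node (4 in the email).
    """
    return sorted(host for host, num in counts.items() if num != expected)
```

<div>Any host returned by suspect_hosts is a candidate for logging in and attaching gdb by hand.</div>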
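<div><br></div><div>For readers who would rather avoid os.popen(), a rough subprocess-based stand-in for os.popen(command).readlines() might look like the sketch below. It assumes Python 3.5 or later (for subprocess.run), and the helper name read_lines is hypothetical. In Python 3, os.popen() is still available and is itself built on the subprocess module, so either approach works there.</div>

```python
import subprocess


def read_lines(command):
    """Run a shell command and return its stdout as a list of lines.

    Line endings are kept, as with readlines(). stderr is not captured,
    so it still goes wherever the parent process's stderr goes.
    """
    result = subprocess.run(
        command,
        shell=True,                # the command is a shell pipeline string
        stdout=subprocess.PIPE,    # capture standard output only
        universal_newlines=True,   # decode bytes to str
    )
    return result.stdout.splitlines(True)
```

<div>Because shell=True hands the string to the shell unchanged, this should only be used with commands you build yourself, never with untrusted input.</div>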
<div><br></div><div>Happy hunting all!</div><div><br></div><div>Derek</div><div><br></div><div><br></div><div>-------</div><div><br></div><div><div>import os</div><div>import sys</div><div>import re</div><div><br></div><div>
command = "qstat -n " + sys.argv[1]</div><div><br></div><div>output = os.popen(command).readlines()</div><div><br></div><div>regex = re.compile(r"(fission-\d\d\d\d)")</div><div><br></div><div>hosts = []</div>
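<div><br></div><div>for line in output:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;f = regex.findall(line)</div><div>&nbsp;&nbsp;&nbsp;&nbsp;for i in f:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;hosts.append(i)</div><div><br></div><div>matcreates = 0</div><div>bad_hosts = {}</div><div>#host = hosts[0]</div>
<div>for host in hosts:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;command = "ssh " + host + " \"ps aux | grep 'gastdr .*marmot' | grep -v grep | awk '{print \$2}' | xargs -I {} gdb --batch --pid={} -ex bt | grep 'MatCreateMPIAIJ' 2>/dev/null \""</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;lines = os.popen(command).readlines()</div><div>&nbsp;&nbsp;&nbsp;&nbsp;for line in lines:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if line.find("MatCreateMPIAIJ") != -1:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;matcreates = matcreates + 1</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if host in bad_hosts:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bad_hosts[host] += 1</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bad_hosts[host] = 1</div><div><br></div><div>print(bad_hosts)</div><div><br></div><div>print("Num matches: " + str(matcreates))</div><div><br>
</div><div>print("Bad Hosts: ")</div><div>for host, num in bad_hosts.items():</div><div>&nbsp;&nbsp;&nbsp;&nbsp;if num != 4:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print(host)</div></div><div><br></div></div>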