[petsc-users] Hang at PetscLayoutSetUp()

Derek Gaston friedmud at gmail.com
Wed Feb 8 04:34:37 CST 2012


On Tue, Feb 7, 2012 at 7:56 PM, Matthew Knepley <knepley at gmail.com> wrote:

> Same question: are you sure every process is there. I will bet $10 there
>  is at least one missing.
>

Well, of course you guys were right!  I was able to write a short piece of
Python (pasted at the bottom of this email in case others find it
interesting / useful) that looks through the stack trace of every one of my
processes in a given job on the cluster and tells me which ones weren't in
the place I was expecting them to be.  Once I identified those processes I
was able to ssh to the individual nodes, attach gdb to them, and pull out
stack traces to figure out what in the heck they were doing.
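
(For the curious: the per-process poking boils down to pointing gdb at a PID
in batch mode.  Here is a tiny sketch of just that step -- the function name
and the PID are made up for illustration, but the gdb invocation is the same
one used in the script below.)

import os

def backtrace(pid):
  # Attach gdb non-interactively to a running process, dump its stack, detach.
  return os.popen("gdb --batch --pid=" + str(pid) + " -ex bt 2>/dev/null").read()

print backtrace(12345)   # replace 12345 with a real PID from 'ps aux'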

Two things came up, both in a library our software depends on:

1.  There was an inadvertent vector "localize" operation happening on the
solution vector.  This means that we were making a complete _local_ copy of
the parallel solution vector on every processor!  That will sometimes fail
with 400 million DoFs and 8000+ MPI processes ;-)  (There's a bit of quick
arithmetic on that right after this list.)

2.  A small optimization problem having to do with threading and allocation
/ deallocation of small vectors.  This was just slowing some of the nodes
down so much that it would look like the processes had hung.
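
(That quick arithmetic on #1, assuming 8-byte doubles and ignoring any index
/ ghost overhead, just to show why that localize was never going to fly:)

n_dofs        = 400e6   # global dofs in the solution vector
bytes_per_dof = 8       # one double per dof
ranks         = 8000    # MPI processes in the job

per_rank_gb = n_dofs * bytes_per_dof / 1024.0**3
total_tb    = per_rank_gb * ranks / 1024.0

print "per-rank copy: %.1f GB, across the job: %.1f TB" % (per_rank_gb, total_tb)
# roughly 3 GB duplicated on every rank, ~23 TB across 8000 ranks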

After fixing up both of these things the jobs are now moving.

Thanks for the suggestion ;-)

Below is the Python script I used to figure this stuff out.  You call it
with the job # of the job you want to analyze.  It is set up for PBS, so if
your cluster doesn't use PBS you'll have to disregard the top part where I'm
just parsing out the list of nodes the job is running on.  "fission-\d\d\d\d"
is the regex pattern my cluster's nodes are named in.

MatCreateMPIAIJ was what I was looking for in the stack trace (I wanted to
see if every process had made it there).  gastdr was my username; replace it
with yours.  "marmot" was the name of the executable I was running.
"bad_hosts" gets filled up with the number of processes owned by you on each
node whose stack trace contains the string you were looking for.  At the end
I check whether that matches how many MPI processes per node I was running
(in this case 4).  Any host with fewer than 4 processes sitting where I
expected them to be gets spit out at the end.  Then it's time to ssh to that
node, attach to the processes, and figure out what is going on.

It's 3:30AM here right now, so you'll have to excuse some of the rough
edges in the script.  I really just hacked it together for myself but
thought others might find pieces of it useful.  Oh, and yes, I did use
os.popen()... after all these years I still find it more straightforward
than any of the subprocess stuff in Python.  It has been deprecated for a
_long_ time now... but I hope they never remove it ;-)
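
(If you do prefer the blessed route, the os.popen() calls in the script
could be swapped for something like this -- an untested subprocess sketch
that should behave the same way:)

import subprocess

def run(command):
  # Same idea as os.popen(command).readlines(), via subprocess.
  p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
  out, _ = p.communicate()
  return out.splitlines(True)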

Happy hunting all!

Derek


-------

import os
import sys
import re

# Usage: python <this_script>.py <PBS job id>
# Ask PBS which nodes the job is running on.
command = "qstat -n " + sys.argv[1]

output = os.popen(command).readlines()

# Node-name pattern for my cluster -- change this to match yours.
regex = re.compile(r"(fission-\d\d\d\d)")

hosts = []

for line in output:
  f = regex.findall(line)
  for i in f:
    hosts.append(i)

matcreates = 0
bad_hosts = {}
for host in hosts:
  # On each node: find my marmot processes, attach gdb to each PID to get a
  # backtrace, and keep only the lines mentioning MatCreateMPIAIJ.
  command = ("ssh " + host + " \"ps aux | grep 'gastdr .*marmot' | grep -v grep"
             " | awk '{print \$2}'"
             " | xargs -I {} gdb --batch --pid={} -ex bt"
             " | grep 'MatCreateMPIAIJ' 2>/dev/null\"")
  lines = os.popen(command).readlines()
  for line in lines:
    if line.find("MatCreateMPIAIJ") != -1:
      matcreates = matcreates + 1
      if host in bad_hosts:
        bad_hosts[host] += 1
      else:
        bad_hosts[host] = 1

print bad_hosts

print "Num matches: " + str(matcreates)

# Spit out any host with fewer than 4 processes (my MPI ranks per node)
# sitting where I expected them to be -- including hosts with none at all.
print "Bad Hosts: "
for host in set(hosts):
  if bad_hosts.get(host, 0) != 4:
    print host