[petsc-users] Hang at PetscLayoutSetUp()
Dmitry Karpeev
karpeev at mcs.anl.gov
Wed Feb 8 07:41:38 CST 2012
Good to hear you got it sorted out!
I understand the PetscLayoutSetUp() and VecScatterCreate() problems were
unrelated, is that right? PetscLayoutSetUp() was hanging because some nodes
took too long to enter the call -- they were stuck waiting on mutex locks
elsewhere -- while VecScatterCreate() was "hanging" due to the sheer number
of indices in the IS (and because the scatter was being built as a general
point-to-point (PtoP) scatter when it should have been a ToAll scatter)?
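For anyone searching the archives later, here is a minimal petsc4py sketch
of the distinction (an illustration only, with made-up sizes; the C-side
equivalent of the second form is VecScatterCreateToAll()):

    from petsc4py import PETSc

    v = PETSc.Vec().createMPI(1000000)  # a distributed vector

    # General point-to-point scatter: each rank hands PETSc an IS of the
    # global indices it wants.  With hundreds of millions of indices the
    # setup of this scatter alone becomes very expensive.
    isall = PETSc.IS().createStride(v.getSize(), first=0, step=1,
                                    comm=PETSc.COMM_SELF)
    w = PETSc.Vec().createSeq(v.getSize())
    ptop = PETSc.Scatter().create(v, isall, w, None)

    # ToAll scatter: PETSc knows the pattern is "everything to everyone"
    # and can set it up and execute it much more efficiently.
    toall, w2 = PETSc.Scatter.toAll(v)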
Dmitry.
On Wed, Feb 8, 2012 at 4:34 AM, Derek Gaston <friedmud at gmail.com> wrote:
> On Tue, Feb 7, 2012 at 7:56 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> Same question: are you sure every process is there? I will bet $10 there
>> is at least one missing.
>>
>
> Well, of course you guys were right!  I was able to write a short piece of
> python (pasted at the bottom of this email in case others find it
> interesting / useful) to look through the stack trace of every one of my
> processes in a given job on the cluster and tell me which ones weren't in
> the place I was expecting them to be.  Once I identified those processes I
> was able to ssh to the individual nodes, attach gdb to them, and get stack
> traces out to figure out what in the heck they were doing.
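>
> (By the way, the attach step can be done non-interactively.  A sketch,
> with a hypothetical PID -- --batch makes gdb exit after running the
> "bt" command:
>
>     import os
>     print(os.popen("gdb --batch --pid 12345 -ex bt").read())
>
> and that's essentially what the script below does over ssh.)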
>
> Two things came up, both in a library our software depends on:
>
> 1.  There was an inadvertent vector "localize" operation happening on the
> solution vector.  This means we were making a complete _local_ copy of the
> parallel solution vector on every processor!  At 400 million dofs that's
> roughly 3.2 GB of solution data (8-byte doubles) on every rank, so it
> would sometimes fail with 8000+ MPI processes ;-)
>
> 2.  A small performance problem having to do with threading and the
> allocation / deallocation of small vectors.  This was slowing some of the
> nodes down so much that it looked like the processes had hung.
>
> After fixing up both of these things the jobs are now moving.
>
> Thanks for the suggestion ;-)
>
> Below is the python script I used to figure this stuff out.  You call it
> with the job # of the job you want to analyze.  It is set up for PBS, so
> if your cluster doesn't use PBS you'll have to disregard the top part
> where I'm just parsing out the list of nodes the job is running on.
> "fission-\d\d\d\d" is the regex pattern my cluster's nodes are named in.
>
> MatCreateMPIAIJ was what I was looking for in the stack trace (I wanted
> to see if every process had made it there).  gastdr was my username,
> replace it with yours.  "marmot" was the name of the executable I was
> running.  "bad_hosts" gets filled up with the number of processes owned
> by you on each node whose stack trace contains the string you were
> looking for.  At the end I check that count against how many MPI
> processes per node I was running (in this case 4).  Any host with fewer
> than 4 processes where I was expecting them to be gets spit out at the
> end.  Then it was time to ssh to that node, attach to the processes, and
> figure out what was going on.
>
> It's 3:30AM here right now, so you'll have to excuse some of the rough
> edges in the script.  I really just hacked it together for myself but
> thought others might find some pieces useful from it.  Oh, and yes, I did
> use os.popen()... after all these years I still find it more
> straightforward to use than anything in Python's subprocess module.  It
> has been deprecated for a _long_ time now... but I hope they never remove
> it ;-)
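>
> For reference, the rough subprocess equivalent of the os.popen() calls
> below would be something like:
>
>     import subprocess
>     # shell=True because the commands are full shell pipelines;
>     # subprocess.check_output needs Python 2.7+.
>     output = subprocess.check_output(command, shell=True).splitlines()
>
> ...but os.popen() it is.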
>
> Happy hunting all!
>
> Derek
>
>
> -------
>
> import os
> import sys
> import re
>
> # Ask PBS which nodes the given job (the job # passed as argv[1]) is
> # running on.
> command = "qstat -n " + sys.argv[1]
> output = os.popen(command).readlines()
>
> # My cluster's nodes are named "fission-0123" etc.; adjust to taste.
> regex = re.compile(r"(fission-\d\d\d\d)")
>
> hosts = []
> for line in output:
>     for host in regex.findall(line):
>         hosts.append(host)
>
> matcreates = 0
>
> # Pre-seed every host with 0 so nodes where NO process is in the
> # expected place still show up in the report at the end.
> bad_hosts = {}
> for host in hosts:
>     bad_hosts[host] = 0
>
> for host in hosts:
>     # On each node: find my marmot processes, attach gdb to each PID in
>     # batch mode, dump a backtrace, and grep it for MatCreateMPIAIJ.
>     # The \$ keeps the local shell from expanding $2 before ssh runs.
>     command = ("ssh " + host + " \"ps aux | grep 'gastdr .*marmot' "
>                "| grep -v grep | awk '{print \$2}' "
>                "| xargs -I {} gdb --batch --pid={} -ex bt "
>                "| grep 'MatCreateMPIAIJ' 2>/dev/null\"")
>     for line in os.popen(command).readlines():
>         if line.find("MatCreateMPIAIJ") != -1:
>             matcreates = matcreates + 1
>             bad_hosts[host] += 1
>
> print bad_hosts
>
> print "Num matches: " + str(matcreates)
>
> # I was running 4 MPI processes per node, so any host whose count
> # isn't 4 is one to ssh into and inspect with gdb.
> print "Bad Hosts: "
> for host, num in bad_hosts.items():
>     if num != 4:
>         print host
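>
> To use it, save it somewhere and run it against a job, e.g. (job ID
> made up):
>
>     python find_stuck.py 123456
>
> where find_stuck.py is whatever name you gave the script.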
>
>