Stalling once linear system becomes a certain size

Satish Balay balay at mcs.anl.gov
Mon Apr 7 08:28:18 CDT 2008


On Mon, 7 Apr 2008, David Knezevic wrote:

> Hello,
> 
> I am trying to run a PETSc code on a parallel machine (it may be relevant that
> each node is an SMP unit with four quad-core 64-bit AMD Opteron processors, 16
> cores in all, and 32GB of memory), and I'm observing some behaviour I don't
> understand.
> 
> I'm using PETSC_COMM_SELF to construct the same matrix on each processor (and
> solve the system with a different right-hand side vector on each processor).
> When each linear system is around 315x315 (block-sparse), each one is solved
> very quickly on every processor (approx 7x10^{-4} seconds), but when I
> increase the size of the linear system to 350x350 (or larger), the linear
> solves completely stall. I've tried a number of different solvers and
> preconditioners, but nothing seems to help. Also, this code has worked very
> well on other machines, although the machines I have used it on before have
> not had this architecture in which each node is an SMP unit. I was wondering
> if you have observed this kind of issue before?
> 
> I'm using PETSc 2.3.3, compiled with the Intel 10.1 compiler.
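
For reference, the per-process setup described above looks roughly like the
sketch below. This is not the poster's code: the system size, the tridiagonal
matrix values, and the right-hand side are placeholders, and the calls are
written against a current PETSc API (some argument lists differ slightly in
2.3.3).

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat         A;
    Vec         b, x;
    KSP         ksp;
    PetscInt    i, n = 350;      /* per-process system size (placeholder) */
    PetscMPIInt rank;

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    /* Sequential matrix on PETSC_COMM_SELF: every rank owns a full copy */
    MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A);
    for (i = 0; i < n; i++) {
      MatSetValue(A, i, i, 2.0, INSERT_VALUES);
      if (i > 0)   MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);
      if (i < n-1) MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* Rank-dependent right-hand side, one independent solve per process */
    VecCreateSeq(PETSC_COMM_SELF, n, &b);
    VecDuplicate(b, &x);
    VecSet(b, (PetscScalar)(rank + 1));

    /* Per-rank sequential solver: no MPI communication should occur here */
    KSPCreate(PETSC_COMM_SELF, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }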

I would suggest running the code in a debugger to determine the exact
location where the stall happens [with the minimum number of processes]

mpiexec -n 4 ./exe -start_in_debugger

By default the command above tries to open xterms on the local host, so to
get this working on the cluster you might need SSH X11 port forwarding set
up to the node, and then use the extra command-line option '-display'.
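For example (the host and display number here are placeholders for your own
X11 setup):

mpiexec -n 4 ./exe -start_in_debugger -display mydesktop:0.0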

[When the job appears to hang, I would hit Ctrl-C in gdb and look at the
stack trace on each MPI process.]
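
For example, after interrupting one of the gdb sessions with Ctrl-C:

(gdb) bt
(gdb) thread apply all bt

'bt' prints the stack trace of that process; 'thread apply all bt' covers
every thread in it, if the process is multithreaded.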

Satish



