[petsc-dev] BG hang still broken in petsc-maint!
Satish Balay
balay at mcs.anl.gov
Wed Dec 18 13:45:32 CST 2013
On Wed, 18 Dec 2013, Jed Brown wrote:
> Satish Balay <balay at mcs.anl.gov> writes:
>
> > Works for me on vesta with [the following on sys/examples/tutorials/ex1]
> >
> > runjob --np 8192 --ranks-per-node 16 --cwd $PWD --block VST-00440-33771-512 : $PWD/ex1 -log_summary
>
> This is only 512 nodes. According to ALCF, the probability of MPI_Bcast
> crossing paths goes way up at more than 1024 nodes. IBM should really
> fix this problem, but until then, the workaround is to fall back to the
> reference implementations (PAMID_COLLECTIVES=0) which are sometimes
> also faster (go figure).
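A sketch of the workaround described above, assuming the Blue Gene/Q `runjob` launcher's `--envs` option and reusing the block name and example from earlier in the thread (block names are allocation-specific, so substitute your own):

```
# Fall back to the reference MPI collectives instead of the PAMID
# optimized ones, to avoid the MPI_Bcast hang at >1024 nodes.
runjob --np 8192 --ranks-per-node 16 \
       --envs PAMID_COLLECTIVES=0 \
       --cwd $PWD --block VST-00440-33771-512 \
       : $PWD/ex1 -log_summary
```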
I had a chat with Derek this morning. The error case was with 512
nodes [same as above] with --ranks-per-node 4 or 8, and this was on
cetus. The hang was confirmed to be in PetscInitialize [via the
debugger], and -skip_petscrc got past the hang.
Will try reproducing the problem on cetus.
Satish