NetBSD port

Satish Balay balay at
Wed Dec 16 20:33:52 CST 2009

On Thu, 17 Dec 2009, Kevin.Buckley at wrote:

> >> That summary misses the whole point of the errors I am seeing.
> >>
> >> The code runs fine locally AND under Sun Grid Engine, if you only
> >> spawn TWO processes but not FOUR or EIGHT.
> >
> > Well, the 'np 2' runs could be scheduled on your local node [or a
> > single SMP remote node].
> Well, they "could be", yes: they are not though.
> Look, you need to trust me when I tell you things (except for version
> numbers, ha ha).
> I would not be bothering you if I had not looked into this to a
> reasonable extent before deciding to bother you.
> I am in control of where the jobs are running.

Are you saying the 'np 2' run is on 2 different machines - and the
same for the 'np 4' and 'np 8' runs?

And what is OpenMPI using to communicate between these? Sockets?
Infiniband? Something else? Is this a cluster - or a distributed
machine setup?

> > And I suspect there is something wrong in your OpenMPI+SunGridEngine
> > config that's triggering this problem.
> I am happy to accept that and I even suggested that might be the case.
> I am happy to go and look around the OpenMPI and SGE sources, if that
> turns out to be the case.
> However, I came to the PETSc list for some insight from the PETSc
> error messages.
> If they can confirm/reject the notion that it might be an SGE/OpenMPI
> issue and not a PETSc one then I will have gained information.
> > I don't know exactly how though..
> So far, nothing has been confirmed either way.
> > [the basic PETSc examples are supposed to work in any valid
> > MPI environment].
> I don't doubt for a minute that they are supposed to.
> I am also aware that few people are likely to be using this
> software stack on NetBSD and thus there may be some gaps in
> your map of "valid MPI environments".
> > ok - MPI is shared. Can you confirm that the exact same version of
> > OpenMPI is installed on all the nodes - and that there are no minor
> > version differences that could trigger this?
> Just take that as read.
> Are you saying that the error messages PETSc is throwing out ARE
> consistent with a slightly mis-matched MPI then?

There is a SEGV trapped by the PETSc error handler. It doesn't know
exactly where it's happening. You'll have to run in a debugger to get
the exact location of this error and the stack trace. [I suspect the
segv is in OpenMPI code - but only a debugger can confirm/deny it]
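If running under a debugger interactively is awkward on the cluster, one way to get the stack trace is from a core dump - a rough sketch, assuming core dumps are enabled on the compute nodes, gdb is installed, and `./ex2` stands in for whatever PETSc example binary is failing:

```shell
# Allow core dumps in this shell (on NetBSD the core file is typically
# named <program>.core in the working directory).
ulimit -c unlimited

# Rerun the failing case to produce the core file.
mpiexec -np 4 ./ex2

# Load the core into gdb, then type 'bt' at the (gdb) prompt to print
# the stack trace of the crashed process.
gdb ./ex2 ex2.core
```

The frames in the backtrace should show whether the crash is inside OpenMPI's communication layer or in PETSc itself.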

Normally you could use -start_in_debugger with the PETSc binary to do
this - assuming the remote nodes can connect to the X server on your
desktop [directly or via ssh-port-forwarded X11].
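For example - a sketch, with `./ex2` again standing in for the actual failing binary and `mydesktop:0` for your X display:

```shell
# -start_in_debugger opens one xterm running gdb per MPI rank;
# -display tells the remote nodes where to draw those xterms.
mpiexec -np 4 ./ex2 -start_in_debugger gdb -display mydesktop:0

# Alternatively, only attach the debugger when an error is trapped:
mpiexec -np 4 ./ex2 -on_error_attach_debugger
```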


> I am building an OpenMPI with some debugging enabled at present. I'll
> get back to you once I have rolled it out across the nodes and have
> some more info.
> In the meantime, if you can think of anything I can tickle PETSc with,
> you being familiar with PETSC, so as to get some error messages that
> might tell you something, do let me know.

More information about the petsc-dev mailing list