NetBSD port

Satish Balay balay at
Wed Dec 16 20:33:52 CST 2009

On Thu, 17 Dec 2009, Kevin.Buckley at wrote:

> >> That summary misses the whole point of the errors I am seeing.
> >>
> >> The code runs fine locally AND under Sun Grid Engine, if you only
> >> spawn TWO processes but not FOUR or EIGHT.
> >
> > Well, the 'np 2' runs could be scheduled on your local node [or a
> > single SMP remote node].
> Well, they "could be", yes: they are not though.
> Look, you need to trust me when I tell you things (except for version
> numbers, ha ha).
> I would not be bothering you if I had not looked into this to a
> reasonable extent before deciding to bother you.
> I am in control of where the jobs are running.

Are you saying the 'np 2' run is on 2 different machines - and the
same for the 'np 4' and 'np 8' runs?

And what is OpenMPI using to communicate between these? Sockets?
Infiniband? Something else? Is this a cluster - or a distributed
machine setup?

> > And I suspect there is something wrong in your OpenMPI+SunGridEngine
> > config that's triggering this problem.
> I am happy to accept that and I even suggested that might be the case.
> I am happy to go and look around the OpenMPI and SGE sources, if that
> turns out to be the case.
> However, I came to the PETSc list for some insight from the PETSc
> error messages.
> If they can confirm/reject the notion that it might be an SGE/OpenMPI
> issue and not a PETSc one then I will have gained information.
> > I don't know exactly how though..
> So far, nothing has been confirmed either way.
> > [the basic PETSc examples are supposed to work in any valid
> > MPI environment].
> I don't doubt for a minute that they are supposed to.
> I am also aware that few people are likely to be using this
> software stack on NetBSD and thus there may be some gaps in
> your map of "valid MPI environments".
> > ok - MPI is shared. Can you confirm that the exact same version of
> > OpenMPI is installed on all the nodes - and that there are no minor
> > version differences that could trigger this?
> Just take that as read.
> Are you saying that the error messages PETSc is throwing out ARE
> consistent with a slightly mis-matched MPI then?

There is a SEGV trapped by the PETSc error handler. It doesn't know
exactly where it's happening. You'll have to run in a debugger to get
the exact location of this error and the stack trace. [I suspect the
segv is in OpenMPI code - but only a debugger can confirm/deny it]
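If running under a debugger interactively is awkward on the cluster, one way to get the stack trace is from a core dump - a rough sketch, assuming core dumps are enabled on the compute nodes, gdb is installed, and `./ex2` stands in for whatever PETSc example binary is failing:

```shell
# Allow core dumps in this shell (on NetBSD the core file is typically
# named <program>.core in the working directory).
ulimit -c unlimited

# Rerun the failing case to produce the core file.
mpiexec -np 4 ./ex2

# Load the core into gdb, then type 'bt' at the (gdb) prompt to print
# the stack trace of the crashed process.
gdb ./ex2 ex2.core
```

The frames in the backtrace should show whether the crash is inside OpenMPI's communication layer or in PETSc itself.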

Normally you could use -start_in_debugger with the PETSc binary to do
this - assuming the remote nodes can connect to the X server on your
desktop [directly or via ssh-port-forwarded X11].
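For example - a sketch, with `./ex2` again standing in for the actual failing binary and `mydesktop:0` for your X display:

```shell
# -start_in_debugger opens one xterm running gdb per MPI rank;
# -display tells the remote nodes where to draw those xterms.
mpiexec -np 4 ./ex2 -start_in_debugger gdb -display mydesktop:0

# Alternatively, only attach the debugger when an error is trapped:
mpiexec -np 4 ./ex2 -on_error_attach_debugger
```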


> I am building an OpenMPI with some debugging enabled at present. I'll
> get back to you once I have rolled it out across the nodes and have
> some more info.
> In the meantime, if you can think of anything I can tickle PETSc with,
> you being familiar with PETSC, so as to get some error messages that
> might tell you something, do let me know.

More information about the petsc-dev mailing list