NetBSD port

Satish Balay balay at mcs.anl.gov
Wed Dec 16 18:16:56 CST 2009


On Thu, 17 Dec 2009, Kevin.Buckley at ecs.vuw.ac.nz wrote:

> > Ok - the code runs locally fine - but not on  'SunGridEngine'
> >
> 
> Not Ok.
> 
> That summary misses the whole point of the errors I am seeing.
> 
> The code runs fine locally AND under Sun Grid Engine, if you only
> spawn TWO processes but not FOUR or EIGHT.

Well, the 'np 2' runs could be scheduled on your local node [or a
single remote SMP node]. So it could be that a different code path
within the MPI library gets used in the 2-proc vs 4-proc case [shared
memory vs tcp or some other transport].
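
One way to check whether the transport matters [a sketch, assuming a
standard OpenMPI install with the usual tcp/sm/self BTLs enabled]:

  # force the tcp transport even for a single-node run
  mpirun -np 4 -mca btl tcp,self ./ex19
  # compare against shared memory on the same node
  mpirun -np 4 -mca btl sm,self ./ex19

If the tcp run fails locally the same way the 4-proc SGE run does,
the problem is in the communication path rather than in the SGE
scheduling itself.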

Perhaps you can get the nodefile list for each of these [2,4,8 proc]
runs and see how the 2-proc run differs. [petsc only]
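
A minimal way to capture that from inside the job script [assuming
SGE's standard $PE_HOSTFILE and $NSLOTS variables]:

  # dump the host list SGE allocated to this job before running
  echo "hosts for this run:"
  cat $PE_HOSTFILE
  mpirun -np $NSLOTS ./ex19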

And I suspect there is something wrong in your OpenMPI+SunGridEngine
config that's triggering this problem. I don't know exactly how,
though. [The basic PETSc examples are supposed to work in any valid
MPI environment.]
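
You could at least verify that the OpenMPI build actually has SGE
support compiled in [assuming ompi_info from the same install is in
your PATH]:

  # the gridengine MCA components should be listed if SGE support is built in
  ompi_info | grep gridengine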

> > Wrt SGE - what does it require from MPI? Is it MPI agnostic - or does
> > it need a particular MPI to be used?
> 
> It is more the other way around.
> 
> OpenMPI has been compiled so as to be aware of SGE.

ok.

> But anyroad, what are the error messages, from PETSc, telling you
> is possibly going wrong here?

>>>>>>>>>
[2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

[2]PETSC ERROR: [2] VecScatterCreateCommon_PtoS line 1699 src/vec/vec/utils/vpscat.c
[2]PETSC ERROR: [2] VecScatterCreate_PtoS line 1508 src/vec/vec/utils/vpscat.c

[2]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
<<<<<<<<<

Well, it says there was a SEGV - and it gives an approximate
location. It could be inside the MPI code called from the routines
listed here. A run in a debugger would confirm the exact location
[assuming that can be done under this SGE setup].
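
PETSc has runtime options that attach the debugger for you [standard
PETSc options; by default they pop up an xterm per process, so they
need a working display from the compute nodes]:

  # attach gdb only when the error/signal is caught
  mpirun -np 4 ./ex19 -on_error_attach_debugger
  # or start every process under gdb from the beginning
  mpirun -np 4 ./ex19 -start_in_debugger gdb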

>>>>>>>>>>
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
[0]PETSC ERROR: Memory allocated 90628 Memory used by process 0
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
[0]PETSC ERROR: Memory requested 320!
<<<<<<<<<<<

Malloc failing at such a small allocation [320 bytes requested]?
Something else is going wrong here.


> > BTW: what do you have for 'ldd ex19'?
> 
> $ldd ex19
> ex19:
>         -lc.12 => /usr/lib/libc.so.12
>         -lXau.6 => /usr/pkg/lib/libXau.so.6
>         -lXdmcp.6 => /usr/pkg/lib/libXdmcp.so.6
>         -lX11.6 => /usr/pkg/lib/libX11.so.6
>         -lltdl.3 => /usr/pkg/lib/libltdl.so.3
>         -lutil.7 => /usr/lib/libutil.so.7
>         -lm.0 => /usr/lib/libm.so.0
>         -lpthread.0 => /usr/lib/libpthread.so.0
>         -lopen-pal.0 => /usr/pkg/lib/libopen-pal.so.0
>         -lopen-rte.0 => /usr/pkg/lib/libopen-rte.so.0
>         -lmpi.0 => /usr/pkg/lib/libmpi.so.0
>         -lmpi_f77.0 => /usr/pkg/lib/libmpi_f77.so.0
>         -lstdc++.6 => /usr/lib/libstdc++.so.6
>         -lgcc_s.1 => /usr/lib/libgcc_s.so.1
>         -lmpi_cxx.0 => /usr/pkg/lib/libmpi_cxx.so.0

ok - MPI is shared. Can you confirm that the exact same version of
OpenMPI is installed on all the nodes - and that there are no minor
version differences that could trigger this?
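
A quick way to compare [hypothetical check - the node name and install
path below are placeholders for your setup]:

  # version string on the head node
  ompi_info | grep "Open MPI:"
  # version string on one of the compute nodes
  ssh some-compute-node /usr/pkg/bin/ompi_info | grep "Open MPI:"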

Satish


