NetBSD port
Satish Balay
balay at mcs.anl.gov
Wed Dec 16 18:16:56 CST 2009
On Thu, 17 Dec 2009, Kevin.Buckley at ecs.vuw.ac.nz wrote:
> > Ok - the code runs locally fine - but not on 'SunGridEngine'
> >
>
> Not Ok.
>
> That summary misses the whole point of the errors I am seeing.
>
> The code runs fine locally AND under Sun Grid Engine, if you only
> spawn TWO processes but not FOUR or EIGHT.
Well, the '-np 2' runs could be scheduled on your local node [or on a
single SMP remote node]. So a different code path within the MPI
library could get used in the 2-proc vs 4-proc case [shared memory vs
tcp/some-other communication].
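One quick check [assuming your OpenMPI build has the usual tcp and sm
btls - the exact mca params may differ with your install] is to force
the tcp path even for the 2-proc run and see if the failure shows up
there too:

$ mpirun -np 2 --mca btl tcp,self ./ex19

or just disable the shared memory btl:

$ mpirun -np 2 --mca btl ^sm ./ex19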
Perhaps you can get the nodefile list for each of these [2, 4, 8 proc]
runs and see how the 2-proc run differs.
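With a tight OpenMPI/SGE integration the allocated hosts should be
listed in $PE_HOSTFILE [assuming your setup sets that variable] - so
inside each job script something like:

$ cat $PE_HOSTFILE
$ mpirun hostname

would show how the 2, 4 and 8 proc runs get placed.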
And I suspect there is something wrong in your OpenMPI+SunGridEngine
config that's triggering this problem - though I don't know exactly
how. [The basic PETSc examples are supposed to work in any valid MPI
environment.]
> > Wrt SGE - what does it require from MPI. Is it MPI agnostic - or does
> > it need a particular MPI to be used?
>
> It is more the other way around.
>
> OpenMPI has been compiled so as to be aware of SGE.
ok.
> But anyroad, what are the error messages, from PETSc, telling you
> is possibly going wrong here?
>>>>>>>>>
[2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,probably memory access out of range
[2]PETSC ERROR: [2] VecScatterCreateCommon_PtoS line 1699 src/vec/vec/utils/vpscat.c
[2]PETSC ERROR: [2] VecScatterCreate_PtoS line 1508 src/vec/vec/utils/vpscat.c
[2]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
<<<<<<<<<
Well it says there was a SEGV - and it gives some approximate
location. It could be inside the MPI code in those routines listed
here. A run in a debugger will confirm the exact location. [assuming
this can be done on this SGE]
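PETSc has run-time options that can help here - for e.g. [these spawn
an xterm+gdb per process, so they need X access back from the compute
nodes - which might not be possible in SGE batch mode]:

$ mpirun -np 4 ./ex19 -start_in_debugger gdb
$ mpirun -np 4 ./ex19 -on_error_attach_debugger gdb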
>>>>>>>>>>
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
[0]PETSC ERROR: Memory allocated 90628 Memory used by process 0
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
[0]PETSC ERROR: Memory requested 320!
<<<<<<<<<<<
Malloc failing at this low memory allocation? Something else is going
wrong here.
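You could also rerun with the options the message itself suggests, to
see the allocation trace before the failure:

$ mpirun -np 4 ./ex19 -malloc_dump
$ mpirun -np 4 ./ex19 -malloc_log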
> > BTW: what do you have for 'ldd ex19'?
>
> $ldd ex19
> ex19:
> -lc.12 => /usr/lib/libc.so.12
> -lXau.6 => /usr/pkg/lib/libXau.so.6
> -lXdmcp.6 => /usr/pkg/lib/libXdmcp.so.6
> -lX11.6 => /usr/pkg/lib/libX11.so.6
> -lltdl.3 => /usr/pkg/lib/libltdl.so.3
> -lutil.7 => /usr/lib/libutil.so.7
> -lm.0 => /usr/lib/libm.so.0
> -lpthread.0 => /usr/lib/libpthread.so.0
> -lopen-pal.0 => /usr/pkg/lib/libopen-pal.so.0
> -lopen-rte.0 => /usr/pkg/lib/libopen-rte.so.0
> -lmpi.0 => /usr/pkg/lib/libmpi.so.0
> -lmpi_f77.0 => /usr/pkg/lib/libmpi_f77.so.0
> -lstdc++.6 => /usr/lib/libstdc++.so.6
> -lgcc_s.1 => /usr/lib/libgcc_s.so.1
> -lmpi_cxx.0 => /usr/pkg/lib/libmpi_cxx.so.0
ok - mpi is shared. Can you confirm that the exact same version of
OpenMPI is installed on all the nodes - and that there are no minor
version differences that could trigger this?
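For e.g. [assuming ompi_info is in the default PATH on each node] you
could log in to each compute node and compare the version string and
the installed library:

$ ompi_info | grep 'Open MPI:'
$ ls -l /usr/pkg/lib/libmpi.so.0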
Satish