bombing out writing large scratch files

Randall Mackie randy at geosystem.us
Sat May 27 19:38:56 CDT 2006


This is frustrating. Now I'm getting this message:


[randy at cluster Delta_May06]$ ./cmd_inv_petsc_3_3
Warning: No xauth data; using fake authentication data for X11 forwarding.
[the warning above was repeated 10 times]
------------------------------------------------------------------------
Petsc Release Version 2.3.1, Patch 13, Wed May 10 11:08:35 CDT 2006
BK revision: balay at asterix.mcs.anl.gov|ChangeSet|20060510160640|13832
See docs/changes/index.html for recent updates.
See docs/faq.html for hints about trouble shooting.
See docs/index.html for manual pages.
------------------------------------------------------------------------
/home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc on a linux-gnu named cluster.geo.hpc by randy Sun May 28 02:26:04 2006
Libraries linked from /home/randy/SPARSE/petsc-2.3.1-p13/lib/linux-gnu
Configure run at Sun May 28 00:21:24 2006
Configure options --with-fortran --with-fortran-kernels=generic --with-blas-lapack-dir=/opt/intel/mkl72cluster/lib/32 
--with-scalar-type=complex --with-debugging=1 --with-mpi-dir=/opt/mpich/intel --with-shared=0
------------------------------------------------------------------------
[0]PETSC ERROR: PETSC: Attaching gdb to /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of
pid 4861 on display 24.5.142.138:0.0 on machine cluster.geo.hpc



I noticed, however, that there are other mpirun commands in /opt/mpich/intel/bin,
like mpirun_dbg.gdb, so I'll give that a try and see if that helps.
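
Before that, I'll first check that a compute node can actually open a window
back on my display, along the lines Barry suggests below. Roughly something
like this (just a sketch; the IP address, machinefile, flags, and executable
path are the ones already used in this thread):

   # Can a compute node reach my display at all?
   /opt/mpich/intel/bin/mpirun -np 1 -nolocal -machinefile machines \
           xterm -display 24.5.142.138:0.0

   # If that works, retry the debugger run with Barry's corrections:
   # the option is -debugger_nodes (plural), and he suggests dropping
   # -nolocal and using -debugger_nodes 0.
   /opt/mpich/intel/bin/mpirun -np 20 -machinefile machines \
           /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
           -start_in_debugger \
           -debugger_nodes 0 \
           -display 24.5.142.138:0.0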

Randy


Barry Smith wrote:
> 
>   That's the correct message; it is trying to do the right
> thing. My guess is that the cluster node doesn't have a
> path back to your display, for example if the MPI jobs
> are not started up over ssh with X forwarding.
> 
>   What happens when you run
> mpirun -np 1 -nolocal -machinefile machines xterm -display 24.5.142.138:0.0
> Does it open an xterm on your system?
> 
>   BTW: it is -debugger_nodes 1 not -debugger_node 1
> 
> How about dropping -nolocal and using -debugger_nodes 0?
> 
>    Barry
> 
> 
> On Sat, 27 May 2006, Randall Mackie wrote:
> 
>> I can't seem to get the debugger to pop up on my screen.
>>
>> When I'm logged into the cluster I'm working on, I can
>> type xterm &, and an xterm pops up on my display. So I know
>> I can get something from the remote cluster.
>>
>> Now, when I try this using PETSc, I'm getting the following error
>> message, for example:
>>
>> ------------------------------------------------------------------------
>> [17]PETSC ERROR: PETSC: Attaching gdb to 
>> /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of pid 3628 on display 
>> 24.5.142.138:0.0 on machine compute-0-23.local
>> ------------------------------------------------------------------------
>>
>> I'm using this in my command file:
>>
>> source ~/.bashrc
>> time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
>>         /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
>>         -start_in_debugger \
>>         -debugger_node 1 \
>>         -display 24.5.142.138:0.0 \
>>         -em_ksp_type bcgs \
>>         -em_sub_pc_type ilu \
>>         -em_sub_pc_factor_levels 8 \
>>         -em_sub_pc_factor_fill 4 \
>>         -em_sub_pc_factor_reuse_ordering \
>>         -em_sub_pc_factor_reuse_fill \
>>         -em_sub_pc_factor_mat_ordering_type rcm \
>>         -divh_ksp_type cr \
>>         -divh_sub_pc_type icc \
>>         -ppc_sub_pc_type ilu \
>> << EOF
>> ...
>>
>>
>> Randy
>>
>>
>> Matthew Knepley wrote:
>>> 1) Make sure ssh is forwarding X (-Y I think)
>>>
>>> 2) -start_in_debugger
>>>
>>> 3) -display <your machine>:0.0
>>>
>>> should do it.
>>>
>>>    Matt
>>>
>>> On 5/27/06, Randall Mackie <randy at geosystem.us> wrote:
>>>
>>>     This is a stupid question, but how do I start in the debugger if I'm
>>>     running on a cluster half-way around the world and I'm working on
>>>     that cluster via ssh?
>>>
>>>     Randy
>>>
>>>
>>>     Matthew Knepley wrote:
>>>      > The best thing to do here is to get a stack trace from the debugger.
>>>      > From the description, it is hard to tell which statement is trying
>>>      > to access illegal memory.
>>>      >
>>>      >    Matt
>>>      >
>>>      > On 5/27/06, Randall Mackie <randy at geosystem.us> wrote:
>>>      >
>>>      >     In my PETSc-based modeling code, I write out intermediate
>>>      >     results to a scratch file, and then read them back later.
>>>      >     This has worked fine up until today, when, for a large model,
>>>      >     it seems to be causing my program to crash with errors like:
>>>      >
>>>      >     ------------------------------------------------------------------------
>>>      >     [9]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>      >     Violation, probably memory access out of range
>>>      >
>>>      >
>>>      >     I've tracked down the offending code to:
>>>      >
>>>      >                IF (rank == 0) THEN
>>>      >                  irec=(iper-1)*2+ipol
>>>      >                  write(7,rec=irec) (xvec(i),i=1,np)
>>>      >                END IF
>>>      >
>>>      >     It writes out xvec for the first record, but then on the
>>>      >     second record my program is crashing.
>>>      >
>>>      >     The record length (from an inquire statement) is recl = 22626552.
>>>      >
>>>      >     The size of the scratch file when my program crashes is 98M.
>>>      >
>>>      >     PETSc is compiled using the Intel compilers (v9.0 for Fortran),
>>>      >     and the user's manual says that you can have record lengths of
>>>      >     up to 2 billion bytes.
>>>      >
>>>      >     I'm kind of stuck as to what might be the cause. Any ideas
>>>      >     from anyone would be greatly appreciated.
>>>      >
>>>      >     Randy Mackie
>>>      >
>>>      >     ps. I've tried both the optimized and debugging versions of
>>>      >     the PETSc libraries, with the same result.
>>>      >
>>>      >
>>>      >     --
>>>      >     Randall Mackie
>>>      >     GSY-USA, Inc.
>>>      >     PMB# 643
>>>      >     2261 Market St.,
>>>      >     San Francisco, CA 94114-1600
>>>      >     Tel (415) 469-8649
>>>      >     Fax (415) 469-5044
>>>      >
>>>      >     California Registered Geophysicist
>>>      >     License No. GP 1034
>>>      >
>>>      >
>>>      >
>>>      >
>>>      > --
>>>      > "Failure has a thousand explanations. Success doesn't need one"
>>>     -- Sir
>>>      > Alec Guiness
>>>
>>>     --
>>>     Randall Mackie
>>>     GSY-USA, Inc.
>>>     PMB# 643
>>>     2261 Market St.,
>>>     San Francisco, CA 94114-1600
>>>     Tel (415) 469-8649
>>>     Fax (415) 469-5044
>>>
>>>     California Registered Geophysicist
>>>     License No. GP 1034
>>>
>>>
>>>
>>>
>>> -- 
>>> "Failure has a thousand explanations. Success doesn't need one" -- 
>>> Sir Alec Guiness
>>
>>
> 

-- 
Randall Mackie
GSY-USA, Inc.
PMB# 643
2261 Market St.,
San Francisco, CA 94114-1600
Tel (415) 469-8649
Fax (415) 469-5044

California Registered Geophysicist
License No. GP 1034



