bombing out writing large scratch files
Randall Mackie
randy at geosystem.us
Sun May 28 09:46:23 CDT 2006
Satish,
Yes, PETSc was compiled in debug mode.
Since I'm simply storing vectors in a temporary file, could I get
around this by using VecView to write each vector to the same
binary viewer, and then reading them back later with VecLoad?
In other words:
   do loop = 1, n
      call VecView(xvec(:,loop) .....)
   end do

then later

   do loop = 1, n
      call VecLoad(xvec(:,loop) ....)
   end do
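Concretely, I'm picturing something like the following - just a sketch,
where the file name is a placeholder, 'viewer' is a PetscViewer and 'ierr'
a PetscErrorCode declared elsewhere, and each xvec(...) is assumed to be a
PETSc Vec that already exists with the right layout (the VecLoad calling
sequence has changed across PETSc versions, so the exact arguments may
need adjusting):

!     write side: put all n vectors, in order, into one binary file
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'scratch.dat',
     &                           FILE_MODE_WRITE, viewer, ierr)
      do loop = 1, n
         call VecView(xvec(loop), viewer, ierr)
      end do
      call PetscViewerDestroy(viewer, ierr)

!     read side: read them back in the same order they were written
      call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'scratch.dat',
     &                           FILE_MODE_READ, viewer, ierr)
      do loop = 1, n
         call VecLoad(xvec(loop), viewer, ierr)
      end do
      call PetscViewerDestroy(viewer, ierr)

The binary viewer just appends each vector to the file, so as long as the
vectors are read back in the same order they were written, this should
behave like the scratch file does now.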
Randy
P.S. I'll try your other suggestions as well. However, this code has worked
flawlessly until now, and the current model is much, much larger than any
I've used in the past.
Satish Balay wrote:
> - Not sure what SIGUSR1 means in this context.
>
> - The stack doesn't show any PETSc/user code. Was
> this code compiled with a debug version of PETSc?
>
> - It could be that gdb is unable to look at the Intel compiler's stack
> [normally gdb should work]. If that's the case, you could run with
> '-start_in_debugger idb'.
>
> - It appears that this breakage is from user code which calls Fortran
> I/O [for_write_dir_xmit()]. There is no Fortran I/O on the PETSc side
> of the code. I think it could still be a bug in the user code.
>
> However, PETSc does try to detect the availability of
> _intel_fast_memcpy() and use it from the C side. I don't think this
> is the cause, but to verify you could remove the flag
> PETSC_HAVE__INTEL_FAST_MEMCPY from petscconf.h and rebuild the libraries.
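For reference, the flag in question is a generated define in petscconf.h
(its exact location depends on the PETSc version and PETSC_ARCH); the idea
would be to comment out or delete something like the line below and then
rebuild the PETSc libraries:

    #define PETSC_HAVE__INTEL_FAST_MEMCPY 1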
>
> Satish
>
>
> On Sun, 28 May 2006, Randall Mackie wrote:
>
>> Satish,
>>
>> Thanks, using method (2) worked. However, when I run a bt in gdb,
>> I get the following output:
>>
>> Loaded symbols for /lib/libnss_files.so.2
>> 0x080b2631 in d3inv_3_3 () at d3inv_3_3.F:2063
>> 2063 call VecAssemblyBegin(xyz,ierr)
>> (gdb) cont
>> Continuing.
>>
>> Program received signal SIGUSR1, User defined signal 1.
>> [Switching to Thread 1082952160 (LWP 23496)]
>> 0x088cd729 in _intel_fast_memcpy.J ()
>> Current language: auto; currently fortran
>> (gdb) bt
>> #0 0x088cd729 in _intel_fast_memcpy.J ()
>> #1 0x40620628 in for_write_dir_xmit ()
>> from /opt/intel_fc_80/lib/libifcore.so.5
>> #2 0xbfffa6b0 in ?? ()
>> #3 0x00000008 in ?? ()
>> #4 0xbfff986c in ?? ()
>> #5 0xbfff9890 in ?? ()
>> #6 0x406873a8 in __dtors_list_end () from /opt/intel_fc_80/lib/libifcore.so.5
>> #7 0x00000002 in ?? ()
>> #8 0x00000000 in ?? ()
>> (gdb)
>>
>> This all makes me think this is an INTEL compiler bug, and has nothing to
>> do with my code.
>>
>> Any ideas?
>>
>> Randy
>>
>>
>> Satish Balay wrote:
>>> Looks like you have direct access to all the cluster nodes. Perhaps
>>> you have admin access? You can do either of the following:
>>>
>>> * If the cluster frontend/compute nodes have a common filesystem [i.e.
>>> all machines can see the same ~/.Xauthority file] and you can get the
>>> 'sshd' settings on the frontend changed - then:
>>>
>>> - configure sshd with 'X11UseLocalhost no' - this way xterms on the
>>> compute nodes can connect to the ssh X11 port on the frontend - run
>>> the PETSc app with: '-display frontend:ssh-x11-port'
>>>
>>> * However, if the above is not possible - but you can ssh directly to
>>> all the compute nodes [perhaps from the frontend] - then you can
>>> cascade X11 forwarding with:
>>>
>>> - ssh from desktop to frontend
>>> - ssh from frontend to node-9 [if you know which machine is node-9
>>> from the machine file]
>>> - if you don't know which one is node-9 - then ssh from the frontend
>>> to all the nodes :). Most likely all nodes will get a display
>>> 'localhost:10.0'
>>> - so now you can run the executable with the option
>>> -display localhost:10.0 [see the command sketch just below]
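A minimal sketch of that cascade (the user name, host names, and display
number here are placeholders - the display to use is whatever $DISPLAY
reports in the final ssh session on the node, and that session must stay
open while the job runs, since the forwarding lives in it):

    # on the desktop
    ssh -X user@frontend
    # on the frontend
    ssh -X node-9
    # on node-9: check which display the forwarded X11 connection got
    echo $DISPLAY            # e.g. localhost:10.0

The PETSc executable would then be run with '-display localhost:10.0'
(or whatever $DISPLAY reported), together with -start_in_debugger.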
>>>
>>> The other alternative that might work [for interactive runs] is:
>>>
>>> -start_in_debugger noxterm -debugger_nodes 9
>>>
>>> Satish
>>>
>>> On Sat, 27 May 2006, Randall Mackie wrote:
>>>
>>>> I can't seem to get the debugger to pop up on my screen.
>>>>
>>>> When I'm logged into the cluster I'm working on, I can
>>>> type 'xterm &', and an xterm pops up on my display. So I know
>>>> I can get an X window back from the remote cluster.
>>>>
>>>> Now, when I try this using PETSc, I'm getting the following error
>>>> message, for example:
>>>>
>>>> ------------------------------------------------------------------------
>>>> [17]PETSC ERROR: PETSC: Attaching gdb to
>>>> /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of pid 3628 on display
>>>> 24.5.142.138:0.0 on machine compute-0-23.local
>>>> ------------------------------------------------------------------------
>>>>
>>>> I'm using this in my command file:
>>>>
>>>> source ~/.bashrc
>>>> time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
>>>> /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
>>>> -start_in_debugger \
>>>> -debugger_nodes 1 \
>>>> -display 24.5.142.138:0.0 \
>>>> -em_ksp_type bcgs \
>>>> -em_sub_pc_type ilu \
>>>> -em_sub_pc_factor_levels 8 \
>>>> -em_sub_pc_factor_fill 4 \
>>>> -em_sub_pc_factor_reuse_ordering \
>>>> -em_sub_pc_factor_reuse_fill \
>>>> -em_sub_pc_factor_mat_ordering_type rcm \
>>>> -divh_ksp_type cr \
>>>> -divh_sub_pc_type icc \
>>>> -ppc_sub_pc_type ilu \
>>>> << EOF
>>
>
--
Randall Mackie
GSY-USA, Inc.
PMB# 643
2261 Market St.,
San Francisco, CA 94114-1600
Tel (415) 469-8649
Fax (415) 469-5044
California Registered Geophysicist
License No. GP 1034