bombing out writing large scratch files

Randall Mackie randy at geosystem.us
Sun May 28 12:00:57 CDT 2006


Thanks to everybody who helped me struggle with this problem. I've
learned a lot about debugging MPI programs on a cluster half-way
around the world.

It turns out that the problem was not a bug in the usual sense of, say,
having written x(i) where it should have been x(i-1).

Rather, in one of my subroutines I was using automatic arrays, and I
believe I was bumping up against the hard limit on stack memory (which
is where automatic arrays are placed).

I've rewritten the code to make those arrays allocatable, and now it
all runs okay. I realize I still have some reprogramming to do to
handle this properly in parallel, but at least that problem is solved.

Thanks again,

Randy

Satish Balay wrote:
>  - Not sure what SIGUSR1 means in this context.
> 
>  - The stack doesn't show any PETSc/user code. Was
>    this code compiled with a debug version of PETSc?
> 
> - It could be that gdb is unable to look at the Intel compiler's stack
>   [normally gdb should work]. If that's the case, you could run with
>   '-start_in_debugger idb'
> 
> - It appears that this breakage is from user code which calls Fortran
>   I/O [for_write_dir_xmit()]. There is no Fortran I/O on the PETSc side
>   of the code. I think it could still be a bug in the user code.
> 
> However, PETSc does try to detect the availability of
> _intel_fast_memcpy() and use it from the C side. I don't think this is
> the cause, but to verify you could remove the flag
> PETSC_HAVE__INTEL_FAST_MEMCPY from petscconf.h and rebuild the libraries.
> 
> Satish
> 
> 
> On Sun, 28 May 2006, Randall Mackie wrote:
> 
>> Satish,
>>
>> Thanks, using method (2) worked. However, when I run bt in gdb,
>> I get the following output:
>>
>> Loaded symbols for /lib/libnss_files.so.2
>> 0x080b2631 in d3inv_3_3 () at d3inv_3_3.F:2063
>> 2063          call VecAssemblyBegin(xyz,ierr)
>> (gdb) cont
>> Continuing.
>>
>> Program received signal SIGUSR1, User defined signal 1.
>> [Switching to Thread 1082952160 (LWP 23496)]
>> 0x088cd729 in _intel_fast_memcpy.J ()
>> Current language:  auto; currently fortran
>> (gdb) bt
>> #0  0x088cd729 in _intel_fast_memcpy.J ()
>> #1  0x40620628 in for_write_dir_xmit ()
>>    from /opt/intel_fc_80/lib/libifcore.so.5
>> #2  0xbfffa6b0 in ?? ()
>> #3  0x00000008 in ?? ()
>> #4  0xbfff986c in ?? ()
>> #5  0xbfff9890 in ?? ()
>> #6  0x406873a8 in __dtors_list_end () from /opt/intel_fc_80/lib/libifcore.so.5
>> #7  0x00000002 in ?? ()
>> #8  0x00000000 in ?? ()
>> (gdb)
>>
>> This all makes me think this is an Intel compiler bug and has nothing to
>> do with my code.
>>
>> Any ideas?
>>
>> Randy
>>
>>
>> Satish Balay wrote:
>>> Looks like you have direct access to all the cluster nodes. Perhaps
>>> you have admin access? You can do either of the following:
>>>
>>>  * If the cluster frontend/compute nodes share a common filesystem [i.e.
>>>    all machines see the same ~/.Xauthority file] and you can get the
>>>    'sshd' settings on the frontend changed, then:
>>>
>>>  - configure sshd with 'X11UseLocalhost no' - this way xterms on the
>>>    compute nodes can connect to the 'ssh-x11' port on the frontend
>>>  - run the PETSc app with: '-display frontend:ssh-x11-port'
>>>
>>>  * However, if the above is not possible - but you can ssh directly to
>>>    all the compute nodes [perhaps from the frontend] - then you can
>>>    cascade X11 forwarding with:
>>>
>>>  - ssh from desktop to frontend
>>>  - ssh from frontend to node-9 [if you know which machine is node-9
>>>    from the machine file]
>>>  - if you don't know which one is node-9, then ssh from the frontend
>>>    to all the nodes :). Most likely all nodes will get a display
>>>    'localhost:10.0'
>>>  - so now you can run the executable with the option
>>>        -display localhost:10.0
>>>
>>> The other alternative that might work [for interactive runs] is:
>>>
>>> -start_in_debugger noxterm -debugger_nodes 9
>>>
>>> Satish
>>>
>>> On Sat, 27 May 2006, Randall Mackie wrote:
>>>
>>>> I can't seem to get the debugger to pop up on my screen.
>>>>
>>>> When I'm logged into the cluster I'm working on, I can
>>>> type xterm &, and an xterm pops up on my display. So I know
>>>> I can get something from the remote cluster.
>>>>
>>>> Now, when I try this using PETSc, I'm getting the following error
>>>> message, for example:
>>>>
>>>> ------------------------------------------------------------------------
>>>> [17]PETSC ERROR: PETSC: Attaching gdb to
>>>> /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of pid 3628 on display
>>>> 24.5.142.138:0.0 on machine compute-0-23.local
>>>> ------------------------------------------------------------------------
>>>>
>>>> I'm using this in my command file:
>>>>
>>>> source ~/.bashrc
>>>> time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
>>>>          /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
>>>>          -start_in_debugger \
>>>>          -debugger_node 1 \
>>>>          -display 24.5.142.138:0.0 \
>>>>          -em_ksp_type bcgs \
>>>>          -em_sub_pc_type ilu \
>>>>          -em_sub_pc_factor_levels 8 \
>>>>          -em_sub_pc_factor_fill 4 \
>>>>          -em_sub_pc_factor_reuse_ordering \
>>>>          -em_sub_pc_factor_reuse_fill \
>>>>          -em_sub_pc_factor_mat_ordering_type rcm \
>>>>          -divh_ksp_type cr \
>>>>          -divh_sub_pc_type icc \
>>>>          -ppc_sub_pc_type ilu \
>>>> << EOF
>>
> 

-- 
Randall Mackie
GSY-USA, Inc.
PMB# 643
2261 Market St.,
San Francisco, CA 94114-1600
Tel (415) 469-8649
Fax (415) 469-5044

California Registered Geophysicist
License No. GP 1034



