bombing out writing large scratch files

Satish Balay balay at mcs.anl.gov
Sun May 28 09:39:33 CDT 2006


- Not sure what SIGUSR1 means in this context.
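  [One generic gdb recipe - nothing PETSc-specific - for checking
   whether the SIGUSR1 itself is the problem is to tell gdb not to stop
   on it but still deliver it to the program, and then continue:

     (gdb) handle SIGUSR1 nostop noprint pass
     (gdb) cont

   If the run then proceeds past this point, the signal is probably
   something the runtime uses internally and handles itself; if the
   program terminates on the signal, the SIGUSR1 really is what is
   killing the run.]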

- The stack doesn't show any PETSc/user code. Was this code compiled
  with a debug version of PETSc?

- It could be that gdb is unable to look at the Intel compiler's stack
  [normally gdb should work]. If that's the case, you could run with
  '-start_in_debugger idb'.
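  [For example, with the mpirun line from your command file below, that
   would be something like:

     /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
         /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
         -start_in_debugger idb -display 24.5.142.138:0.0 ...

   i.e. the same launch command, with 'idb' added after
   -start_in_debugger.]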

- It appears that this breakage is from user code which calls Fortran
  I/O [for_write_dir_xmit()]. There is no Fortran I/O on the PETSc side
  of the code. I think it could still be a bug in the user code.

However, PETSc does try to detect the availability of
_intel_fast_memcpy() and use it from the C side. I don't think this is
the cause, but to verify you could remove the flag
PETSC_HAVE__INTEL_FAST_MEMCPY from petscconf.h and rebuild the
libraries.
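For reference, that change is just disabling the corresponding define
in the petscconf.h used by your build [the exact path depends on the
PETSc version] - something like:

    /* #define PETSC_HAVE__INTEL_FAST_MEMCPY 1 */

and then rebuilding the PETSc libraries [e.g. 'make all' from
PETSC_DIR, or whatever target you normally build with] and relinking
your application.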

Satish


On Sun, 28 May 2006, Randall Mackie wrote:

> Satish,
> 
> Thanks, using method (2) worked. However, when I run a bt in gdb,
> I get the following output:
> 
> Loaded symbols for /lib/libnss_files.so.2
> 0x080b2631 in d3inv_3_3 () at d3inv_3_3.F:2063
> 2063          call VecAssemblyBegin(xyz,ierr)
> (gdb) cont
> Continuing.
> 
> Program received signal SIGUSR1, User defined signal 1.
> [Switching to Thread 1082952160 (LWP 23496)]
> 0x088cd729 in _intel_fast_memcpy.J ()
> Current language:  auto; currently fortran
> (gdb) bt
> #0  0x088cd729 in _intel_fast_memcpy.J ()
> #1  0x40620628 in for_write_dir_xmit ()
>    from /opt/intel_fc_80/lib/libifcore.so.5
> #2  0xbfffa6b0 in ?? ()
> #3  0x00000008 in ?? ()
> #4  0xbfff986c in ?? ()
> #5  0xbfff9890 in ?? ()
> #6  0x406873a8 in __dtors_list_end () from /opt/intel_fc_80/lib/libifcore.so.5
> #7  0x00000002 in ?? ()
> #8  0x00000000 in ?? ()
> (gdb)
> 
> This all makes me think this is an Intel compiler bug, and has nothing to
> do with my code.
> 
> Any ideas?
> 
> Randy
> 
> 
> Satish Balay wrote:
> > Looks like you have direct access to all the cluster nodes. Perhaps
> > you have admin access? You can do either of the following:
> > 
> >  * if the cluster frontend/compute nodes have a common filesystem [i.e.
> >    all machines can see the same ~/.Xauthority file] and you can get
> >    the 'sshd' settings on the frontend changed - then:
> > 
> >  - configure sshd with 'X11UseLocalhost no' - this way xterms on the
> >    compute nodes can connect to the 'ssh-x11' port on the frontend
> >  - run the PETSc app with: '-display frontend:ssh-x11-port'
> > 
> >  * However, if the above is not possible - but you can ssh directly to
> >    all the compute nodes [perhaps from the frontend] - then you can
> >    cascade X11 forwarding with:
> > 
> >  - ssh from the desktop to the frontend
> >  - ssh from the frontend to node-9 [if you know which machine is
> >    node-9 from the machine file]
> >  - if you don't know which one is node-9, then ssh from the frontend
> >    to all the nodes :). Most likely all nodes will get a display
> >    'localhost:10.0'
> >  - so now you can run the executable with the option
> >        -display localhost:10.0
> > 
> > The other alternative that might work [for interactive runs] is:
> > 
> > -start_in_debugger noxterm -debugger_nodes 9
> > 
> > Satish
> > 
> > On Sat, 27 May 2006, Randall Mackie wrote:
> > 
> > > I can't seem to get the debugger to pop up on my screen.
> > > 
> > > When I'm logged into the cluster I'm working on, I can
> > > type xterm &, and an xterm pops up on my display. So I know
> > > I can get something from the remote cluster.
> > > 
> > > Now, when I try this using PETSc, I'm getting the following error
> > > message, for example:
> > > 
> > > ------------------------------------------------------------------------
> > > [17]PETSC ERROR: PETSC: Attaching gdb to
> > > /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc of pid 3628 on display
> > > 24.5.142.138:0.0 on machine compute-0-23.local
> > > ------------------------------------------------------------------------
> > > 
> > > I'm using this in my command file:
> > > 
> > > source ~/.bashrc
> > > time /opt/mpich/intel/bin/mpirun -np 20 -nolocal -machinefile machines \
> > >          /home/randy/d3inv/PETSC_V3.3/d3inv_3_3_petsc \
> > >          -start_in_debugger \
> > >          -debugger_node 1 \
> > >          -display 24.5.142.138:0.0 \
> > >          -em_ksp_type bcgs \
> > >          -em_sub_pc_type ilu \
> > >          -em_sub_pc_factor_levels 8 \
> > >          -em_sub_pc_factor_fill 4 \
> > >          -em_sub_pc_factor_reuse_ordering \
> > >          -em_sub_pc_factor_reuse_fill \
> > >          -em_sub_pc_factor_mat_ordering_type rcm \
> > >          -divh_ksp_type cr \
> > >          -divh_sub_pc_type icc \
> > >          -ppc_sub_pc_type ilu \
> > > << EOF
> > 
> 
> 



