[petsc-users] Problem with PETSc + HDF5 VecView

Matthew Knepley knepley at gmail.com
Tue Nov 25 11:47:43 CST 2014


On Mon, Nov 24, 2014 at 1:10 PM, Håkon Strandenes <haakon at hakostra.net>
wrote:

> Hi,
>
> I have some problems with PETSc and HDF5 VecLoad/VecView. The VecLoad
> problems can rest for now, but the VecView problems are more serious.
>
> In short: I have a 3D DMDA and some vectors that I want to save to an
> HDF5 file. This works perfectly on my workstation, but not on the compute
> cluster I have access to. I have attached a typical error message.
>
> I have also attached a piece of code that can trigger the error. The code
> is merely a 2D->3D rewrite of DMDA ex 10 (http://www.mcs.anl.gov/petsc/
> petsc-current/src/dm/examples/tutorials/ex10.c.html); nothing else is
> changed.
>
> The program typically works on a small number of processes. I have
> successfully executed the attached program on up to 32 processes. That
> works, always. I have never had a single success when trying to run on 64
> processes; it always fails with the same error.
>
> The computer I am struggling with is an SGI machine with SLES 11sp1 and
> Intel CPUs, hence I have used Intel's compilers. I have tried the 2013,
> 2014 and 2015 versions of the compilers, so that's probably not the cause.
> I have also tried GCC 4.9.1, just to be safe, and got the same error there.
> The same compiler is used for both HDF5 and PETSc. The same error message
> occurs for both debug and release builds. I have tried HDF5 versions 1.8.11
> and 1.8.13. I have tried PETSc version 3.4.1 and the latest from Git. The
> MPI implementation on the machine is SGI's MPT, and I have tried both 2.06
> and 2.10. Always the same error. Other MPI implementations are
> unfortunately not available.
>
> What really drives me mad is that this works like a charm on my
> workstation with Linux Mint... I have successfully executed the attached
> example on 254 processes (my machine breaks down if I try anything more
> than that).
>
> Do any of you have any tips on how to attack this problem and find out
> what's wrong?
>
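
[For reference, here is a minimal sketch of the kind of setup described
above. This is not the actual attachment; it assumes the PETSc 3.5-era DMDA
and HDF5 viewer API, and error checking with CHKERRQ is omitted for brevity.]

/* Create a vector on a 3D DMDA and write it to an HDF5 file, roughly a
 * 3D analogue of src/dm/examples/tutorials/ex10.c.  Note that PETSc 3.4
 * spells DM_BOUNDARY_NONE as DMDA_BOUNDARY_NONE. */
#include <petscdmda.h>
#include <petscviewerhdf5.h>

int main(int argc, char **argv)
{
  DM          da;
  Vec         x;
  PetscViewer viewer;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* 3D structured grid, 1 dof per grid point, stencil width 1 */
  DMDACreate3d(PETSC_COMM_WORLD,
               DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DMDA_STENCIL_STAR, 128, 128, 128,
               PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
               1, 1, NULL, NULL, NULL, &da);

  DMCreateGlobalVector(da, &x);
  PetscObjectSetName((PetscObject)x, "testvec");
  VecSet(x, 1.0);

  /* Parallel write of the vector to HDF5; this VecView is where the
   * reported error occurs on 64 or more processes. */
  PetscViewerHDF5Open(PETSC_COMM_WORLD, "testvec.h5", FILE_MODE_WRITE, &viewer);
  VecView(x, viewer);
  PetscViewerDestroy(&viewer);

  VecDestroy(&x);
  DMDestroy(&da);
  PetscFinalize();
  return 0;
}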

This does sound like a pain to track down. It seems to be complaining about
an MPI datatype:

  #005: H5Dmpio.c line 998 in H5D__link_chunk_collective_io(): MPI_Type_struct failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #006: H5Dmpio.c line 998 in H5D__link_chunk_collective_io(): Invalid datatype argument
    major: Internal error (too specific to document in detail)
    minor: MPI Error String

In this call, we pass in 'scalartype', which is H5T_NATIVE_DOUBLE (unless you
configured for single precision). This was used successfully to create the
dataset, so it is unlikely to be the problem. I am guessing that HDF5 creates
internal MPI datatypes to use in the MPI-IO routines (maybe using
MPI_Type_struct).
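
[For context, a rough sketch (not the actual PETSc source) of the HDF5 call
pattern underneath such a VecView, with placeholder file and dataset names.
The collective transfer property is what routes the write through
H5D__link_chunk_collective_io(), where the MPI datatype in the error stack is
built.]

/* MPI-IO file, chunked double-precision dataset ('scalartype' =
 * H5T_NATIVE_DOUBLE), and a collective H5Dwrite. */
#include <hdf5.h>
#include <mpi.h>

void write_collective(MPI_Comm comm, const double *buf,
                      hid_t memspace, hid_t filespace,
                      const hsize_t chunk_dims[3])
{
  /* Open the file with the MPI-IO file driver */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
  hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  /* Chunked dataset of native doubles */
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 3, chunk_dims);
  hid_t dset = H5Dcreate(file, "testvec", H5T_NATIVE_DOUBLE, filespace,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);

  /* Collective transfer: this is what sends the write through
   * H5D__link_chunk_collective_io(), where the MPI struct type is built. */
  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

  H5Pclose(dxpl); H5Pclose(dcpl); H5Pclose(fapl);
  H5Dclose(dset); H5Fclose(file);
}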

I believe we have seen type creation routines fail in some MPI
implementations if you try to create too many of them. Right now, this looks
a lot like a bug in MPT, although it might be an HDF5 bug, with HDF5
forgetting to release MPI types that it no longer needs.
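
[One way to test that hypothesis independently of PETSc and HDF5 is a small
standalone program that creates many derived datatypes. The loop count and
the two-member struct layout below are arbitrary; running it with any
command-line argument skips MPI_Type_free(), mimicking a library that leaks
types.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* Return error codes instead of aborting, so the failure point is visible */
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  int          blocklens[2] = {1, 1};
  MPI_Aint     displs[2]    = {0, sizeof(double)};
  MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};
  int          leak         = (argc > 1);

  for (int i = 0; i < 100000; i++) {
    MPI_Datatype t;
    if (MPI_Type_create_struct(2, blocklens, displs, types, &t) != MPI_SUCCESS) {
      printf("MPI_Type_create_struct failed at iteration %d\n", i);
      break;
    }
    MPI_Type_commit(&t);
    if (!leak) MPI_Type_free(&t);
  }

  MPI_Finalize();
  return 0;
}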

  Thanks,

    Matt


> Regards,
> Håkon Strandenes
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener