Freeze trying to read compound datasets with variable length strings #251

Open
geolehmann opened this issue Jul 14, 2023 · 9 comments

Comments

@geolehmann

geolehmann commented Jul 14, 2023

Hi everybody,

I have problems reading compound datasets that contain variable-length strings. Reading other member types such as floats or integers from a compound works fine, but my application freezes completely when I try to read the strings. Interestingly, I can read normal string datasets; the problem only occurs for compound datasets with strings (using VarLenAscii/VarLenUnicode/VarLenArray).

I am using the h5-types crate with the "h5_alloc" feature enabled under Windows 10 with version 1.14.0 of the HDF5 library.

This is the relevant code I use for loading the dataset:

#[derive(H5Type, Debug, Clone, PartialEq)]
#[repr(C)]
pub struct Index {
    pub start_index: u32,
    pub size: u32,
    pub object_ID: hdf5::types::VarLenUnicode,
    pub data_ID: hdf5::types::VarLenUnicode,
}

let index_dataset = file.dataset(&path).unwrap();
let index_data = index_dataset.read_1d::<h5well::Index>();

The screenshot below, taken from HDFView, shows the structure of the dataset I am trying to load:
[image: HDFView screenshot of the dataset structure]

I tried to hunt down the problem and it seems to be somewhere in the "read_into_buf" function, but I am stuck now. Has anybody encountered a similar issue, or can anyone point me in the right direction? Thanks in advance for any help!

@mulimoen
Collaborator

A hang in read_into_buf suggests that the HDF5 library itself is doing some work or locking up. Do you have a debugger available to obtain a stack trace? Does reading a single element (Index) finish?

If the dataset is openly available, I could check whether this can be reproduced on Linux and debug it further.
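
One way to tell a genuine hang apart from slow work, without freezing the whole application, is to run the blocking call on a helper thread and wait for it with a timeout. The sketch below uses only the standard library; `run_with_timeout` is a hypothetical helper name, not part of the hdf5 crate, and the closure stands in for whatever read call is suspected of hanging.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and wait at most `limit` for its result.
/// Returns `None` on timeout or if `work` panicked. On timeout the helper
/// thread is NOT stopped; it keeps running (or hanging) in the background.
pub fn run_with_timeout<T, F>(work: F, limit: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

This only detects the hang; it does not fix it. A thread stuck inside a native library cannot be cancelled from Rust, and if that library holds internal locks, later calls into it may block as well, so logging the timeout and exiting is probably the safest reaction.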

@geolehmann
Author

Yes, I tried reading only one element, but that did not work either. Here is a minimal example file containing only the compound dataset: https://drive.google.com/file/d/1CJeFNq84Z_lfThG1r75NsQKv2kv4Are8/view?usp=sharing

About the stack trace: since the program does not crash, I assume I need to capture one manually at some point. I captured one at the end of the read_raw function, because I have now observed that the freeze actually happens there, when it returns the result of the read_into_buf function. The relevant part of the trace looks like this:

}, {
     fn: "hdf5::hl::container::Reader::read_raw",
     file: "D:\dev\hdf5-rust-master\hdf5\src\hl\container.rs",
     line: 164
 }, {
     fn: "hdf5::hl::container::Reader::read",
     file: "D:\dev\hdf5-rust-master\hdf5\src\hl\container.rs",
     line: 140
 }, {
     fn: "hdf5::hl::container::Reader::read_1d",
     file: "D:\dev\hdf5-rust-master\hdf5\src\hl\container.rs",
     line: 173
 }, {
     fn: "hdf5::hl::container::Container::read_1d",
     file: "D:\dev\hdf5-rust-master\hdf5\src\hl\container.rs",
     line: 600
 }, {
     fn: "k4::loader_geoh5::load",
     file: ".\src\loader_geoh5.rs",
     line: 157
 }, {
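
For reference, a trace of this kind can be captured at any point in a running program without a crash or a panic. The following is a minimal sketch using the standard library's `std::backtrace` module; `capture_trace` and its `label` parameter are hypothetical names chosen for illustration, and the thread does not say how the trace above was actually produced.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

/// Capture a stack trace at the call site without panicking or crashing.
/// `force_capture` ignores the RUST_BACKTRACE environment variable.
pub fn capture_trace(label: &str) -> String {
    let trace = Backtrace::force_capture();
    match trace.status() {
        BacktraceStatus::Captured => format!("[{label}]\n{trace}"),
        // Unsupported platform: say so instead of returning an empty trace.
        _ => format!("[{label}] backtrace unavailable"),
    }
}
```

Calling something like `eprintln!("{}", capture_trace("before read"))` just before the suspect call shows how execution reached that point; function names and line numbers generally require a build with debug information.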

@mulimoen
Collaborator

It seems the strings are returned as null pointers, which causes issues (and should be fixed!). I think this specific issue can be fixed by renaming the members so that they match the names in the file, i.e.

#[derive(H5Type, Debug, Clone, PartialEq)]
#[repr(C)]
pub struct Index {
    #[hdf5(rename = "Start index")]
    pub start_index: u32,
    #[hdf5(rename = "Size")]
    pub size: u32,
    #[hdf5(rename = "Object ID")]
    pub object_ID: hdf5::types::VarLenUnicode,
    #[hdf5(rename = "Data ID")]
    pub data_ID: hdf5::types::VarLenUnicode,
}
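
The fix above works because the member names requested by the Rust type have to match the member names stored in the file exactly, including case and spaces. As a rough illustration, the check below compares the two sets of names before any read is attempted. `missing_members` is a hypothetical helper written for this sketch, not part of the hdf5 crate, and it assumes the on-disk names have already been obtained from the dataset's datatype, which the sketch does not do.

```rust
/// Return the requested member names that have no identically named
/// member in the file. The comparison is exact: case and spaces matter.
///
/// `expected`: names the Rust type asks for (after any `rename`).
/// `on_disk`:  member names of the compound type stored in the file
///             (assumed to be known already; not read here).
pub fn missing_members(expected: &[&str], on_disk: &[&str]) -> Vec<String> {
    expected
        .iter()
        .filter(|name| !on_disk.contains(*name))
        .map(|name| name.to_string())
        .collect()
}
```

With the names from this thread, the original field names `start_index`, `size`, `object_ID` and `data_ID` are all reported as missing against `Start index`, `Size`, `Object ID` and `Data ID`, while the renamed ones produce an empty list. A non-empty result could be turned into an error before reading instead of ending in a freeze.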

@geolehmann
Author

It works, a thousand thanks for your fast help! I was not aware of the rename helper attribute; I should have read the changelog.

@mulimoen
Collaborator

We should do something about the freeze and the segfault; reopening as a reminder.
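
As an illustration of what a guard against the null pointers could look like, the sketch below converts a C-style string pointer into an owned `String` and returns `None` for a null pointer instead of dereferencing it. It is not the crate's code. It assumes only that a variable-length string is represented in memory as a pointer to a NUL-terminated buffer, which is how HDF5 hands such strings to C callers, and `string_from_nullable` is a hypothetical name.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Convert a C-style, NUL-terminated string pointer into an owned `String`,
/// returning `None` for a null pointer instead of dereferencing it.
///
/// # Safety
/// If `ptr` is non-null, it must point to a valid NUL-terminated buffer.
pub unsafe fn string_from_nullable(ptr: *const c_char) -> Option<String> {
    if ptr.is_null() {
        return None;
    }
    // SAFETY: null was ruled out above; validity is the caller's obligation.
    let bytes = unsafe { CStr::from_ptr(ptr) }.to_bytes();
    // Invalid UTF-8 is replaced rather than treated as an error.
    Some(String::from_utf8_lossy(bytes).into_owned())
}
```

A check like this only covers null pointers. A pointer that is non-null but uninitialized would still be undefined behavior, so it complements validating the member names rather than replacing it.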

@mulimoen reopened this Jul 15, 2023

@kasparthommen

Has there been any work on this? I have the same issue, and because I'm loading an HDF5 file with an unknown schema, I can't use the rename trick proposed by @mulimoen.

@mulimoen
Collaborator

All new work is being done in the fork (see #295), but as far as I know nothing has been done on this issue. Feel free to open an issue in the forked repo with an MVP.

@kasparthommen

Just a note: after removing the lzf compression using h5py in Python (by copying the file's datasets one by one to a new file), I was able to read the updated file just fine using this library, with no more null pointers! This suggests that the bug may be related to lzf compression, or to HDF5 compression in general.

@kasparthommen

Another note on this: when I re-create the file using h5py and specify lzf compression again, this library crashes again, so the problem seems to be associated with lzf compression specifically. When I re-compress with gzip, everything works fine, which reinforces that conclusion.


3 participants