Making an anndata object to predict ATAC values at per-base or bin level #83

mtvector · 2024-12-23T08:47:21Z

Description of feature

I'm looking at your import_bigwigs function, and thinking about how to go beyond ATAC value summary statistics like 'mean', 'max', 'count', or 'logcount'. I assumed it would be as easy as adding the below to the _extract_values_from_bigwig i/o function:

    elif target == "raw":
        with pybigtools.open(bw_file, "r") as bw:
            values = np.vstack(
                [
                    np.array(
                        bw.values(chrom, int(start), int(end), missing=np.nan, exact=True)
                    )
                    for chrom, start, end in [
                        line.split("\t")[:3]
                        for line in open(temp_bed_file.name).readlines()
                    ]
                ]
            )

However on reflection, predicting at a per-base or bin level means the AnnData.X would need to be 3-dimensional, and I don't see a way to work around anndata's limitations there.

Am I missing something? Is there no reason to try to ATAC values at a per-base or per-bin level? Is there some other work-around you have in crested that I've missed?

Thanks for your input :)

Matthew

The text was updated successfully, but these errors were encountered:

LukasMahieu · 2024-12-24T14:02:30Z

Hey Matthew,

You're not missing anything, it's simply not implemented (yet). Currently CREsted only works with sequence to scalar predictions.

You're right, you run into problems with the anndata dimensionality but you can get around these in a couple of different ways (storing in adata layers instead, wrapping the anndata, using mudata, etc...). You'll notice quickly though that this is only the tip of the iceberg, since you'll also run into issues with the dataloaders, calculation of the contribution scores, etc further along the CREsted pipeline.

It's definitely useful though which is why we're actively working on this exact feature, but I can't say yet when we would be releasing this to the public.

mtvector · 2024-12-28T17:55:58Z

That's great, I did some thinking about how to achieve this in a seamless manner. My thought is to create a custom class which spoofs anndata into thinking it's a 2D matrix of the correct size, but when an element indexed, it retrieves row elements from a hdf5 backed matrix (might need anndata shadows or something to prevent adata.uns['tracks'] from loading into memory), effectively using lazy matrix as the third dimension of an on-disk tensor.

import numpy as np

class LazyMatrix(np.ndarray):
    """
    A lazy proxy matrix that dynamically retrieves data slices from a source matrix.

    LazyMatrix behaves like a 2D matrix, but the data for each "row" corresponds to slices
    from a specified source matrix, effectively treating it as the 3rd dimension of a 3D tensor.

    Parameters
    ----------
    source : object
        The data source to retrieve matrix rows from. Must support NumPy-style indexing.
    shape : tuple
        The shape of the proxy matrix. The first two dimensions correspond to the "virtual" 2D matrix,
        while the third dimension corresponds to the number of columns in the source matrix.

    Attributes
    ----------
    source : object
        The data source providing the matrix rows (e.g., HDF5 dataset, NumPy array).
    shape : tuple
        The shape of the virtual 3D tensor.

    Examples
    --------
    >>> import numpy as np
    >>> source = np.random.rand(5, 10)  # 5 rows, 10 features
    >>> lazy_matrix = LazyMatrix(source, shape=(5, 5, 10))  # Virtual 3D matrix
    >>> print(lazy_matrix[0, 1].shape)  # Shape (10,) - Row 1 of the source matrix
    >>> print(lazy_matrix[0:2, 1:3].shape)  # Shape (2, 2, 10) - Slices from the source matrix
    """
    def __new__(cls, source, shape=None):
        # Create a new LazyMatrix instance
        obj = super().__new__(cls, (), dtype=float)  # Pass an empty shape to defer initialization
        obj.source = source  # The data source (e.g., HDF5 dataset)
        obj._lazy_shape = shape if shape else (source.shape[0], source.shape[0], source.shape[1])
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.source = getattr(obj, 'source', None)
        self._lazy_shape = getattr(obj, '_lazy_shape', None)

    def __getitem__(self, index):
        """
        Index into the proxy matrix, retrieving slices from the source matrix.

        Returns a 3D tensor where the third dimension corresponds to rows of the source matrix.
        """
        # Normalize indices for 2D access
        if isinstance(index, tuple):
            row_indices, col_indices = index
        else:
            raise IndexError("LazyMatrix expects 2D indexing, e.g., lazy_matrix[i, j].")

        # Handle slicing for row and column indices
        if isinstance(row_indices, slice):
            row_indices = np.arange(*row_indices.indices(self._lazy_shape[0]))
        else:
            row_indices = np.atleast_1d(row_indices)

        if isinstance(col_indices, slice):
            col_indices = np.arange(*col_indices.indices(self._lazy_shape[1]))
        else:
            col_indices = np.atleast_1d(col_indices)

        # Create the result tensor
        result = np.empty((len(row_indices), len(col_indices), self.source.shape[1]))
        for i, row_idx in enumerate(row_indices):
            for j, col_idx in enumerate(col_indices):
                result[i, j, :] = self.source[row_idx]

        return result

    @property
    def shape(self):
        return self._lazy_shape

    def __repr__(self):
        return f"<LazyMatrix shape={self._lazy_shape}>"

Not sure if that is helpful, but thought I'd share because it seems like an interesting problem.

LukasMahieu · 2025-01-03T09:11:09Z

Nice, that's a quite elegant solution. I need to look at this a bit more in depth and compare to how we we're implementing it now. I'll get back to you soon on this, thanks for sharing!

mtvector added the enhancement New feature or request label Dec 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making an anndata object to predict ATAC values at per-base or bin level #83

Making an anndata object to predict ATAC values at per-base or bin level #83

mtvector commented Dec 23, 2024

LukasMahieu commented Dec 24, 2024

mtvector commented Dec 28, 2024 •

edited

Loading

LukasMahieu commented Jan 3, 2025

Making an anndata object to predict ATAC values at per-base or bin level #83

Making an anndata object to predict ATAC values at per-base or bin level #83

Comments

mtvector commented Dec 23, 2024

Description of feature

LukasMahieu commented Dec 24, 2024

mtvector commented Dec 28, 2024 • edited Loading

LukasMahieu commented Jan 3, 2025

mtvector commented Dec 28, 2024 •

edited

Loading