Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making an anndata object to predict ATAC values at per-base or bin level #83

Open
mtvector opened this issue Dec 23, 2024 · 3 comments
Open
Labels
enhancement New feature or request

Comments

@mtvector
Copy link

Description of feature

I'm looking at your import_bigwigs function, and thinking about how to go beyond ATAC value summary statistics like 'mean', 'max', 'count', or 'logcount'. I assumed it would be as easy as adding the below to the _extract_values_from_bigwig i/o function:

    elif target == "raw":
        with pybigtools.open(bw_file, "r") as bw:
            values = np.vstack(
                [
                    np.array(
                        bw.values(chrom, int(start), int(end), missing=np.nan, exact=True)
                    )
                    for chrom, start, end in [
                        line.split("\t")[:3]
                        for line in open(temp_bed_file.name).readlines()
                    ]
                ]
            )

However on reflection, predicting at a per-base or bin level means the AnnData.X would need to be 3-dimensional, and I don't see a way to work around anndata's limitations there.

Am I missing something? Is there no reason to try to ATAC values at a per-base or per-bin level? Is there some other work-around you have in crested that I've missed?

Thanks for your input :)

Matthew

@mtvector mtvector added the enhancement New feature or request label Dec 23, 2024
@LukasMahieu
Copy link
Collaborator

Hey Matthew,

You're not missing anything, it's simply not implemented (yet). Currently CREsted only works with sequence to scalar predictions.

You're right, you run into problems with the anndata dimensionality but you can get around these in a couple of different ways (storing in adata layers instead, wrapping the anndata, using mudata, etc...). You'll notice quickly though that this is only the tip of the iceberg, since you'll also run into issues with the dataloaders, calculation of the contribution scores, etc further along the CREsted pipeline.

It's definitely useful though which is why we're actively working on this exact feature, but I can't say yet when we would be releasing this to the public.

@mtvector
Copy link
Author

mtvector commented Dec 28, 2024

That's great, I did some thinking about how to achieve this in a seamless manner. My thought is to create a custom class which spoofs anndata into thinking it's a 2D matrix of the correct size, but when an element indexed, it retrieves row elements from a hdf5 backed matrix (might need anndata shadows or something to prevent adata.uns['tracks'] from loading into memory), effectively using lazy matrix as the third dimension of an on-disk tensor.

import numpy as np

class LazyMatrix(np.ndarray):
    """
    A lazy proxy matrix that dynamically retrieves data slices from a source matrix.

    LazyMatrix behaves like a 2D matrix, but the data for each "row" corresponds to slices
    from a specified source matrix, effectively treating it as the 3rd dimension of a 3D tensor.

    Parameters
    ----------
    source : object
        The data source to retrieve matrix rows from. Must support NumPy-style indexing.
    shape : tuple
        The shape of the proxy matrix. The first two dimensions correspond to the "virtual" 2D matrix,
        while the third dimension corresponds to the number of columns in the source matrix.

    Attributes
    ----------
    source : object
        The data source providing the matrix rows (e.g., HDF5 dataset, NumPy array).
    shape : tuple
        The shape of the virtual 3D tensor.

    Examples
    --------
    >>> import numpy as np
    >>> source = np.random.rand(5, 10)  # 5 rows, 10 features
    >>> lazy_matrix = LazyMatrix(source, shape=(5, 5, 10))  # Virtual 3D matrix
    >>> print(lazy_matrix[0, 1].shape)  # Shape (10,) - Row 1 of the source matrix
    >>> print(lazy_matrix[0:2, 1:3].shape)  # Shape (2, 2, 10) - Slices from the source matrix
    """
    def __new__(cls, source, shape=None):
        # Create a new LazyMatrix instance
        obj = super().__new__(cls, (), dtype=float)  # Pass an empty shape to defer initialization
        obj.source = source  # The data source (e.g., HDF5 dataset)
        obj._lazy_shape = shape if shape else (source.shape[0], source.shape[0], source.shape[1])
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.source = getattr(obj, 'source', None)
        self._lazy_shape = getattr(obj, '_lazy_shape', None)

    def __getitem__(self, index):
        """
        Index into the proxy matrix, retrieving slices from the source matrix.

        Returns a 3D tensor where the third dimension corresponds to rows of the source matrix.
        """
        # Normalize indices for 2D access
        if isinstance(index, tuple):
            row_indices, col_indices = index
        else:
            raise IndexError("LazyMatrix expects 2D indexing, e.g., lazy_matrix[i, j].")

        # Handle slicing for row and column indices
        if isinstance(row_indices, slice):
            row_indices = np.arange(*row_indices.indices(self._lazy_shape[0]))
        else:
            row_indices = np.atleast_1d(row_indices)

        if isinstance(col_indices, slice):
            col_indices = np.arange(*col_indices.indices(self._lazy_shape[1]))
        else:
            col_indices = np.atleast_1d(col_indices)

        # Create the result tensor
        result = np.empty((len(row_indices), len(col_indices), self.source.shape[1]))
        for i, row_idx in enumerate(row_indices):
            for j, col_idx in enumerate(col_indices):
                result[i, j, :] = self.source[row_idx]

        return result

    @property
    def shape(self):
        return self._lazy_shape

    def __repr__(self):
        return f"<LazyMatrix shape={self._lazy_shape}>"

Not sure if that is helpful, but thought I'd share because it seems like an interesting problem.

@LukasMahieu
Copy link
Collaborator

Nice, that's a quite elegant solution. I need to look at this a bit more in depth and compare to how we we're implementing it now. I'll get back to you soon on this, thanks for sharing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants