Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRISPR-Cas9 distance correction solver #211

Open
wants to merge 17 commits into
base: master
Choose a base branch
from
Open

Conversation

sprillo
Copy link
Collaborator

@sprillo sprillo commented Aug 11, 2023

Implement distance correction scheme for the CRISPR-Cas9 model.

The class implementing this method is CRISPRCas9DistanceCorrectionSolver in the cassiopeia/solver/distance_correction/_crispr_cas9_distance_correction_solver.py module. (The distance_correction subpackage is meant to contain any distance correction methods that might be implemented in the future, possibly for models other than CRISPR-Cas9.)

The solver composes together four steps: (1) mutation proportion estimation, (2) collision probability estimation, (3) distance correction with the estimated mutation proportion and collision probability, and (4) tree topology reconstruction using the corrected distances. The solver is parameterized by these four steps. Note: Due to Numba compilation issues in the DistanceSolver, the function performing the third step is not injected into the solver, but rather determined by using a string identifier specifying the function name.

In the code, I declared some types to improve readability at select places. Underscores are used to denote functions or classes that are internal and are not exposed to the users, mostly to improve legibility.

Tests should be quite comprehensive. However, some tests are marked as slow since they require simulation (they take ~30s on my machine). To run the slow tests, use the --runslow flag, e.g.:python -m pytest test/solver_tests/distance_correction_tests --runslow. (In particular, CodeCov complains but coverage with the slow tests is good.)

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 63.07% and project coverage change: -0.29% ⚠️

Comparison is base (f895301) 79.05% compared to head (cd53086) 78.76%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #211      +/-   ##
==========================================
- Coverage   79.05%   78.76%   -0.29%     
==========================================
  Files          85       87       +2     
  Lines        7080     7210     +130     
==========================================
+ Hits         5597     5679      +82     
- Misses       1483     1531      +48     
Files Changed Coverage Δ
...rection/_crispr_cas9_distance_correction_solver.py 62.50% <62.50%> (ø)
cassiopeia/solver/__init__.py 100.00% <100.00%> (ø)
cassiopeia/solver/distance_correction/__init__.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sprillo sprillo self-assigned this Aug 14, 2023
@sprillo sprillo added the enhancement New feature or request label Aug 14, 2023
@sprillo sprillo changed the title [WIP, do not review] Distance correction CRISPR-Cas9 distance correction solver Aug 14, 2023
@sprillo sprillo requested a review from mattjones315 August 14, 2023 18:31
Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sprillo for this fantastic PR! I commend you for lots of great work and appreciate your efforts as always to provide a clear and detailed overview of your code.

I left a series of small-ish comments that I think should be resolved before merging this in, mostly around nit-picky line lengths and some more substantial comments regarding the estimation of the collision probabilities.

More importantly, I am also surprised by the strategy you employed to implement this functionality. I always thought that we would implement this functionality by operating on the character matrix of a tree and producing a corrected dissimilarity_map, not unlike the CassiopeiaSolver.compute_dissimilarity_map functionality. Do you think it's potentially overkill that we have embedded the usage of this functionality to only be pertitent to the CRISPRCas9DistanceCorrectionSolver. In my eyes, it would be ideal to be able to get transformed distances for all sorts of tasks, solving lineages being one of them. Does that make sense?

Always happy to be convinced of this design if it's necessary and thanks in advance for your thoughts.

Invert an increasing function.

Finds x s.t. f(x) = y. Search for x is limited to the range [lower, upper].
Uses 30 fixed iterations of binary search.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not make this a parameter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I made it a parameter num_binary_search_iters and exposed it in the public functions that use it (crispr_cas9_corrected_hamming_distance and crispr_cas9_corrected_ternary_hamming_distance). As a note, 30 iterations gets an accuracy of 2**-31 = 10^-9. I don't think we'll want to set it any higher. The only reason to set it lower would be to speed things up. We would need to do some benchmarking to figure out if it's worth it. The contribution of the inversion procedure to the solver is O(n^2 num_binary_search_iters). This is similar to the cost of computing the dissimilarity map, which is O(n^2 k), and that one needed numba-izing. We could also explore other option to speed up the _inverse function, e.g. using some python library.

return (upper + lower) / 2.0


def _hamming_distance(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this function might fit better under cas.solver.dissimilarity_functions. We already have a hamming_distance function that skips missing data, but does not scale the final distance. It might be worthwhile to just update that function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although there is some duplication, I think in terms of readability it is better to have the simple versions of Hamming distance and ternary Hamming distance here in the same file as the solver. For example, weighted_hamming_distance from cas.solver.dissimilarity_functions does more than just the ternary case, allowing for aribtrary weights, which are not needed in this case; the hamming_distance in cas.solver.dissimilarity_functions has ignore_missing_state which is not used (and does not normalize). I agree we could extend the hamming_distance in cas.solver.dissimilarity_functions to normalize, but again I think it hampers readability. For such small components that are private to the module, I don't think there's really an issue in code duplication if it improves readability.

Of course, if in the future we want to use cas.solver.dissimilarity_functions instead, we can do it easily since this falls into the implementation detail category.

Here, by `scaled` we mean that the Hamming distance is divided by the total
number of characters considered, thus bringing it into the range [0, 2].

Here, `ternary` means that we score two mutated states that are different
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird/confusing that we're calling this concept ternary now (even if it is correct). Elsewhere we call this weighted_hamming_distance or modified_hamming_distance. Do you think it's confusing that here we call it "ternary"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ternary Hamming distance is a particular case of a weighted Hamming distance with specific weights, no? I know that by default the weighted Hamming distance of cas.solver.dissimilarity_functions will do a ternary weighting scheme, but I don't think we want to speak of "weighted Hamming distance" in the context of the "CRISPRCas9DistanceCorrectionSolver". It's like using the word "polygons" to talk about "triangles". Am I missing something?


The "corrected" distance is an estimate of true tree distance.

It is assumed that the tree is ultrametric of depth 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep "ternary" as the name, I would suggest defining it here too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment on lines +341 to +345
The estimated collision probability is:
q = sum q_i^2
where q_i is the estimated collision probability for state i using
#alleles=i / #alleles>0 (i.e. the number of times state i appears
in the character matrix, divided by the total number of mutations).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see below that this is just the fallback in case that a user does not provide a collision probability, and that the user has the opportunity to define their own collision probability estimator, but I can't help but feel we can provide more support in case the state probabilities are passed into the solver. Why don't we add an extra optional argument called priors that will allow us to compute the more traditional collision probability for the user? And only if priors=None will we perform this naive estimation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, if the user knows the collision probability they can just inject it into the CRISPRCas9DistanceCorrectionSolver using the crispr_cas9_hardcoded_collision_probability_estimator, as follows:

collision_probability_computed_from_priors = sum([p ** 2 for p in priors])
solver = CRISPRCas9DistanceCorrectionSolver(
    collision_probability_estimator=partial(
        crispr_cas9_hardcoded_collision_probability_estimator,
        collision_probability=collision_probability_computed_from_priors,
    )
)

This is exactly how I'm doing it for the experiments on the KP data, and I think it keeps the implementation of CRISPRCas9DistanceCorrectionSolver straightforward and clear. I am not a fan of conditional logic, but if you are thinking of something like this (where we use the tree's priors if they exist), I am happy to add it:

class CRISPRCas9DistanceCorrectionSolver(CassiopeiaSolver):
    def __init__(
        ...,
        use_priors_to_compute_collision_probability: bool = True,
    ):
        ...
        self._use_priors_to_compute_collision_probability = use_priors_to_compute_collision_probability

    def solve(...):
        if self._use_priors_to_compute_collision_probability and tree.priors is not None:
            collision_probability = sum([p ** 2 for p in priors])
        else:
            collision_probability = collision_probability_estimator(...)

I guess it would do more handholding for the user, at the expense of readability. Another option is to make the CollisionProbabilityEstimatorType a function of the CassiopeiaTree instead of a function of the character matrix, and then default it to the estimator that uses the priors if they are available in the tree and otherwise the current default. In this case, I am not a fan of taking a complex CassiopeiaTree object as an input because it makes it confusing what parts of the tree are being used to do the computation. Indeed, the CassiopeiaTree essentially acts like a huge data clump, and therefore saying that a function takes in CassiopeiaTree doesn't clarify what is being used unless you carefully read the docstring (in this case, the tree.priors would be used). I generally prefer functions that depend on simpler objects. They also require less context to be tested and are more reusable.

Comment on lines 504 to 507
mutation_proportion_estimator: MutationProportionEstimatorType = crispr_cas9_default_mutation_proportion_estimator, # noqa
collision_probability_estimator: CollisionProbabilityEstimatorType = crispr_cas9_default_collision_probability_estimator, # noqa
distance_corrector_name: str = "crispr_cas9_corrected_ternary_hamming_distance", # noqa
distance_solver: DistanceSolver = NeighborJoiningSolver(add_root=True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these lines can be formatted to be <80 characters long I believe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had been formatting my code with isort code.py && black --line-length 80 code.py, and if I added line breaks to these arguments black formated them back to one lone line each. But now I flanked the code with "# fmt: off" and "# fmt: on" so that black won't do this. Now all lines should be <= 80, with the caveat that "# fmt: off" and "# fmt: on" are there.

@sprillo
Copy link
Collaborator Author

sprillo commented Aug 17, 2023

More importantly, I am also surprised by the strategy you employed to implement this functionality. I always thought that we would implement this functionality by operating on the character matrix of a tree and producing a corrected dissimilarity_map, not unlike the CassiopeiaSolver.compute_dissimilarity_map functionality. Do you think it's potentially overkill that we have embedded the usage of this functionality to only be pertitent to the CRISPRCas9DistanceCorrectionSolver. In my eyes, it would be ideal to be able to get transformed distances for all sorts of tasks, solving lineages being one of them. Does that make sense?

@mattjones315 I think CassiopeiaSolver.compute_dissimilarity_map does not exist, do you mean cassiopeia.data.utilities.compute_dissimilarity_map? If you give me an example of the API and code you envision users to be writing, I can think about it. By the way, the more helpful distances would be those obtained from performing branch length estimation in the next step, since it essentially refines the corrected distances even further by using higher-order information rather than just pairs of leaves, and also uses MLE instead of moment-matching.

@mattjones315
Copy link
Collaborator

Hi @sprillo -- apologies for the delay getting back to this.

Also apologies for the miscommunication, I had meant CassiopeiaTree.compute_dissimilarity_map method, but this is functionally equivalent to the function you pointed out (at least for the purposes of this conversation).

As for the interface that I had imagined, I could see us using compositional programming within the compute_dissimilarity_map method. The pseudocode would look like this:

def compute_dissimilarity_map(character_matrix, tree, correction="none", ...):

    if correction == "none":
       dissimilarity_map = cas.data.utilities.compute_dissimilarity_map(character_matrix, tree, ...)
    elif correction == "cas9":
        dissimilarity_map = cas.data.distance_correction.crispr_cas9_distance_correction(character_matrix, tree, ...)
    else:
        raise Exception("correction method not found.")
    return dissimilarity_map

In fact, currently the compute_dissimilarity_map implemented in CassiopieaTree contains a function argument that's used for computing dissimilarities between character matrix entries; one could imagine just injecting the crispr_cas9_distance_function into the function that way.

So, while I'm not convinced either of the two approaches above are categorically better / simpler than your proposed implementation, I wonder if you might be able to convince me that my two solutions are non-optimal.

Thanks in advance for the discussion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants