cellxgene-schema CLI must add validation for obs['genetic_ancestry_*'] #1114

Open
brianraymor opened this issue Nov 18, 2024 · 4 comments

Comments

@brianraymor
Contributor

brianraymor commented Nov 18, 2024

Changelog

  • obs (Cell metadata)
    • Added genetic_ancestry_African
    • Added genetic_ancestry_East_Asian
    • Added genetic_ancestry_European
    • Added genetic_ancestry_Indigenous_American
    • Added genetic_ancestry_Oceanian
    • Added genetic_ancestry_South_Asian

Design

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then for each observation, either all values of the following fields MUST be float("nan") or the sum of their values MUST be 1.0:

  • genetic_ancestry_African
  • genetic_ancestry_East_Asian
  • genetic_ancestry_European
  • genetic_ancestry_Indigenous_American
  • genetic_ancestry_Oceanian
  • genetic_ancestry_South_Asian
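
A minimal sketch of how this row-level rule could be checked with pandas. The function name `invalid_ancestry_rows` and the tolerance `atol` are illustrative assumptions, not part of the schema or of the cellxgene-schema CLI; the schema says the sum MUST be 1.0, and the tolerance is only there because floating-point sums are rarely exact. The column is assumed to be spelled organism_ontology_term_id.

```python
import numpy as np
import pandas as pd

ANCESTRY_FIELDS = [
    "genetic_ancestry_African",
    "genetic_ancestry_East_Asian",
    "genetic_ancestry_European",
    "genetic_ancestry_Indigenous_American",
    "genetic_ancestry_Oceanian",
    "genetic_ancestry_South_Asian",
]


def invalid_ancestry_rows(obs: pd.DataFrame, atol: float = 1e-6) -> pd.Series:
    """Flag human observations that are neither all-NaN nor summing to 1.0.

    `atol` is an assumed tolerance; it is not defined by the schema.
    """
    values = obs[ANCESTRY_FIELDS].astype("float64")
    is_human = obs["organism_ontology_term_id"] == "NCBITaxon:9606"
    all_nan = values.isna().all(axis=1)
    # skipna=False: a row mixing NaN with numbers sums to NaN and so fails.
    row_sums = values.sum(axis=1, skipna=False)
    sums_to_one = pd.Series(
        np.isclose(row_sums, 1.0, rtol=0.0, atol=atol), index=obs.index
    )
    return is_human & ~(all_nan | sums_to_one)
```

Rows for other organisms are never flagged here, because the sum rule only applies when the organism is Homo sapiens.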

genetic_ancestry_African

Key genetic_ancestry_African
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0010" for African, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.
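
The per-field value rules could be sketched roughly as follows. `check_ancestry_column` and its error messages are hypothetical and do not reflect how the CLI reports errors; the column is assumed to be spelled organism_ontology_term_id. The same function would apply unchanged to each of the six genetic_ancestry_* fields.

```python
import pandas as pd


def check_ancestry_column(obs: pd.DataFrame, field: str) -> list[str]:
    """Return human-readable problems found in one genetic_ancestry_* column."""
    column = obs[field]
    if not pd.api.types.is_float_dtype(column):
        # Without a float dtype the remaining checks are not meaningful.
        return [f"{field} must have a float dtype, found {column.dtype}"]

    errors = []
    is_human = obs["organism_ontology_term_id"] == "NCBITaxon:9606"

    # Non-human observations MUST hold NaN.
    if column[~is_human].notna().any():
        errors.append(f"{field} must be NaN for non-human observations")

    # Human observations are either NaN (unavailable) or within [0.0, 1.0].
    human_values = column[is_human].dropna()
    if ((human_values < 0.0) | (human_values > 1.0)).any():
        errors.append(f"{field} must be between 0.0 and 1.0 inclusive")

    return errors
```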

genetic_ancestry_East_Asian

Key genetic_ancestry_East_Asian
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0009" for East Asian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.

genetic_ancestry_European

Key genetic_ancestry_European
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0005" for European, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.

genetic_ancestry_Indigenous_American

Key genetic_ancestry_Indigenous_American
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0013" for Indigenous American, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.

genetic_ancestry_Oceanian

Key genetic_ancestry_Oceanian
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0017" for Oceanian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.

genetic_ancestry_South_Asian

Key genetic_ancestry_South_Asian
Annotator Curator MUST annotate.
Value float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0006" for South Asian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.
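
For reference, the six fields and their HANCESTRO terms from the definitions above can be collected in one mapping, and the "same donor_id, same value" requirement checked per field. This is only a sketch: `ANCESTRY_TERMS` and `inconsistent_donors` are illustrative names, not identifiers from the CLI.

```python
import pandas as pd

# Field name -> HANCESTRO term, as listed in the field definitions above.
ANCESTRY_TERMS = {
    "genetic_ancestry_African": "HANCESTRO:0010",
    "genetic_ancestry_East_Asian": "HANCESTRO:0009",
    "genetic_ancestry_European": "HANCESTRO:0005",
    "genetic_ancestry_Indigenous_American": "HANCESTRO:0013",
    "genetic_ancestry_Oceanian": "HANCESTRO:0017",
    "genetic_ancestry_South_Asian": "HANCESTRO:0006",
}


def inconsistent_donors(obs: pd.DataFrame) -> dict[str, list[str]]:
    """Map each ancestry field to the donor_ids that carry more than one value."""
    problems = {}
    for field in ANCESTRY_TERMS:
        # dropna=False counts NaN as a value, so a donor that mixes NaN
        # with a number is flagged, while an all-NaN donor is not.
        distinct = obs.groupby("donor_id", observed=True)[field].nunique(dropna=False)
        donors = distinct[distinct > 1].index.tolist()
        if donors:
            problems[field] = donors
    return problems
```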

@brianraymor brianraymor added curation software 5.3 Next minor CELLxGENE schema version after 5.2 labels Nov 18, 2024
@joyceyan joyceyan self-assigned this Nov 21, 2024
@joyceyan
Contributor

joyceyan commented Nov 26, 2024

@brianraymor AnnData doesn't seem to support multiple data types in a single column. What do you think of changing the schema so that when the organism is not Homo sapiens, we require that the value is float('nan') instead of the string "na"?
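
A small pandas illustration of the underlying dtype behavior (a sketch only; it shows the column dtypes, not AnnData itself):

```python
import pandas as pd

# Mixing floats with the string "na" forces pandas to fall back to object dtype.
mixed = pd.Series([0.25, "na"])

# Using float("nan") as the placeholder keeps the column a plain float64.
uniform = pd.Series([0.25, float("nan")])

print(mixed.dtype)    # object
print(uniform.dtype)  # float64
```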

@brianraymor
Contributor Author

brianraymor commented Dec 2, 2024

Noted: there is no support for mixed column types. Confirmed with @ivirshup.

@brianraymor
Contributor Author

@joyceyan - I updated the schema (and the top-level summary comment) with your solution. Apologies for missing the AnnData issue with pandas mixed data types.

CC: @jahilton

@joyceyan
Contributor

joyceyan commented Dec 5, 2024

Thanks @brianraymor !
