Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tutorial on relationship matrices #1072

Merged
merged 4 commits into from
Jun 27, 2023

Conversation

timothymillar
Copy link
Collaborator

It's still a bit rough, but I think the key parts are there. This covers working with pedigree data, calculating relationship matrices from pedigree and molecular data, and combining them into a hybrid relationship matrix. It also showcases the flexibility of having sgkit based on xarray.

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great addition.

I read through the text and found a few minor typos:

It is also common to use a 0 to indicate an unknown parent pedigree files.

Change to "an unknown parent in pedigree files".

to ensure they are note mistaken

"not"

dimensions named "samples" and "parent"

"parents"

Using out pedigree dataset

"our"

@timothymillar
Copy link
Collaborator Author

Thanks!

@timothymillar timothymillar force-pushed the 956-relatedness-tutorial branch from 70d45c6 to 02a87ef Compare April 27, 2023 01:42
@timothymillar timothymillar force-pushed the 956-relatedness-tutorial branch from 02a87ef to 11b09a4 Compare May 30, 2023 10:03
@timothymillar timothymillar changed the title [WIP] Add tutorial on relationship matrices Add tutorial on relationship matrices May 31, 2023
@timothymillar
Copy link
Collaborator Author

I think this is getting close now, but CI is failing due to a timeout, possibly the data download? /home/runner/work/sgkit/sgkit/docs/examples/relatedness_tutorial.ipynb: WARNING: Executing notebook failed: CellTimeoutError [mystnb.exec].

I'm also having some issue with building locally as the equations in markdown cells are not being formatted correctly e.g.:
Capture

@tomwhite
Copy link
Collaborator

I think this is getting close now, but CI is failing due to a timeout, possibly the data download?

Where is the data being downloaded from and how big is it? In the GWAS tutorial we download data from Google Cloud - perhaps we could do the same here?

@timothymillar
Copy link
Collaborator Author

@tomwhite I was pulling the data from PMC supplementary material (not ideal). I can cut the data down to a pair of files as 1.1MB and 0.3MB. Should we upload these to the same bucket as the GWAS example?

@tomwhite
Copy link
Collaborator

@tomwhite I was pulling the data from PMC supplementary material (not ideal). I can cut the data down to a pair of files as 1.1MB and 0.3MB. Should we upload these to the same bucket as the GWAS example?

That sounds like a good idea. I just checked and I don't see the GCS bucket in my GCP account, which is odd. @hammer did you create the gs://sgkit-gwas-tutorial bucket for #463?

@hammer
Copy link
Contributor

hammer commented Jun 13, 2023

Maybe? Let me check!

@hammer
Copy link
Contributor

hammer commented Jun 13, 2023

I do not see that bucket on my personal or Hammer Lab account. I may have created it with my Related Sciences account but I can no longer access that. @eric-czech do you happen to see that bucket on any RS-owned accounts?

@eric-czech
Copy link
Collaborator

did you create the gs://sgkit-gwas-tutorial bucket for https://github.com/pystatgen/sgkit/issues/463?

Might have been me too but either way, I'd rather use the project and bucket we have specifically for sgkit. I added you all (@tomwhite, @jeromekelleher, @hammer, @timothymillar) as storage object admins for https://console.cloud.google.com/storage/browser/sgkit-data. There is a validation directory in there used by this workflow so I think a tutorial directory at the top level would make sense. @hammer and @tomwhite I also added you as owners on the GCP project so I believe you should be able to add/remove people from GCS IAM roles in the future.

@timothymillar can you upload this to gs://sgkit-data/tutorial/<whatever_you_like>? Let me know if you get some kind of permissions error.

@timothymillar
Copy link
Collaborator Author

timothymillar commented Jun 13, 2023

Thanks @eric-czech, I was able to create the tutorial dir through the GUI but failing to upload data with a generic "Error".

Edit: Ahh, it's because the whole bucket is set up as "requester pays" which I think means it won't work for tutorials anyway.

@eric-czech
Copy link
Collaborator

Edit: Ahh, it's because the whole bucket is set up as "requester pays" which I think means it won't work for tutorials anyway

Ok, I disabled that. Try again? Everybody I mentioned above should have full control over the bucket now.

@timothymillar timothymillar force-pushed the 956-relatedness-tutorial branch from 11b09a4 to 9307c98 Compare June 14, 2023 22:44
@timothymillar
Copy link
Collaborator Author

Thanks, I've uploaded the data now. The only remaining issue is that the bucket grants read access to "allAuthenticatedUsers" rather than "allUsers":

One or more bucket-level permissions grant access to everyone on the internet (allUsers) or anyone signed into a Google account (allAuthenticatedUsers).

This means a user can't just grab the data with the URL only. I should be able to make this change but thought I should check with whoever is paying the bills! This would also apply to the validation data which is a bit more substantial. This also outdates some of the documentation

@eric-czech
Copy link
Collaborator

The only remaining issue is that the bucket grants read access to "allAuthenticatedUsers" rather than "allUsers"

Ah, didn't realize that. It's "allUsers" now so there should be no more restrictions.

This also outdates some of the documentation

Good catch! That last line could be removed and it would be better if that path to the file was https://storage.googleapis.com/sgkit-data/validation/hapmap_JPT_CHB_r23a_filtered.zip instead.

@timothymillar timothymillar force-pushed the 956-relatedness-tutorial branch 2 times, most recently from c4d8d65 to ee3687d Compare June 15, 2023 01:30
@timothymillar timothymillar force-pushed the 956-relatedness-tutorial branch from ee3687d to d168264 Compare June 15, 2023 01:41
@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2023

Codecov Report

Merging #1072 (ac50e1c) into main (c1f730c) will not change coverage.
The diff coverage is n/a.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff            @@
##              main     #1072   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           49        50    +1     
  Lines         4990      5023   +33     
=========================================
+ Hits          4990      5023   +33     

see 4 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@timothymillar
Copy link
Collaborator Author

I think this is ready to go unless there are any suggestions? I imagine it'll get updated as relevant features are added.

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Example notebook exercising polyploidy and pedigree work
5 participants