Replicating Hail's GWAS Tutorial #463
Preamble

import numpy as np
import pandas as pd
import xarray as xr
import sgkit as sg

VCF_FILE = "https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz"
TXT_FILE = "https://storage.googleapis.com/hail-tutorial/1kg_annotations.txt"

Importing data from VCF

Hail:

hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')

sgkit:

sg.io.vcf.vcf_to_zarr(VCF_FILE, "1kg.zarr")
ds = xr.open_zarr("1kg.zarr")

Note that sometimes I see an error when using

Getting to know our data
It's not so easy for us to build this table. Perhaps we should consider making a data frame out of the

Here's an ugly thing
We don't seem to be picking up contig metadata such as the genome build (i.e. GRCh37), and we confusingly zero-index our contigs so that

Also why are we padding
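As an aside: a minimal sketch of mapping the zero-based contig codes back to contig names, assuming the contig list is available in ds.attrs["contigs"] (an assumption about this sgkit version; variant_contig_name is an illustrative variable name):

import numpy as np

# Assumption: ds.attrs["contigs"] holds contig names in code order.
contig_names = np.asarray(ds.attrs["contigs"])
ds["variant_contig_name"] = ("variants", contig_names[ds.variant_contig.values])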
Another one that's not so easy for us. I tried

Adding column fields

I really felt the advantages of Hail:

table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
mt = mt.annotate_cols(pheno = table[mt.s])

sgkit:
df = pd.read_csv(TXT_FILE, sep="\t", index_col="Sample")
ds_vcf = ds.swap_dims({"samples":"sample_id"})
ds_txt = pd.DataFrame.to_xarray(df).rename({"Sample":"sample_id"})
ds_merged = ds_vcf.merge(ds_txt, join="left")

Query functions and the Hail Expression Language

Another section in which

We of course can't plot the DP histogram because we don't have DP values! When we do have those values, though, I guess we can use xarray.plot.hist or da.histogram, though maybe we should encourage use of plotting libraries directly?

Hail:

table.aggregate(hl.agg.counter(table.SuperPopulation))
table.aggregate(hl.agg.stats(table.CaffeineConsumption))
mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))
mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption))
snp_counts = mt.aggregate_rows(hl.agg.counter(hl.Struct(ref=mt.alleles[0], alt=mt.alleles[1])))
p = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
show(p)

sgkit:
ds_txt.SuperPopulation.to_series().value_counts()
ds_txt.CaffeineConsumption.to_series().describe()
ds_merged.SuperPopulation.to_series().value_counts()
ds_merged.CaffeineConsumption.to_series().describe()
from collections import namedtuple
SNP = namedtuple('SNP', ['ref', 'alt'])
snp_counts = [SNP(a[0], a[1]) for a in ds_merged.variant_allele.values]

Quality Control
|
The

For the time being, I would suggest you use

It might be better to use the latest code from GitHub rather than the 0.1.0a1 release, so you also get recent fixes like #465:

pip install git+https://github.com/pystatgen/sgkit#egg=sgkit

BTW you should use |
We don't know the maximum number of alternate alleles ahead of parsing the whole VCF, so we have a fixed size that accommodates the number of alleles that we expect. This is the approach that scikit-allel takes. I've opened #470 to expose |
You can set an index to show more variant information when displaying genotypes, like this:
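A minimal sketch of what that might look like, assuming xarray's set_coords/set_index and the standard sgkit variable names (not necessarily the exact snippet from this comment):

# Promote the variant fields to coordinates, then build a MultiIndex on the
# variants dimension so displays are labelled by contig and position.
ds_ix = ds.set_coords(["variant_contig", "variant_position"]).set_index(
    {"variants": ["variant_contig", "variant_position"]}
)
ds_ix.call_genotype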
There's some discussion about the trade-offs/efficiency of setting indexes like this here: https://github.com/pystatgen/sgkit/pull/58#issuecomment-669907297 |
Regarding contigs, I agree that the

It would be useful to add length and assembly information about contigs to the dataset. We could change the

This would however be incompatible with the current

BTW to implement this we'd use code like this to get the contig information from cyvcf2:
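A minimal sketch of what that might look like with cyvcf2 (illustrative, not necessarily the snippet the comment had in mind):

from cyvcf2 import VCF

# Contig names and lengths come from the VCF header's contig lines.
vcf = VCF("data/1kg.vcf.bgz")
contig_lengths = dict(zip(vcf.seqnames, vcf.seqlens))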
|
Here's another way of displaying the first 5 variants:
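For example, something along these lines (a sketch using standard xarray operations and the usual sgkit variable names):

# View the first 5 variants as a pandas DataFrame.
(
    ds[["variant_contig", "variant_position", "variant_id", "variant_allele"]]
    .isel(variants=slice(5))
    .to_dataframe()
)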
I agree it should be easier than this! |
With #471, it's possible to create a histogram of DP values:

vcf_to_zarr(VCF_FILE, "1kg.zarr", format_fields=["DP"])
ds = sg.load_dataset("1kg.zarr")
dp = ds.call_DP.where(ds.call_DP >= 0) # filter out missing
xr.plot.hist(dp, range=(0,30), bins=30) |
I've got a first draft of the Hail tutorial in sgkit here (branch). (For comparison, the rendered Hail version is here.) Overall, it's possible to replicate this GWAS workflow in sgkit. There are a number of outstanding issues/improvements that we could make:
It's not quite ready for a PR, but I think it would be useful to get some general feedback at this point. I've also opened a few related issues that we should fix: #506, #507, #509 |
I agree we should keep plotting stuff out of sgkit - it's a separate thing, and we could do a whole other package on it. |
I've had a quick look, and it looks great! I agree we can make a few things more fluid, but overall the functionality is there. Very exciting! |
I'm not sure I agree with this assertion, but this discussion is out of scope for this issue. To be addressed later! |
I had a few minor suggestions on the notebook after taking a closer look.

This could be a bit clearer:

df_vg = df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: tuple(x)})
df_vg.variant_allele.value_counts()
# To:
df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"])["variant_allele"].apply(tuple).value_counts()

Unnecessary output:

xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black")
(array([1.64000e+02, 1.09998e+05, 1.88947e+05, 2.60459e+05, 3.10216e+05,
        3.30876e+05, 3.24341e+05, 3.00648e+05, 2.60992e+05, 2.19818e+05,
# To
xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black");  # trailing semicolon suppresses non-graphical output

Perhaps these parts could be more pipe-y?

ad1 = ds.call_AD.sel(dict(alleles=1))  # filter out missing
ad1 = ad1.where(ad1 >= 0)
adsum = ds.call_AD.sum(dim="alleles")
adsum = adsum.where(adsum != 0)  # avoid divide by zero
ab = ad1 / adsum
# To:
# fill rows with nan where no alternate alleles were read or where sum of reads is 0
ad1 = ds.call_AD.sel(dict(alleles=1)).pipe(lambda v: v.where(v >= 0))
adsum = ds.call_AD.sum(dim="alleles").pipe(lambda v: v.where(v != 0))
# compute alternate allele read fraction
ab = ad1 / adsum

If I had to guess where the discrepancy with Hail is coming from, this could be it if there are any partial calls (i.e. only one allele missing):

het = GT[..., 0] != GT[..., 1]

Do you know if that ever happens?

A couple of other thoughts:
|
Thanks @eric-czech for all the suggestions! I have included them in the latest notebook.
I think it does, but I need to dig in again to see if the counts are the same between the two.
It would be good to have some kind of convenience function for that. For the moment it's probably OK in the notebook, but if we find it's a common pattern that would be the time to add a function in the library for it. |
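As an illustrative aside (not code from the thread), such a check might look like the following, assuming sgkit's call_genotype convention of -1 for missing alleles:

# Count a call as heterozygous only when both alleles are present and differ,
# so partial calls such as (0, -1) are excluded.
GT = ds.call_genotype
fully_called = (GT >= 0).all(dim="ploidy")
het = (GT[..., 0] != GT[..., 1]) & fully_called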
Now that pydata/xarray#5126 is in, when the next version of Xarray is released (0.17.1), we can put the following at the top of the notebook to get the effect we want:

xr.set_options(display_expand_attrs=False, display_expand_data_vars=True) |
Xarray 0.18.0 has been released with the expand/collapse change, so I updated the notebook here: https://nbviewer.jupyter.org/github/tomwhite/sgkit/blob/a6d850de9056f80a0f82ab72a4de158a630920f6/docs/examples/gwas_tutorial.ipynb |
@tomwhite you recommended

to pick up the latest commits. When I install this way, I get an import error.

I've tried a few different ways to resolve these import errors and a manual installation of

Also, can you comment on the best way to go from VCF to Zarr today? Your notebook has commented out the line
Is there a cleaner way to do it? Thanks! |
A couple other notes from working through the latest version of the notebook with the latest version of the code.
|
After the release it should be possible to do
That's what you need - I commented it out to avoid converting it over and over again when developing the notebook. It can be safely uncommented. |
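For reference, the convert-then-load pattern looks like this (mirroring commands used earlier in the thread; the format_fields list here is illustrative):

from sgkit.io.vcf import vcf_to_zarr
import sgkit as sg

# Convert once (slow), then load the Zarr store on subsequent runs.
vcf_to_zarr(VCF_FILE, "1kg.zarr", format_fields=["DP", "AD"])
ds = sg.load_dataset("1kg.zarr")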
@tomwhite okay good deal, it's really unfortunate there's no way to install from git and pick up the dependencies, and it's also unfortunate that we have such a clunky command to read from VCF to Zarr, but we can address those issues at another time! Through some miracle of git I managed to grab the |
I decided to work on https://github.com/pystatgen/sgkit/issues/88 by trying to implement the Hail GWAS Tutorial with sgkit. I'll update this issue with my experiences. I barely know what I'm doing, so this should be fun.