Replicating Hail's GWAS Tutorial #463

Closed · hammer opened this issue Feb 10, 2021 · 19 comments · Fixed by #629

Labels: documentation (Improvements or additions to documentation)

@hammer (Contributor) commented Feb 10, 2021

I decided to work on https://github.com/pystatgen/sgkit/issues/88 by trying to implement the Hail GWAS Tutorial with sgkit.

I'll update this issue with my experiences. I barely know what I'm doing, so this should be fun.

hammer added the documentation label on Feb 10, 2021
@hammer (Contributor, Author) commented Feb 10, 2021

Preamble

import numpy as np
import pandas as pd
import xarray as xr
import sgkit as sg

VCF_FILE = "https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz"
TXT_FILE = "https://storage.googleapis.com/hail-tutorial/1kg_annotations.txt"

Importing data from VCF

Hail:

hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')

sgkit:

sg.io.vcf.vcf_to_zarr(VCF_FILE, "1kg.zarr")
ds = xr.open_zarr("1kg.zarr")

Note that sometimes I see an error when using sg.io.vcf.vcf_to_zarr saying that sg.io.vcf doesn't exist, but from sgkit.io.vcf import vcf_to_zarr always works. I'm not sure what's going on in my environment, but it would be nice if we could just install all of io by default now that we have wheels.

Getting to know our data

mt.rows().select().show(5)

It's not so easy for us to build this table. Perhaps we should consider making a data frame out of the variant_ data arrays?

Here's an ugly thing:

from itertools import starmap

var_head = ds.isel(variants=slice(None, 5))
locus = list(starmap(
    lambda contig, pos: f"{contig}:{pos}",
    zip(var_head.variant_contig.values, var_head.variant_position.values),
))
list(zip(locus, var_head.variant_allele.values))

We don't seem to be picking up contig metadata such as the genome build (i.e. GRCh37), and we confusingly zero-index our contigs, so that set(ds.variant_contig.values) runs from 0 to 22. I suppose we can use [ds.contigs[contig] for contig in np.nditer(ds.variant_contig)] to look up a contig name; perhaps we should grow the ds.contigs attr to include all contig metadata?

Also, why are we padding variant_allele to length 4 with DEFAULT_ALT_NUMBER? I don't see any explanatory comments in scikit-allel's code.

mt.s.show(5) --> ds.sample_id.head(5).values

mt.entry.take(5)

Another one that's not so easy for us. I tried sg.display_genotypes(ds, max_variants=5, max_samples=1) but hit https://github.com/pystatgen/sgkit/issues/462. Even if that had worked, we're missing a lot of variant and genotype call metadata. Should we add that? It was discussed previously but I don't see an issue filed for it.

Adding column fields

I really felt the advantages of sgkit for these operations, though there are a few rough edges that would be nice to sand down.

Hail:

table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
mt = mt.annotate_cols(pheno = table[mt.s])

sgkit: the key here was renaming the Sample column from the TSV to sample_id and making sample_id a dimension coordinate, so that we can use pandas indexes to merge. Should we consider making sample_id a dimension coordinate by default? Should xarray allow specification of join columns with different names?

df = pd.read_csv(TXT_FILE, sep="\t", index_col="Sample")
ds_vcf = ds.swap_dims({"samples": "sample_id"})
ds_txt = df.to_xarray().rename({"Sample": "sample_id"})
ds_merged = ds_vcf.merge(ds_txt, join="left")
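
A quick sanity check that the join worked (a hypothetical snippet of mine, not from the tutorial, using the variable names above): the phenotype columns should now be indexed by sample_id alongside the genotype data.

# After the merge, phenotype variables are aligned on the sample_id dimension
ds_merged["SuperPopulation"].isel(sample_id=slice(0, 3)).values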

Query functions and the Hail Expression Language

Another section in which sgkit looks good. Hail's claim that it can use the same interfaces and query language to analyze much larger collections is handled for us by Dask. Also note the use of namedtuple instead of Hail's Struct class.

We of course can't plot the DP histogram because we don't have DP values! When we do have those values, though, I guess we can use xarray.plot.hist or da.histogram, though maybe we should encourage use of plotting libraries directly?

Hail:

table.aggregate(hl.agg.counter(table.SuperPopulation))
table.aggregate(hl.agg.stats(table.CaffeineConsumption))
mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))
mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption))

snp_counts = mt.aggregate_rows(hl.agg.counter(hl.Struct(ref=mt.alleles[0], alt=mt.alleles[1])))

p = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
show(p)

sgkit:

ds_txt.SuperPopulation.to_series().value_counts()
ds_txt.CaffeineConsumption.to_series().describe()
ds_merged.SuperPopulation.to_series().value_counts()
ds_merged.CaffeineConsumption.to_series().describe()

from collections import namedtuple
SNP = namedtuple('SNP', ['ref', 'alt'])
snp_counts = [SNP(a[0], a[1]) for a in ds_merged.variant_allele.values]
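
To complete the parallel with hl.agg.counter (a small addition of mine, not from the original comment), the pairs can be tallied with collections.Counter:

from collections import Counter

# Tally (ref, alt) pairs, mirroring hl.agg.counter
Counter(snp_counts).most_common(5)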

Quality Control

  • For some reason I am not seeing sg.sample_stats or sg.variant_stats. I do see sg.stats.aggregation.variant_stats, but not sg.stats.aggregation.sample_stats?!
  • ValueError: 'samples' not found in array dimensions ('variants', 'sample_id')

@tomwhite (Collaborator)

Note that sometimes I see an error when using sg.io.vcf.vcf_to_zarr that sg.io.vcf doesn't exist, but from sgkit.io.vcf import vcf_to_zarr always works.

The vcf_to_zarr function is not imported into the top-level sgkit namespace (i.e. by sgkit/__init__.py), so sg.io.vcf.vcf_to_zarr will give an error after import sgkit as sg. However, it will work if you have also run from sgkit.io.vcf import vcf_to_zarr, which is confusing.
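
A minimal illustration of that behaviour (importing a submodule registers it as an attribute of its parent package, which is standard Python):

import sgkit as sg

# sg.io.vcf.vcf_to_zarr  # AttributeError: module 'sgkit.io' has no attribute 'vcf'

from sgkit.io.vcf import vcf_to_zarr  # importing the submodule registers it

sg.io.vcf.vcf_to_zarr  # now resolves, which is the confusing part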

For the time being, I would suggest you use from sgkit.io.vcf import vcf_to_zarr - until we consolidate the "extra" packages into one.

For some reason I am not seeing sg.sample_stats or sg.variant_stats. I do see sg.stats.aggregation.variant_stats, but not sg.stats.aggregation.sample_stats?!

sample_stats was added after the 0.1.0a1 release, so you won't find it. variant_stats was there, and should show up as sg.variant_stats - it does for me when I try to import it, so I'm not sure why you can't import it.

It might be better to use the latest code from GitHub rather than the 0.1.0a1 release, so you also get recent fixes like #465:

pip install git+https://github.com/pystatgen/sgkit#egg=sgkit

BTW you should use sg.load_dataset instead of xr.open_zarr. See #298 for the rationale.

@tomwhite (Collaborator)

Also why are we padding variant_allele to length 4 with DEFAULT_ALT_NUMBER? I don't see any comments in scikit-allel's code.

We don't know the maximum number of alternate alleles ahead of parsing the whole VCF, so we have a fixed size that accommodates the number of alleles that we expect. This is the approach that scikit-allel takes. I've opened #470 to expose alt_number in the API so the user has more control over this. If you just want a bi-allelic representation you can set alt_number to 1, for example.
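
For example, a bi-allelic conversion might look like the sketch below. This is hedged: the notebook snippet quoted later in this thread spells the parameter max_alt_alleles, while #470 proposes alt_number, so the exact name may change.

from sgkit.io.vcf import vcf_to_zarr

# Keep a single alternate allele per variant, i.e. a bi-allelic representation
vcf_to_zarr(VCF_FILE, "1kg_biallelic.zarr", max_alt_alleles=1)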

@tomwhite (Collaborator)

You can set an index to show more variant information when displaying genotypes, like this:

>>> ds2 = ds.set_index({"variants": ("variant_contig", "variant_position", "variant_id")})
>>> sg.display_genotypes(ds2, max_variants=10, max_samples=5)
samples            HG00096 HG00099  ... NA21133 NA21143
variants                            ...                
(0, 904165, .)         0/0     0/0  ...     0/0     0/0
(0, 909917, .)         0/0     0/0  ...     0/0     0/0
(0, 986963, .)         0/0     0/0  ...     0/0     0/0
(0, 1563691, .)        ./.     0/0  ...     0/0     0/0
(0, 1707740, .)        0/1     0/1  ...     0/1     0/0
...                    ...     ...  ...     ...     ...
(22, 152660491, .)     ./.     0/0  ...     1/1     0/0
(22, 153031688, .)     0/0     0/0  ...     0/0     0/0
(22, 153674876, .)     0/0     0/0  ...     0/0     0/0
(22, 153706320, .)     ./.     0/0  ...     0/0     0/0
(22, 154087368, .)     0/0     1/1  ...     1/1     1/1

[10879 rows x 284 columns]

There's some discussion about the trade-offs/efficiency of setting indexes like this here: https://github.com/pystatgen/sgkit/pull/58#issuecomment-669907297

@tomwhite (Collaborator)

Regarding contigs, I agree that the variant_contig variable is a bit confusing, since it is an index, not the contig name. We could rename it to variant_contig_index, and/or introduce a variable for the contig name (variant_contig_name or just variant_contig). The zarr array for the names would be small since it would compress very well, but there might be some issues with the in-memory representation for large datasets (@eric-czech might have some thoughts on this).

It would be useful to add length and assembly information about contigs to the dataset. We could change the contigs attribute to be a list of dictionaries:

[
  {"id": "1", "length": "249250621", "assembly": "GRCh37"},
  {"id": "2", "length": "243199373", "assembly": "GRCh37"},
  ...
]

This would, however, be incompatible with the current contigs attribute, which is just a list of strings, but it would be easy to change to this representation (and we might also have the information in a variable as discussed above, so we'd need to change it anyway).
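
For concreteness, a lookup against that proposed representation might read as follows (hypothetical: this assumes the list of dicts is stored in ds.attrs["contigs"]):

# Resolve the first variant's contig metadata under the proposed
# list-of-dicts representation (hypothetical attribute layout)
contig = ds.attrs["contigs"][int(ds["variant_contig"].values[0])]
contig["id"], contig["length"], contig["assembly"]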

BTW to implement this we'd use code like this to get the contig information from cyvcf2:

>>> from cyvcf2 import VCF
>>> vcf = VCF("https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz")
>>> [h.info(extra=True) for h in vcf.header_iter() if h["HeaderType"] == "CONTIG"][0]
{'ID': '1', 'HeaderType': 'CONTIG', b'ID': b'1', b'length': b'249250621', b'assembly': b'GRCh37', b'IDX': b'0'}

@tomwhite (Collaborator)

Here's another way of displaying the first 5 variants:

>>> ds_variant = ds[[v for v in ds.data_vars if v.startswith("variant_")]]
>>> ds_variant["variant_contig_name"] = (["variants"], [ds.contigs[contig] for contig in ds["variant_contig"].values])
>>> ds_variant = ds_variant.drop_vars("variant_contig")
>>> df_variant = ds_variant.to_dataframe()
>>> df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: list(x)}).head(5)
                                                variant_allele
variant_contig_name variant_position variant_id               
1                   904165           .              [G, A, , ]
                    909917           .              [G, A, , ]
                    986963           .              [C, T, , ]
                    1563691          .              [T, G, , ]
                    1707740          .              [T, G, , ]

I agree it should be easier than this!

@tomwhite (Collaborator)

With #471, it's possible to create a histogram of DP values:

vcf_to_zarr(VCF_FILE, "1kg.zarr", format_fields=["DP"])
ds = sg.load_dataset("1kg.zarr")
dp = ds.call_DP.where(ds.call_DP >= 0) # filter out missing
xr.plot.hist(dp, range=(0,30), bins=30)

[Image: histogram of DP values]

@tomwhite (Collaborator)

I've got a first draft of the Hail tutorial in sgkit here (branch). (For comparison, the rendered Hail version is here.)

Overall, it's possible to replicate this GWAS workflow in sgkit. There are a number of outstanding issues/improvements that we could make:

  • The contig naming and indexing (Enable indexing on datasets by default #473) is still a bit fiddly to get right, and is needed for summaries and labelling chart axes.
  • The number of entries in the dataset after genotype QC filtering is not identical between the two notebooks. I spent a bit of time trying to track down why, but didn't get very far. (See filter_condition_ab in the notebook.)
  • The datasets have a substantial number of variables, which causes Xarray to collapse them in the HTML view, so new users might miss them. It would be nice if we could control this better. Also, the dataset attributes (which we are not so interested in) are shown expanded; it would be better if they were collapsed.
  • We don't have any plotting functions in sgkit, so I added functions for Manhattan and QQ plots into the notebook (a sketch of the QQ-plot idea follows this list). I think that's OK.
  • I haven't included the final section on Rare variant analysis. It's somewhat orthogonal to the rest of the notebook, so could be added later.
  • I've added a few notes about things specific to sgkit, but an overall narrative will need adding.
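
A minimal sketch of the QQ-plot idea, assuming matplotlib and the variant_linreg_p_value variable mentioned later in this thread (the notebook's actual implementation may differ):

import numpy as np
import matplotlib.pyplot as plt

def qq_plot(p_values):
    # Observed vs. expected -log10(p) under the uniform null
    observed = -np.log10(np.sort(np.asarray(p_values)))
    expected = -np.log10((np.arange(len(observed)) + 0.5) / len(observed))
    plt.scatter(expected, observed, s=4)
    lim = max(expected.max(), observed.max())
    plt.plot([0, lim], [0, lim], color="red")  # y = x reference line
    plt.xlabel("Expected -log10(p)")
    plt.ylabel("Observed -log10(p)")

# e.g. qq_plot(ds["variant_linreg_p_value"].values)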

It's not quite ready for a PR, but I think it would be useful to get some general feedback at this point.

I've also opened a few related issues that we should fix: #506, #507, #509

@jeromekelleher (Collaborator) commented Mar 24, 2021

We don't have any plotting functions in sgkit, so I added functions for Manhattan and QQ plots into the notebook. I think that's OK.

I agree we should keep plotting stuff out of sgkit - it's a separate thing, and we could do a whole other package on it.

@jeromekelleher (Collaborator)

I've had a quick look, and it looks great! We can make a few things more fluid, I agree, but overall the functionality is there. Very exciting!

@hammer (Contributor, Author) commented Mar 24, 2021

I agree we should keep plotting stuff out of sgkit

I'm not sure I agree with this assertion, but this discussion is out of scope for this issue. To be addressed later!

@eric-czech (Collaborator)

I had a few minor suggestions on the notebook after taking a closer look.

This could be a bit clearer:

df_vg = df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: tuple(x)})
df_vg.variant_allele.value_counts()
# To:
df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"])["variant_allele"].apply(tuple).value_counts()

Unnecessary output:

xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black")
(array([1.64000e+02, 1.09998e+05, 1.88947e+05, 2.60459e+05, 3.10216e+05,
        3.30876e+05, 3.24341e+05, 3.00648e+05, 2.60992e+05, 2.19818e+05,
# To
xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black"); # trailing semicolon suppresses non-graphical output  

Perhaps these parts could be more pipe-y?

ad1 = ds.call_AD.sel(dict(alleles=1)) # filter out missing
ad1 = ad1.where(ad1 >= 0)
adsum = ds.call_AD.sum(dim="alleles")
adsum = adsum.where(adsum != 0) # avoid divide by zero
ab = ad1 / adsum

# To:

# fill rows with nan where no alternate alleles were read or where sum of reads is 0
ad1 = ds.call_AD.sel(dict(alleles=1)).pipe(lambda v: v.where(v >= 0))
adsum = ds.call_AD.sum(dim="alleles").pipe(lambda v: v.where(v != 0))
# compute alternate allele read fraction
ab = ad1 / adsum

If I had to guess where the discrepancy with Hail is coming from, this could be it, if there are any partial calls (i.e. only one allele missing):

het = GT[..., 0] != GT[..., 1]

Do you know if that ever happens?
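
For concreteness, a tiny numpy illustration of how a partial call slips through that comparison (made-up genotypes, not from the notebook):

import numpy as np

GT = np.array([[0, 0], [0, 1], [-1, -1], [0, -1]])  # last entry is a partial call
het = GT[..., 0] != GT[..., 1]
# het -> [False, True, False, True]: the partial call [0, -1] compares as
# heterozygous, so it would need masking before the comparison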

A couple other thoughts:

  1. I think the indexing examples here are great since that is effectively the only way to join data in Xarray (afaik). I like it as-is rather than how it would be w/ https://github.com/pystatgen/sgkit/issues/473.
  2. I'm not sure what to do about the contig names .. it is a bit of an eyesore. Has an add_contig_names function been discussed?

@tomwhite (Collaborator)

Thanks @eric-czech for all the suggestions! I have included them in the latest notebook.

Do you know if that ever happens?

I think it does, but I need to dig in again to see if the counts are the same between the two.

2. I'm not sure what to do about the contig names .. it is a bit of an eyesore. Has an add_contig_names function been discussed?

It would be good to have some kind of convenience function for that. For the moment it's probably OK in the notebook, but if we find it's a common pattern that would be the time to add a function in the library for it.
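
A minimal sketch of what such a helper might look like, reusing the lookup from the earlier comment (the name add_contig_names is hypothetical; this is not part of sgkit):

import xarray as xr

def add_contig_names(ds: xr.Dataset) -> xr.Dataset:
    # Map each integer variant_contig index through the contigs attribute
    names = [ds.attrs["contigs"][i] for i in ds["variant_contig"].values]
    return ds.assign(variant_contig_name=("variants", names))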

@tomwhite (Collaborator)

  • The datasets have a substantial number of variables, which causes Xarray to collapse them in the HTML view, which means that new users might miss them. It would be nice if we could control this better. Also, the dataset attributes (that we are not so interested in) are shown expanded - it would be better if they were collapsed.

Now that pydata/xarray#5126 is in, when the next version of Xarray is released (0.17.1), we can put the following at the top of the notebook to get the effect we want:

xr.set_options(display_expand_attrs=False, display_expand_data_vars=True)

@tomwhite (Collaborator) commented May 7, 2021

Xarray 0.18.0 has been released with the expand/collapse change, so I updated the notebook here: https://nbviewer.jupyter.org/github/tomwhite/sgkit/blob/a6d850de9056f80a0f82ab72a4de158a630920f6/docs/examples/gwas_tutorial.ipynb

hammer added this to the 0.1.0 milestone on May 11, 2021
@hammer (Contributor, Author) commented Jun 8, 2021

@tomwhite you recommended

pip install git+https://github.com/pystatgen/sgkit#egg=sgkit

to pick up the latest commits. When I install this way, I get an ImportError with the message:

sgkit-vcf requirements are not installed.

Please install them via pip :

  pip install 'sgkit[vcf]'

I've tried a few different ways to resolve these import errors, and a manual installation of cyvcf2 and yarl does the trick. Is there a one-liner that will install everything from git without requiring the extra pip work?

Also, can you comment on the best way to go from VCF to Zarr today? Your notebook has commented out the line

# vcf_to_zarr(VCF_FILE, "1kg.zarr", max_alt_alleles=1,
#           fields=["FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
#           field_defs={"FORMAT/AD": {"Number": "R"}})

Is there a cleaner way to do it?

Thanks!

@hammer (Contributor, Author) commented Jun 8, 2021

A couple of other notes from working through the latest version of the notebook with the latest version of the code.

  1. ds = sg.hardy_weinberg_test(ds) generates a pretty scary MergeWarning. Do we want to suppress that somehow?
  2. In manhattan_plot and qq_plot: variant_p_value --> variant_linreg_p_value

@tomwhite (Collaborator) commented Jun 8, 2021

After the release it should be possible to do pip install sgkit 'sgkit[vcf]' or similar (or follow the new conda directions) to get everything you need. In the meantime you need the install from git.

Also, can you comment on the best way to go from VCF to Zarr today? Your notebook has commented out the line

That's what you need - I commented it out to avoid converting it over and over again when developing the notebook. It can be safely uncommented.

@hammer (Contributor, Author) commented Jun 8, 2021

@tomwhite okay, good deal. It's really unfortunate that there's no way to install from git and pick up the dependencies, and also unfortunate that we have such a clunky command to read from VCF to Zarr, but we can address those issues at another time!

Through some miracle of git I managed to grab the gwas-tutorial branch from your fork and get it going on my fork over at https://github.com/hammer/sgkit/blob/gwas-tutorial/docs/examples/gwas_tutorial.ipynb. I'll add some narrative text over there this week.
