Replicating Hail's GWAS Tutorial #463

Closed · hammer opened this issue Feb 10, 2021 · 19 comments · Fixed by #629

Labels: documentation (Improvements or additions to documentation)

@hammer (Contributor) commented Feb 10, 2021

I decided to work on https://github.com/pystatgen/sgkit/issues/88 by trying to implement the Hail GWAS Tutorial with sgkit.

I'll update this issue with my experiences. I barely know what I'm doing, so this should be fun.

hammer added the documentation label on Feb 10, 2021
@hammer (Contributor, Author) commented Feb 10, 2021

Preamble

import numpy as np
import pandas as pd
import xarray as xr
import sgkit as sg

VCF_FILE = "https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz"
TXT_FILE = "https://storage.googleapis.com/hail-tutorial/1kg_annotations.txt"

Importing data from VCF

Hail:

hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
mt = hl.read_matrix_table('data/1kg.mt')

sgkit:

sg.io.vcf.vcf_to_zarr(VCF_FILE, "1kg.zarr")
ds = xr.open_zarr("1kg.zarr")

Note that sometimes I see an error when using sg.io.vcf.vcf_to_zarr saying that sg.io.vcf doesn't exist, but from sgkit.io.vcf import vcf_to_zarr always works. I'm not sure what's going on in my environment, but it would be nice if we could just install all of io by default now that we have wheels.

Getting to know our data

mt.rows().select().show(5)

It's not so easy for us to build this table. Perhaps we should consider making a data frame out of the variant_ data arrays?

Here's an ugly thing:

from itertools import starmap

var_head = ds.isel(variants=slice(None, 5))
locus = list(starmap(
    lambda contig, pos: f"{contig}:{pos}",
    zip(var_head.variant_contig.values, var_head.variant_position.values),
))
list(zip(locus, var_head.variant_allele.values))

We don't seem to be picking up contig metadata such as the genome build (i.e. GRCh37), and we confusingly zero-index our contigs, so that set(ds.variant_contig.values) runs from 0 to 22. I suppose we can use [ds.contigs[contig] for contig in np.nditer(ds.variant_contig)] to look up a contig name; perhaps we should grow the ds.contigs attr to include all contig metadata?

Also, why are we padding variant_allele to length 4 with DEFAULT_ALT_NUMBER? I don't see any explanatory comments in scikit-allel's code.

mt.s.show(5) --> ds.sample_id.head(5).values

mt.entry.take(5)

Another one that's not so easy for us. I tried sg.display_genotypes(ds, max_variants=5, max_samples=1) but hit https://github.com/pystatgen/sgkit/issues/462. Even if that had worked, we're missing a lot of variant and genotype call metadata. Should we add that? It was discussed previously but I don't see an issue filed for it.

Adding column fields

I really felt the advantages of sgkit for these operations, though there are a few rough edges that would be nice to sand down.

Hail:

table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
mt = mt.annotate_cols(pheno = table[mt.s])

sgkit: the key here was renaming the Sample column from the TSV to sample_id and making sample_id a dimension coordinate, so that we can use pandas indexes to merge. Should we consider making sample_id a dimension coordinate by default? Should xarray allow specification of join columns with different names?

df = pd.read_csv(TXT_FILE, sep="\t", index_col="Sample")
ds_vcf = ds.swap_dims({"samples": "sample_id"})
ds_txt = df.to_xarray().rename({"Sample": "sample_id"})
ds_merged = ds_vcf.merge(ds_txt, join="left")
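
A quick sanity check that the join worked (a hypothetical snippet of mine, not from the tutorial, using the variable names above): the phenotype columns should now be indexed by sample_id alongside the genotype data.

# After the merge, phenotype variables are aligned on the sample_id dimension
ds_merged["SuperPopulation"].isel(sample_id=slice(0, 3)).values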

Query functions and the Hail Expression Language

Another section in which sgkit looks good. Hail's claim that it can use the same interfaces and query language to analyze much larger collections is handled for us by Dask. Also note the use of namedtuple instead of Hail's Struct class.

We of course can't plot the DP histogram because we don't have DP values! When we do have those values, though, I guess we can use xarray.plot.hist or da.histogram, though maybe we should encourage use of plotting libraries directly?

Hail:

table.aggregate(hl.agg.counter(table.SuperPopulation))
table.aggregate(hl.agg.stats(table.CaffeineConsumption))
mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))
mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption))

snp_counts = mt.aggregate_rows(hl.agg.counter(hl.Struct(ref=mt.alleles[0], alt=mt.alleles[1])))

p = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
show(p)

sgkit:

ds_txt.SuperPopulation.to_series().value_counts()
ds_txt.CaffeineConsumption.to_series().describe()
ds_merged.SuperPopulation.to_series().value_counts()
ds_merged.CaffeineConsumption.to_series().describe()

from collections import namedtuple
SNP = namedtuple('SNP', ['ref', 'alt'])
snp_counts = [SNP(a[0], a[1]) for a in ds_merged.variant_allele.values]
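
To complete the parallel with hl.agg.counter (a small addition of mine, not from the original comment), the pairs can be tallied with collections.Counter:

from collections import Counter

# Tally (ref, alt) pairs, mirroring hl.agg.counter
Counter(snp_counts).most_common(5)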

Quality Control

  • For some reason I am not seeing sg.sample_stats or sg.variant_stats. I do see sg.stats.aggregation.variant_stats, but not sg.stats.aggregation.sample_stats?!
  • ValueError: 'samples' not found in array dimensions ('variants', 'sample_id')

@tomwhite (Collaborator)

Note that sometimes I see an error when using sg.io.vcf.vcf_to_zarr that sg.io.vcf doesn't exist, but from sgkit.io.vcf import vcf_to_zarr always works.

The vcf_to_zarr function is not imported into the top-level sgkit namespace (i.e. by sgkit/__init__.py), so sg.io.vcf.vcf_to_zarr will give an error after import sgkit as sg. However, it will work if you have also run from sgkit.io.vcf import vcf_to_zarr, which is confusing.
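
A minimal illustration of that behaviour (importing a submodule registers it as an attribute of its parent package, which is standard Python):

import sgkit as sg

# sg.io.vcf.vcf_to_zarr  # AttributeError: module 'sgkit.io' has no attribute 'vcf'

from sgkit.io.vcf import vcf_to_zarr  # importing the submodule registers it

sg.io.vcf.vcf_to_zarr  # now resolves, which is the confusing part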

For the time being, I would suggest you use from sgkit.io.vcf import vcf_to_zarr - until we consolidate the "extra" packages into one.

For some reason I am not seeing sg.sample_stats or sg.variant_stats. I do see sg.stats.aggregation.variant_stats, but not sg.stats.aggregation.sample_stats?!

sample_stats was added after the 0.1.0a1 release, so you won't find it. variant_stats was there, and should show up as sg.variant_stats - it does for me when I try to import it, so I'm not sure why you can't import it.

It might be better to use the latest code from GitHub rather than the 0.1.0a1 release, so you also get recent fixes like #465:

pip install git+https://github.com/pystatgen/sgkit#egg=sgkit

BTW you should use sg.load_dataset instead of xr.open_zarr. See #298 for the rationale.

@tomwhite (Collaborator)

Also why are we padding variant_allele to length 4 with DEFAULT_ALT_NUMBER? I don't see any comments in scikit-allel's code.

We don't know the maximum number of alternate alleles ahead of parsing the whole VCF, so we have a fixed size that accommodates the number of alleles that we expect. This is the approach that scikit-allel takes. I've opened #470 to expose alt_number in the API so the user has more control over this. If you just want a bi-allelic representation you can set alt_number to 1, for example.
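
For example, a bi-allelic conversion might look like the sketch below. This is hedged: the notebook snippet quoted later in this thread spells the parameter max_alt_alleles, while #470 proposes alt_number, so the exact name may change.

from sgkit.io.vcf import vcf_to_zarr

# Keep a single alternate allele per variant, i.e. a bi-allelic representation
vcf_to_zarr(VCF_FILE, "1kg_biallelic.zarr", max_alt_alleles=1)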

@tomwhite (Collaborator)

You can set an index to show more variant information when displaying genotypes, like this:

>>> ds2 = ds.set_index({"variants": ("variant_contig", "variant_position", "variant_id")})
>>> sg.display_genotypes(ds2, max_variants=10, max_samples=5)
samples            HG00096 HG00099  ... NA21133 NA21143
variants                            ...                
(0, 904165, .)         0/0     0/0  ...     0/0     0/0
(0, 909917, .)         0/0     0/0  ...     0/0     0/0
(0, 986963, .)         0/0     0/0  ...     0/0     0/0
(0, 1563691, .)        ./.     0/0  ...     0/0     0/0
(0, 1707740, .)        0/1     0/1  ...     0/1     0/0
...                    ...     ...  ...     ...     ...
(22, 152660491, .)     ./.     0/0  ...     1/1     0/0
(22, 153031688, .)     0/0     0/0  ...     0/0     0/0
(22, 153674876, .)     0/0     0/0  ...     0/0     0/0
(22, 153706320, .)     ./.     0/0  ...     0/0     0/0
(22, 154087368, .)     0/0     1/1  ...     1/1     1/1

[10879 rows x 284 columns]

There's some discussion about the trade-offs/efficiency of setting indexes like this here: https://github.com/pystatgen/sgkit/pull/58#issuecomment-669907297

@tomwhite (Collaborator)

Regarding contigs, I agree that the variant_contig variable is a bit confusing, since it is an index, not the contig name. We could rename it to variant_contig_index, and/or introduce a variable for the contig name (variant_contig_name or just variant_contig). The zarr array for the names would be small since it would compress very well, but there might be some issues with the in-memory representation for large datasets (@eric-czech might have some thoughts on this).

It would be useful to add length and assembly information about contigs to the dataset. We could change the contigs attribute to be a list of dictionaries:

[
  {"id": "1", "length": "249250621", "assembly": "GRCh37"},
  {"id": "2", "length": "243199373", "assembly": "GRCh37"},
  ...
]

This would, however, be incompatible with the current contigs attribute, which is just a list of strings, but it would be easy to change to this representation (and we might also have the information in a variable as discussed above, so we'd need to change it anyway).
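
For concreteness, a lookup against that proposed representation might read as follows (hypothetical: this assumes the list of dicts is stored in ds.attrs["contigs"]):

# Resolve the first variant's contig metadata under the proposed
# list-of-dicts representation (hypothetical attribute layout)
contig = ds.attrs["contigs"][int(ds["variant_contig"].values[0])]
contig["id"], contig["length"], contig["assembly"]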

BTW to implement this we'd use code like this to get the contig information from cyvcf2:

>>> from cyvcf2 import VCF
>>> vcf = VCF("https://storage.googleapis.com/hail-tutorial/1kg.vcf.bgz")
>>> [h.info(extra=True) for h in vcf.header_iter() if h["HeaderType"] == "CONTIG"][0]
{'ID': '1', 'HeaderType': 'CONTIG', b'ID': b'1', b'length': b'249250621', b'assembly': b'GRCh37', b'IDX': b'0'}

@tomwhite (Collaborator)

Here's another way of displaying the first 5 variants:

>>> ds_variant = ds[[v for v in ds.data_vars if v.startswith("variant_")]]
>>> ds_variant["variant_contig_name"] = (["variants"], [ds.contigs[contig] for contig in ds["variant_contig"].values])
>>> ds_variant = ds_variant.drop_vars("variant_contig")
>>> df_variant = ds_variant.to_dataframe()
>>> df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: list(x)}).head(5)
                                                variant_allele
variant_contig_name variant_position variant_id               
1                   904165           .              [G, A, , ]
                    909917           .              [G, A, , ]
                    986963           .              [C, T, , ]
                    1563691          .              [T, G, , ]
                    1707740          .              [T, G, , ]

I agree it should be easier than this!

@tomwhite (Collaborator)

With #471, it's possible to create a histogram of DP values:

vcf_to_zarr(VCF_FILE, "1kg.zarr", format_fields=["DP"])
ds = sg.load_dataset("1kg.zarr")
dp = ds.call_DP.where(ds.call_DP >= 0) # filter out missing
xr.plot.hist(dp, range=(0,30), bins=30)

[Image: histogram of DP values]

@tomwhite (Collaborator)

I've got a first draft of the Hail tutorial in sgkit here (branch). (For comparison, the rendered Hail version is here.)

Overall, it's possible to replicate this GWAS workflow in sgkit. There are a number of outstanding issues/improvements that we could make:

  • The contig naming and indexing (Enable indexing on datasets by default #473) is still a bit fiddly to get right, and is needed for summaries and labelling chart axes.
  • The number of entries in the dataset after genotype QC filtering is not identical between the two notebooks. I spent a bit of time trying to track down why, but didn't get very far. (See filter_condition_ab in the notebook.)
  • The datasets have a substantial number of variables, which causes Xarray to collapse them in the HTML view, so new users might miss them. It would be nice if we could control this better. Also, the dataset attributes (which we are not so interested in) are shown expanded; it would be better if they were collapsed.
  • We don't have any plotting functions in sgkit, so I added functions for Manhattan and QQ plots into the notebook (a sketch of the QQ-plot idea follows this list). I think that's OK.
  • I haven't included the final section on Rare variant analysis. It's somewhat orthogonal to the rest of the notebook, so could be added later.
  • I've added a few notes about things specific to sgkit, but an overall narrative will need adding.
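
A minimal sketch of the QQ-plot idea, assuming matplotlib and the variant_linreg_p_value variable mentioned later in this thread (the notebook's actual implementation may differ):

import numpy as np
import matplotlib.pyplot as plt

def qq_plot(p_values):
    # Observed vs. expected -log10(p) under the uniform null
    observed = -np.log10(np.sort(np.asarray(p_values)))
    expected = -np.log10((np.arange(len(observed)) + 0.5) / len(observed))
    plt.scatter(expected, observed, s=4)
    lim = max(expected.max(), observed.max())
    plt.plot([0, lim], [0, lim], color="red")  # y = x reference line
    plt.xlabel("Expected -log10(p)")
    plt.ylabel("Observed -log10(p)")

# e.g. qq_plot(ds["variant_linreg_p_value"].values)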

It's not quite ready for a PR, but I think it would be useful to get some general feedback at this point.

I've also opened a few related issues that we should fix: #506, #507, #509

@jeromekelleher (Collaborator) commented Mar 24, 2021

We don't have any plotting functions in sgkit, so I added functions for Manhattan and QQ plots into the notebook. I think that's OK.

I agree we should keep plotting stuff out of sgkit - it's a separate thing, and we could do a whole other package on it.

@jeromekelleher (Collaborator)

I've had a quick look, and it looks great! We can make a few things more fluid, I agree, but overall the functionality is there. Very exciting!

@hammer (Contributor, Author) commented Mar 24, 2021

I agree we should keep plotting stuff out of sgkit

I'm not sure I agree with this assertion, but this discussion is out of scope for this issue. To be addressed later!

@eric-czech (Collaborator)

I had a few minor suggestions on the notebook after taking a closer look.

This could be a bit clearer:

df_vg = df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: tuple(x)})
df_vg.variant_allele.value_counts()
# To:
df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"])["variant_allele"].apply(tuple).value_counts()

Unnecessary output:

xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black")
(array([1.64000e+02, 1.09998e+05, 1.88947e+05, 2.60459e+05, 3.10216e+05,
        3.30876e+05, 3.24341e+05, 3.00648e+05, 2.60992e+05, 2.19818e+05,
# To
xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black"); # trailing semicolon suppresses non-graphical output  

Perhaps these parts could be more pipe-y?

ad1 = ds.call_AD.sel(dict(alleles=1)) # filter out missing
ad1 = ad1.where(ad1 >= 0)
adsum = ds.call_AD.sum(dim="alleles")
adsum = adsum.where(adsum != 0) # avoid divide by zero
ab = ad1 / adsum

# To:

# fill rows with nan where no alternate alleles were read or where sum of reads is 0
ad1 = ds.call_AD.sel(dict(alleles=1)).pipe(lambda v: v.where(v >= 0))
adsum = ds.call_AD.sum(dim="alleles").pipe(lambda v: v.where(v != 0))
# compute alternate allele read fraction
ab = ad1 / adsum

If I had to guess where the discrepancy with Hail is coming from, this could be it, if there are any partial calls (i.e. only one allele missing):

het = GT[..., 0] != GT[..., 1]

Do you know if that ever happens?
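
For concreteness, a tiny numpy illustration of how a partial call slips through that comparison (made-up genotypes, not from the notebook):

import numpy as np

GT = np.array([[0, 0], [0, 1], [-1, -1], [0, -1]])  # last entry is a partial call
het = GT[..., 0] != GT[..., 1]
# het -> [False, True, False, True]: the partial call [0, -1] compares as
# heterozygous, so it would need masking before the comparison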

A couple other thoughts:

  1. I think the indexing examples here are great since that is effectively the only way to join data in Xarray (afaik). I like it as-is rather than how it would be w/ https://github.com/pystatgen/sgkit/issues/473.
  2. I'm not sure what to do about the contig names .. it is a bit of an eyesore. Has an add_contig_names function been discussed?

@tomwhite (Collaborator)

Thanks @eric-czech for all the suggestions! I have included them in the latest notebook.

Do you know if that ever happens?

I think it does, but I need to dig in again to see if the counts are the same between the two.

2. I'm not sure what to do about the contig names .. it is a bit of an eyesore. Has an add_contig_names function been discussed?

It would be good to have some kind of convenience function for that. For the moment it's probably OK in the notebook, but if we find it's a common pattern that would be the time to add a function in the library for it.
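
A minimal sketch of what such a helper might look like, reusing the lookup from the earlier comment (the name add_contig_names is hypothetical; this is not part of sgkit):

import xarray as xr

def add_contig_names(ds: xr.Dataset) -> xr.Dataset:
    # Map each integer variant_contig index through the contigs attribute
    names = [ds.attrs["contigs"][i] for i in ds["variant_contig"].values]
    return ds.assign(variant_contig_name=("variants", names))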

@tomwhite (Collaborator)

  • The datasets have a substantial number of variables, which causes Xarray to collapse them in the HTML view, which means that new users might miss them. It would be nice if we could control this better. Also, the dataset attributes (that we are not so interested in) are shown expanded - it would be better if they were collapsed.

Now that pydata/xarray#5126 is in, when the next version of Xarray is released (0.17.1), we can put the following at the top of the notebook to get the effect we want:

xr.set_options(display_expand_attrs=False, display_expand_data_vars=True)

@tomwhite (Collaborator) commented May 7, 2021

Xarray 0.18.0 has been released with the expand/collapse change, so I updated the notebook here: https://nbviewer.jupyter.org/github/tomwhite/sgkit/blob/a6d850de9056f80a0f82ab72a4de158a630920f6/docs/examples/gwas_tutorial.ipynb

hammer added this to the 0.1.0 milestone on May 11, 2021
@hammer (Contributor, Author) commented Jun 8, 2021

@tomwhite you recommended

pip install git+https://github.com/pystatgen/sgkit#egg=sgkit

to pick up the latest commits. When I install this way, I get an ImportError with the message:

sgkit-vcf requirements are not installed.

Please install them via pip :

  pip install 'sgkit[vcf]'

I've tried a few different ways to resolve these import errors, and a manual installation of cyvcf2 and yarl does the trick. Is there a one-liner that will install everything from git without requiring the extra pip work?

Also, can you comment on the best way to go from VCF to Zarr today? Your notebook has commented out the line

# vcf_to_zarr(VCF_FILE, "1kg.zarr", max_alt_alleles=1,
#           fields=["FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
#           field_defs={"FORMAT/AD": {"Number": "R"}})

Is there a cleaner way to do it?

Thanks!

@hammer (Contributor, Author) commented Jun 8, 2021

A couple of other notes from working through the latest version of the notebook with the latest version of the code.

  1. ds = sg.hardy_weinberg_test(ds) generates a pretty scary MergeWarning. Do we want to suppress that somehow?
  2. In manhattan_plot and qq_plot: variant_p_value --> variant_linreg_p_value

@tomwhite (Collaborator) commented Jun 8, 2021

After the release it should be possible to do pip install sgkit 'sgkit[vcf]' or similar (or follow the new conda directions) to get everything you need. In the meantime you need the install from git.

Also, can you comment on the best way to go from VCF to Zarr today? Your notebook has commented out the line

That's what you need - I commented it out to avoid converting it over and over again when developing the notebook. It can be safely uncommented.

@hammer (Contributor, Author) commented Jun 8, 2021

@tomwhite okay, good deal. It's really unfortunate that there's no way to install from git and pick up the dependencies, and also unfortunate that we have such a clunky command to read from VCF to Zarr, but we can address those issues at another time!

Through some miracle of git I managed to grab the gwas-tutorial branch from your fork and get it going on my fork over at https://github.com/hammer/sgkit/blob/gwas-tutorial/docs/examples/gwas_tutorial.ipynb. I'll add some narrative text over there this week.
