A vignette for the Einstein Omics Club on running PCA from scratch on the HPC. Adapted from Kevin Blighe's Biostars post.
- plink v1.9
- plink2 v2.0
- jupyter 1.0.0
- python v3.12.1
- bcftools (tested on v1.10)
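Before starting, it can help to confirm that the tools listed above are visible on your `PATH`. The following is a minimal sketch using only the Python standard library; the executable names (`plink`, `plink2`, `bcftools`, `jupyter`, `python3`) are assumptions and may differ depending on how modules are installed on your HPC.

```python
import shutil

# Executable names are assumptions; HPC modules may expose them differently.
REQUIRED_TOOLS = ["plink", "plink2", "bcftools", "jupyter", "python3"]


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```

This only checks that each executable can be found; it does not verify versions.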
Approximate disk space required:
- Downloaded data (VCF.gz and tab-indices): ~15.5 GB
- Converted BCF files and their indices: ~14 GB
- Binary PLINK files: ~53 GB
- Pruned PLINK binary files: <1 GB
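Since the intermediate files add up quickly, a quick check of free space in your working directory may save a failed run. This is a rough sketch, assuming all stages are kept on the same filesystem and rounding the "<1 GB" entry up to 1 GB; the dictionary keys are just labels for illustration.

```python
import shutil

# Approximate sizes in GB, taken from the figures above ("<1 GB" rounded up to 1).
STAGE_SIZES_GB = {
    "downloaded VCF.gz and tab-indices": 15.5,
    "converted BCF files and indices": 14.0,
    "binary PLINK files": 53.0,
    "pruned PLINK binary files": 1.0,
}


def total_required_gb(sizes=STAGE_SIZES_GB):
    """Total space needed if every stage is kept on disk at once."""
    return sum(sizes.values())


def has_enough_space(path=".", sizes=STAGE_SIZES_GB):
    """Return (enough, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_required_gb(sizes), free_gb


if __name__ == "__main__":
    enough, free_gb = has_enough_space(".")
    print(f"need ~{total_required_gb():.1f} GB, have {free_gb:.1f} GB free")
    print("OK" if enough else "Not enough space; consider deleting intermediates")
```

If space is tight, intermediate files from earlier stages can be removed once the next stage has completed.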
The goal of this tutorial is to learn how to analyze genetic variants to understand how a population is structured. The tools we will use are standard analysis programs in population genetics and have been applied to studying human genetic diversity across many populations. The standard data type for these genetic ancestry analyses is the single nucleotide polymorphism (SNP), generated by genotyping or whole-genome sequencing (WGS) of human biological samples.
Warning from Joe Pickrell: "The tools you will run in this tutorial provide summaries of the data that are informative about this question. However, be aware that there is no black box solution--all methods can be misleading, and the main difficulty is not running software, but instead interpreting the results."
We will be analyzing SNPs from the 1000 Genomes Project (Phase 3):
- 2,504 individuals from 26 populations
- 84,700,000 SNPs from low coverage WGS, deep exome sequencing, and dense microarray genotyping.
- 3,600,000 small insertions/deletions (indels)
- 60,000 structural variants
- Haplotype phased
- Captures more than 99% of SNP variants that occur at >1% frequency across a variety of ancestries.
- We will be using the integrated phased dataset for unrelated samples (aligned to GRCh37)
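To make the data type concrete: in a phased VCF, each sample's genotype at a site is written as two alleles separated by `|` (for example `0|1`), where `0` is the reference allele and `1` the alternate. The sketch below is an illustration only and is not part of the repository's scripts. It assumes biallelic sites and diploid calls, and shows how such genotypes become the 0/1/2 alternate-allele counts that PCA operates on, and how an allele frequency follows from them.

```python
def gt_to_dosage(gt):
    """Convert a VCF GT string such as '0|1' into an alternate-allele count.

    Returns None when any allele is missing ('.'). Assumes a biallelic site.
    """
    alleles = gt.replace("/", "|").split("|")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def alt_allele_frequency(gts):
    """Alternate-allele frequency over called genotypes (assumes diploid calls)."""
    dosages = [d for d in map(gt_to_dosage, gts) if d is not None]
    if not dosages:
        return None
    return sum(dosages) / (2 * len(dosages))


genotypes = ["0|0", "0|1", "1|0", "1|1", ".|."]
print([gt_to_dosage(g) for g in genotypes])  # [0, 1, 1, 2, None]
print(alt_allele_frequency(genotypes))       # 0.5
```

Note that `0|1` and `1|0` differ in phase but give the same dosage, so phase information is not used by a dosage-based PCA.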
To get all the scripts used for this analysis:
git clone [email protected]:DavYang/PCA_Vignette.git
Follow the steps in data_processing.md
- Download the dataset (1000 Genomes Project SNPs)
- Download the reference genome (GRCh37)
- Convert to BCF file format
- Convert to PLINK format
- Prune variants for ancestry-informative markers
- Merge across all chromosomes
- Run Principal Components Analysis (PCA)
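The PLINK-based steps above (convert, prune, merge, PCA) can be sketched as a dry run that only builds and prints command lines. This is a hedged outline, not the contents of `data_processing.md`, which remains the authoritative reference. The file names (`chr22.bcf`, `merge_list.txt`, `merged`, `pca`), the restriction to autosomes, and the pruning thresholds (`50 5 0.2`) are all assumptions chosen for illustration. The PLINK 1.9 flags used (`--bcf`, `--make-bed`, `--indep-pairwise`, `--extract`, `--merge-list`, `--pca`) exist, but real data may need additional options.

```python
import shlex

# Autosomes only; an assumption made for this sketch.
CHROMS = [str(c) for c in range(1, 23)]


def per_chromosome_commands(chrom, prefix="chr"):
    """Commands to convert one BCF to PLINK format and LD-prune it."""
    name = f"{prefix}{chrom}"  # hypothetical file prefix, e.g. chr22
    return [
        # Convert BCF to PLINK binary format (.bed/.bim/.fam).
        ["plink", "--bcf", f"{name}.bcf", "--make-bed", "--out", name],
        # Identify variants in approximate linkage equilibrium;
        # writes {name}.prune.in. Window, step, and r^2 are illustrative.
        ["plink", "--bfile", name, "--indep-pairwise", "50", "5", "0.2",
         "--out", name],
        # Keep only the retained variants.
        ["plink", "--bfile", name, "--extract", f"{name}.prune.in",
         "--make-bed", "--out", f"{name}.pruned"],
    ]


def merge_list_lines(chroms=CHROMS, prefix="chr"):
    """One pruned fileset prefix per line, as read by --merge-list."""
    return [f"{prefix}{c}.pruned" for c in chroms]


def merge_and_pca_commands(merge_list="merge_list.txt"):
    """Commands to merge all pruned filesets and run PCA on the result."""
    return [
        ["plink", "--merge-list", merge_list, "--make-bed", "--out", "merged"],
        ["plink", "--bfile", "merged", "--pca", "--out", "pca"],
    ]


if __name__ == "__main__":
    for chrom in CHROMS:
        for cmd in per_chromosome_commands(chrom):
            print(shlex.join(cmd))
    print("# contents of merge_list.txt:")
    print("\n".join(merge_list_lines()))
    for cmd in merge_and_pca_commands():
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to review each step, or to paste the output into a job script, before committing HPC time.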