test-pca

A vignette for running principal component analysis (PCA) from scratch on the HPC, written for the Einstein Omics Club. Adapted from a Biostars post by Kevin Blighe.

Program requirements:

plink v1.9
plink2 v2.0
jupyter 1.0.0
python v3.12.1
bcftools (tested on v1.10)

Approximate disk space requirements:

Downloaded data (VCF.gz and tab-indices): ~15.5 GB
Converted BCF files and their indices: ~14 GB
Binary PLINK files: ~53 GB
Pruned PLINK binary files: <1 GB
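Before starting, it may be worth checking that the working directory has room for every stage. The sketch below is only an illustration: the function names are hypothetical, the sizes are the approximate figures listed above (with "<1 GB" rounded up to 1 GB), and it assumes no intermediate files are deleted along the way.

```python
import shutil

# Approximate sizes in GB, taken from the figures listed above.
REQUIRED_GB = {
    "downloaded VCF.gz files and indices": 15.5,
    "converted BCF files and indices": 14.0,
    "binary PLINK files": 53.0,
    "pruned PLINK binary files": 1.0,  # listed as "<1 GB"; rounded up (assumption)
}


def total_required_gb(requirements=REQUIRED_GB):
    """Sum the per-stage estimates, assuming every stage is kept on disk."""
    return sum(requirements.values())


def has_enough_space(path=".", requirements=REQUIRED_GB):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_required_gb(requirements)


if __name__ == "__main__":
    print(f"Need about {total_required_gb():.1f} GB in total")
    print("Enough free space here:", has_enough_space("."))
```

On an HPC, pass the scratch or project directory you plan to work in as `path`, since quotas often differ between filesystems.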

Introduction to the Vignette

The goal of this tutorial is to learn how to analyze genetic variants to understand how a population is structured. The tools we will use are analysis programs commonly used in population genetics, and they have been applied to studying human genetic diversity across many populations. The standard data type for these genetic ancestry analyses is the single nucleotide polymorphism (SNP), generated by genotyping or whole-genome sequencing (WGS) of human biological samples.
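To make the idea of "structure" concrete, here is a toy sketch of PCA on a tiny, made-up genotype matrix (rows are samples, columns are SNPs, values are alternate-allele counts). Everything in it is an illustrative assumption: the data are invented, the function name is hypothetical, and the method (power iteration on a centered sample-by-sample matrix) is a simplification, not what PLINK does internally.

```python
import math


def top_principal_component(genotypes, iterations=200):
    """Return each sample's coordinate on the first principal component.

    genotypes: list of samples, each a list of alternate-allele counts (0, 1, 2).
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])

    # Center each SNP (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]

    # Sample-by-sample similarity matrix (n_samples x n_samples).
    gram = [
        [sum(a * b for a, b in zip(r1, r2)) / n_snps for r2 in centered]
        for r1 in centered
    ]

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [float(i + 1) for i in range(n_samples)]
    for _ in range(iterations):
        nxt = [sum(gram[i][k] * vec[k] for k in range(n_samples))
               for i in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Invented data: three samples carrying mostly reference alleles,
# then three samples carrying mostly alternate alleles.
TOY_GENOTYPES = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [2, 2, 2, 1],
]

if __name__ == "__main__":
    for i, coord in enumerate(top_principal_component(TOY_GENOTYPES)):
        print(f"sample {i}: PC1 = {coord:+.3f}")
```

In this toy case the first three samples land on one side of zero and the last three on the other, which is the kind of separation a PCA plot of real populations makes visible. The sign of a principal component is arbitrary, so only the grouping is meaningful.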

Warning from Joe Pickrell: "The tools you will run in this tutorial provide summaries of the data that are informative about this question. However, be aware that there is no black box solution--all methods can be misleading, and the main difficulty is not running software, but instead interpreting the results."

Dataset Description (1KG Phase 3)

We will be analyzing SNPs from the 1000 Genomes Project (Phase 3):

2,504 individuals from 26 populations
84,700,000 SNPs from low-coverage WGS, deep exome sequencing, and dense microarray genotyping
3,600,000 small insertions/deletions (indels)
60,000 structural variants
Haplotype phased
Captures more than 99% of SNP variants that occur at >1% frequency across a variety of ancestries
We will be using the integrated phased dataset for unrelated samples (aligned to GRCh37).
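Because the dataset is haplotype phased, genotypes in the VCF files are written with a `|` separator (for example `0|1`), whereas PCA works on allele counts. The helper below is a minimal sketch of that conversion; the function name is hypothetical, and it treats any non-reference allele as "alternate", which is a simplifying assumption for multi-allelic sites.

```python
def alt_allele_count(gt):
    """Convert a VCF GT string to a count of non-reference alleles.

    Accepts phased ("0|1") and unphased ("0/1") calls, as well as haploid
    calls ("1"). Returns None when any allele is missing (".").
    """
    alleles = gt.replace("/", "|").split("|")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


if __name__ == "__main__":
    for gt in ["0|0", "0|1", "1|0", "1|1", "./."]:
        print(gt, "->", alt_allele_count(gt))
```

Note that `0|1` and `1|0` give the same count: phase matters for haplotype-based methods but is discarded by a PCA on allele counts.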

Analysis Pipeline Overview

To get all the scripts used for this analysis:

git clone git@github.com:DavYang/PCA_Vignette.git

Follow the steps in data_processing.md

Download the dataset (1KG SNPs)
Download the reference genome (GRCh37)
Convert to BCF file format
Convert to PLINK format
Prune variants for ancestry-informative markers
Merge across all chromosomes
Run Principal Components Analysis (PCA)
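The conversion, pruning, merge, and PCA steps above can be sketched as a dry run that only builds the command lines and prints them. This is not the repository's scripts: the file names, the autosomes-only chromosome list, and the pruning parameters (`50 5 0.2`) are assumptions chosen for illustration. The download steps are left out because they depend on URLs not given here, and real data usually also needs normalization and unique variant IDs before merging.

```python
import shlex

CHROMOSOMES = [str(c) for c in range(1, 23)]  # autosomes only (assumption)


def commands_for_chromosome(chrom, vcf_pattern="chr{chrom}.vcf.gz"):
    """Build (not run) the per-chromosome commands for conversion and pruning."""
    vcf = vcf_pattern.format(chrom=chrom)  # hypothetical input file name
    prefix = f"chr{chrom}"
    return [
        # Convert VCF to BCF.
        ["bcftools", "view", "-O", "b", "-o", f"{prefix}.bcf", vcf],
        # Convert BCF to a PLINK binary fileset (.bed/.bim/.fam).
        ["plink", "--bcf", f"{prefix}.bcf", "--make-bed", "--out", prefix],
        # Find variants in approximate linkage equilibrium
        # (window 50 variants, step 5, r^2 threshold 0.2: illustrative values).
        ["plink", "--bfile", prefix, "--indep-pairwise", "50", "5", "0.2",
         "--out", prefix],
        # Keep only the variants retained by pruning.
        ["plink", "--bfile", prefix, "--extract", f"{prefix}.prune.in",
         "--make-bed", "--out", f"{prefix}.pruned"],
    ]


def merge_list_text(chromosomes=CHROMOSOMES):
    """Contents of a PLINK merge list: one pruned fileset prefix per line."""
    return "".join(f"chr{c}.pruned\n" for c in chromosomes)


def merge_and_pca_commands(merge_list="merge_list.txt"):
    """Build (not run) the commands for merging chromosomes and running PCA."""
    return [
        ["plink", "--merge-list", merge_list, "--make-bed", "--out", "merged"],
        ["plink", "--bfile", "merged", "--pca", "--out", "pca"],
    ]


if __name__ == "__main__":
    for chrom in CHROMOSOMES:
        for cmd in commands_for_chromosome(chrom):
            print(shlex.join(cmd))
    for cmd in merge_and_pca_commands():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them, or to paste them into a job script, before anything touches the large files. To execute them, each list could be passed to `subprocess.run(cmd, check=True)`.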

