Skip to content

Commit

Permalink
update of 10x genomic tutorial...
Browse files Browse the repository at this point in the history
  • Loading branch information
lchorbadjiev committed Jun 11, 2020
1 parent db095e8 commit 51043d2
Show file tree
Hide file tree
Showing 2 changed files with 299 additions and 0 deletions.
Binary file added docs/figs/10x_genomics.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
299 changes: 299 additions & 0 deletions docs/tutorial-10x-genomics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,299 @@
# Example usage of `sGAINS` pipeline with 10x Genomics dataset

## Create working directory

Let us start with creating a working directory where to place all data files
we plan to process with `s-GAINS` pipeline:

```
mkdir 10x-genomics
cd 10x-genomics
```

All instructions bellow asume that you are working inside this directory.


## Download example 10x Genomics dataset

In this tutorial we are going to use
[BJ with 10% MKN-45 Spike-In](https://support.10xgenomics.com/single-cell-dna/datasets/1.1.0/bj_mkn45_10pct)
dataset that is available from 10x Genomics site. This dataset contains data from BJ Fibroblast Euploid Cell
Line with a 10% spike-in of cells from MKN-45 Gastric Cancer Cell Line MKN-45 cells.

For processing with `s-GAINS` we are going to need following files:

* [Per-cell summary metrics](http://cf.10xgenomics.com/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_per_cell_summary_metrics.csv)

* [Position-sorted alignments (BAM)](http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam)

* [Position-sorted alignments index](http://cf.10xgenomics.com/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam.bai)

You can download these files either using your browser or using command line utility such as `wget`:

```
mkdir bj_mkn45_10pct
cd bj_mkn45_10pct
wget -c http://cf.10xgenomics.com/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_per_cell_summary_metrics.csv
wget -c http://cf.10xgenomics.com/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam.bai
wget -c http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-dna/1.1.0/bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam
cd -
```

The content of this data directory should look like:

```
bj_mkn45_10pct/
├── bj_mkn45_10pct_per_cell_summary_metrics.csv
├── bj_mkn45_10pct_possorted_bam.bam
└── bj_mkn45_10pct_possorted_bam.bam.bai
```


## Download human reference genome

* We are going to need a copy of human reference genome version `hg19`. To download
it you can go to [UCSC Genome Browser](https://genome.ucsc.edu/), locate the
downloads section and find full data set for *GRCh37/hg19* version of human
reference genome.

* Download archive file `chromFa.tar.gz` and untar it into a separate directory:

```
mkdir hg19_pristine
cd hg19_pristine
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar zxvf chromFa.tar.gz
cd -
```
* Download `cytoBand.txt.gz` for HG19 and store it inside our working subdirectory. This
file will be needed for last step in last step of the pipeline - `scclust`.
```
wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
gunzip cytoBand.txt.gz
mv cytoBand.txt cytoBand-hg19.txt
```
## `s-GAINS` configuration file
Since the pipeline has many parameters you can create a configuration file, that
sets values for most of parameters used by the pipeline.
The configuration file is in YAML format.
Let us create a `s-GAINS` configuration file named `sgains_hisat2_10x_bj_mkn45_10pct.yml`
with following content:
```yaml
aligner:
aligner_name: hisat2
genome:
genome_version: hg19
genome_pristine_dir: hg19_pristine
genome_dir: hg19
genomeindex_prefix: genomeindex
mappable_regions:
mappable_read_length: 100
mappable_dir: hg19_R100
mappable_file: hisat2_hg19_R100_mappable_regions.txt
mappable_aligner_options: ""
bins:
bins_count: 10000
bins_dir: hg19_R100_B10k
bins_file: hisat2_hg19_R100_B10k_bins_boundaries.txt
data_10x:
data_10x_dir: bj_mkn45_10pct
data_10x_prefix: "bj_mkn45_10pct"
data_10x_cell_summary: bj_mkn45_10pct/bj_mkn45_10pct_per_cell_summary_metrics.csv
data_10x_bam: bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam
data_10x_bai: bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam.bai
reads:
reads_dir: results/bj_mkn45_10pct/reads
reads_suffix: ".fastq.gz"
mapping:
mapping_dir: results/bj_mkn45_10pct/mapping
mapping_suffix: ".rmdup.bam"
mapping_aligner_options: ""
varbin:
varbin_dir: results/bj_mkn45_10pct/varbin_10x
varbin_suffix: ".varbin.r100_10k.txt"
scclust:
scclust_case: "bj_mkn45_10pct"
scclust_dir: results/bj_mkn45_10pct/scclust_10x
scclust_cytoband_file: cytoBand-hg19.txt
scclust_nsim: 150
scclust_sharemin: 0.80
scclust_fdrthres: -3
scclust_nshare: 4
scclust_climbtoshare: 5
```

## `s-GAINS` pipeline preparation

### Preparation of genome index

* Run `genomeindex` subcommand to copy and modify `hg19` reference genome from
your `hg19_pristine` copy into working `hg19` subdirectory:

```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml \
genomeindex --genome-pristine-dir hg19_pristine --genome-dir hg19
```

* This step will use the specified aligner to produce genome index of the *hg19*
reference genome. Building this index is computationally intensive process and
could take several hours of CPU time.


### Preparation of uniquely mappable regions

* The next step is to generate uniquely mappable regions, i.e. contiguous regions wherein
all reads of a given length are unique in the genome. To this end you can use
`mappalbe-regions` subcommand of `s-GAINS` pipeline. You need to specify the directory,
where a working copy of reference genome is located and the read length to be used.

* Here is an example of invoking the `mappalbe-regions` subcommand with reads of
length 100:

```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml \
mappable-regions --genome-dir hisat_hg19 \
--mappable-dir hg19_R100 --mappable-read-length 100 \
--mappable-file hisat2_hg19_R100_mappable_regions.txt
````

* This step is computationaly very intensive and could take days in CPU time.
Consider using `--parallel` option of `sgains-tools` command to parallelize the
computation if your computer has a suitable number of cores. For example, on a
workstation with 10 cores you could use 8 cores to compute mappable regions:

```
sgains-tools -p 8 -c sgains_hisat2_10x_bj_mkn45_10pct.yml \
mappable-regions --genome-dir hisat_hg19 \
--mappable-dir hg19_R100 --mappable-read-length 100 \
--mappable-file hisat2_hg19_R100_mappable_regions.txt
````

### Calculation of bin boundaries

In this step we will partition the genome into bins with an expected equal
number of uniquely mappable positions.


* Run `bins` subcommand to calculate bin boundaries. To run the command you need to specify:
* the number of bins you want to calculate
* a directory for storing the bin boundary file
* a directory and file name for mappable regions
* a directory where a working copy of HG19 is located

```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml \
bins \
--mappable-dir hg19_R100 \
--mappable-file hisat2_hg19_R100_mappable_regions.txt \
--genome-dir hg19 \
--bins-count 0000 \
--bins-dir hg19_R100_B10k \
--bins-file hisat2_hg19_R100_B10k_bins_boundaries.txt
```
## Processing data with `s-GAINS` pipeline
### `extract-10x` step
10x Genomics Cell Ranger processing pipeline uses a
[Burrow-Wheeler Aligner](https://github.com/lh3/bwa). At the moment `s-GAINS` does not
support `bwa` aligner, so the recommended way to process 10x-Genomics datasets is to
re-align the dataset using `s-GAINS`.
To this end we first extract fastq reads from
the dataset using `extract-10x` subcommand:
```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml -p 8 \
extract-10x \
--data-10x-dir bj_mkn45_10pct \
--data-10x-prefix bj_mkn45_10pct \
--data-10x-cell-summary bj_mkn45_10pct/bj_mkn45_10pct_per_cell_summary_metrics.csv \
--data-10x-bam bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam \
--data-10x-bai bj_mkn45_10pct/bj_mkn45_10pct_possorted_bam.bam.bai \
--reads-dir results/bj_mkn45_10pct/reads \
--reads-suffix .fastq.gz
```
Alternatively we can rely on our configuration file and use:
```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml -p 8 extract-10x
```
Values for all required parameters will be supplied by the configuration file
`sgains_hisat2_10x_bj_mkn45_10pct.yml`.
### `mapping` step
When we have the reads we can go to the mapping step. By default `sgains-tools` uses unique
mapping of reads, i.e. if the read could be mapped on more than one place on reference genome
it will be discarded.
This step is CPU intensive and could take hours in CPU time.
To start the mapping step you can use:
```
sgains-tools -p 8 -c sgains_hisat2_10x_bj_mkn45_10pct.yml mapping
```
Please note that `-p 8` in the previous command starts 8 processes for
aligning samples on the reference genome. You need to adapt this number depending
on the hardware resources you have in your hands.
### `varbin` step
The `varbin` step counts the number of reads aligned in each uniquely mappable
region of reference genome. For each uniquely mapped read we take the starting
position of the read, find the bin in which it is contained and increment the read
count for this bin.
To start this step you should use:
```
sgains-tools -p 8 -c sgains_hisat2_10x_bj_mkn45_10pct.yml varbin
```
### `scclust` step
This step performs segmentation and clustering of the read counts using `SCclust`
package and prepares the results, that could be visualized using `SCGV` viewer.
To run the step you could use:
```
sgains-tools -c sgains_hisat2_10x_bj_mkn45_10pct.yml scclust
```
This command will read the results from `varbin` step and will store results
in subdirectory as configured in `sgains_hisat2_10x_bj_mkn45_10pct.yml` configuration file.
## Visualization of results
You can visualize results from segmentation and clustering using
[`SCGV` viewer](https://github.com/KrasnitzLab/SCGV).
![10x Genomics Dataset](figs/10x_genomics.png)

0 comments on commit 51043d2

Please sign in to comment.