Skip to content

Latest commit

 

History

History
115 lines (77 loc) · 5.31 KB

README.md

File metadata and controls

115 lines (77 loc) · 5.31 KB

CustomSelection - package for selection of reference genes from RNAseq data

Citation

dos Santos, K.C.G., Desgagné-Penix, I. & Germain, H. Custom selected reference genes outperform pre-defined reference genes in transcriptomic analysis. BMC Genomics 21, 35 (2020). https://doi.org/10.1186/s12864-019-6426-2

dos Santos, K.C.G., Desgagné-Penix, I. & Germain, H. Correction to: Custom selected reference genes outperform pre-defined reference genes in transcriptomic analysis. BMC Genomics 22, 607 (2021). https://doi.org/10.1186/s12864-021-07743-7

This package calculates the Transcripts per Million (TPM) data frame from the read count matrix, calculates the minimum expressin level for a gene to be considered as expressed in each sample and selects as reference genes those with lowest covariance.

It contains three main functions: Counts_to_tpm, DAFS and gene_selection. The function customReferences merges the three main functions and only returns the data frame with the selected genes (in the row names), their average TPM (column "Mean") and the covariance of their TPM (column "Covariance").

If the you want to keep the tpm data frame and the vector generated from DAFS, run the functions in this specific order:

  1. Counts_to_tpm
  2. DAFS
  3. gene_selection

The package only needs the read count matrix and the length of the genes in the read count matrix. We offer for a test a data frame of read counts for three samples (4 replicates per sample) of Arabidopsis plants expressing GFP (control) or one of two candidate effectors from Melampsora larici-populina (Mlp37347-GFP and Mlp124499-GFP). We also offer the gene lengths from Arabidopsis thaliana TAIR10 obtained from Ensembl with biomaRt.

Count matrix example

Deposited in NCBI GEO under accession GSE136038

Counts_to_tpm

Modified from Slowkow gist

DAFS

Obtained from George and Chang (2014)

Installation

Dependencies

CustomSelection needs the R packages "earth" and "mclust".

R

library(devtools)
install_github("KarenGoncalves/CustomSelection")
library("CustomSelection")    

Workflow

Starting with raw reads run:

  1. Quality filtering Trimmomatic
  2. Alignment (HISAT2, TopHat2) or do a De novo assembly (Trinity, Trans-abyss), then align the reads to the assembly
  3. Sort the alignment (samtools sort)
  4. Get the read count for each gene in each sample (proposed workflow in R below)
  5. Run Counts_to_tpm, DAFS and gene_selection OR run customReferences

Getting read counts

  1. Use the package "GenomicFeatures" to construct a transcript database from biomart. Example, for Arabidopsis thaliana:

      library("GenomicFeatures")
    
      ath <- makeTxDbFromBiomart(biomart = "plants_mart",
                                 dataset = "athaliana_eg_gene",
                                 transcript_ids = NULL,
                                 circ_seqs = DEFAULT_CIRC_SEQS,
                                 filter = NULL,
                                 id_prefix = "ensembl_",
                                 host = "plants.ensembl.org",
                                 taxonomyId = 3702,
                                 miRBaseBuild = NA)
    
      tx <- transcriptsBy(ath)
    
  2. Create a table with the names of the samples as they are in the bam and bai files. Then use the package "Rsamtools" to create the bam file list. Example, for three bam files of three replicates of the control sample (CNT-1, CNT-2, CNT-3):

      samples <- c(paste0("CNT-", 1:3))
    
      sampleTable <- data.frame(sampleName = samples, 
                                fileNameBam = paste0(samples, ".bam"), 
                                fileNameBai = paste0(samples, ".bai"))
    
      library("Rsamtools")
      bfl <- BamFileList(file.path(sampleTable$fileNameBam), 
                         file.path(sampleTable$fileNameBai), 
                         yieldSize = 20000)
    
    
  3. Use the package "GenomicAlignments" to count the number of reads aligned to each gene in the transcript database. Then, create a data frame of read counts per sample. Example for paired reads:

      library("GenomicAlignments")
      txMat=list()
    
      for (i in names(bfl)){
         txMat[[i]] <- assays(summarizeOverlaps(
                         features = tx, 
                         reads = bfl[[i]], 
                         mode = Union, 
                         singleEnd = F, 
                         ignore.strand = F, 
                         fragments = T))$counts
                           
         write.csv(x = as.data.frame(txMat[[i]]), 
                   file = paste(gsub(".bam", "" , i), ".csv", sep = ""))
                }
    
      counts <- do.call(cbind, txMat)
    
      colnames(counts) <- samples