Tools for quantifying transposable element (TE) reads from RNA-seq data and manipulating the resulting counts.
Map reads to reference
Create a GTF
- Curated TE GTF files are available from TEToolkit, here: (github page:
- Alternatively, combine transcript and RepeatMasker GTF files, filtering out repeats that lie within annotated transcripts
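
A minimal sketch of the combining step, in Python. It is not the code shipped with this repository. It assumes both inputs are standard 9-column tab-separated GTF lines, and it treats any coordinate overlap with a transcript record on the same chromosome (ignoring strand) as "lying in" a transcript. The function names `parse_gtf_interval` and `combine_gtfs` are hypothetical.

```python
from collections import defaultdict


def parse_gtf_interval(line):
    """Return (chrom, start, end) from one GTF line (1-based, inclusive)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3]), int(fields[4])


def combine_gtfs(transcript_lines, repeat_lines):
    """Keep every transcript record, plus repeats overlapping no transcript record.

    Assumption: overlap is tested on coordinates only, not on strand.
    """
    intervals = defaultdict(list)  # chrom -> [(start, end), ...]
    kept = []
    for line in transcript_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end = parse_gtf_interval(line)
        intervals[chrom].append((start, end))
        kept.append(line)
    for line in repeat_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, start, end = parse_gtf_interval(line)
        # Linear scan for clarity; an interval tree would be faster
        # for genome-scale annotation files.
        overlaps = any(s <= end and start <= e for s, e in intervals[chrom])
        if not overlaps:
            kept.append(line)
    return kept
```

The result is a list of GTF lines that can be written to a single combined annotation file.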
Create a table relating different levels of the TE hierarchy
- Sample code for this (that works for the hg19_rmsk_TE.gtf file) is in
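
For orientation, here is a hedged Python sketch of how such a table could be built. It is not the sample code referred to above. It assumes each TE record carries `transcript_id`, `gene_id`, `family_id` and `class_id` attributes in the ninth GTF column, which is an assumption about the GTF layout; the function name `te_hierarchy_rows` is hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

# Assumed attribute names, ordered from most specific to most general.
LEVELS = ("transcript_id", "gene_id", "family_id", "class_id")


def te_hierarchy_rows(gtf_lines):
    """Return unique (transcript, gene, family, class) tuples in file order."""
    seen = set()
    rows = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        attributes = dict(ATTR_RE.findall(line.rstrip("\n").split("\t")[8]))
        # Missing attributes are recorded as empty strings.
        row = tuple(attributes.get(key, "") for key in LEVELS)
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows
```

Each row links one locus to its gene, family and class, so counts at the locus level can later be summed up to any coarser level.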
Run featureCounts on each sample
- Settings depend on library type (e.g. stranded, unstranded)
- Multimapping reads can be ignored (the default) or kept (-M)
- Note that featureCounts by default discards reads that overlap multiple features (use -O to count them)
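
The settings above could be assembled into a command as in the following Python sketch. The flags used (`-a`, `-o`, `-F`, `-t`, `-g`, `-s`, `-M`, `-O`) are standard featureCounts options; the helper name `featurecounts_cmd`, its defaults, and the file names in the usage note are hypothetical and should be adapted to your library type.

```python
import subprocess


def featurecounts_cmd(bam, gtf, out, strand=0, count_multimapping=False,
                      allow_multi_overlap=False, feature_type="exon",
                      attribute="gene_id"):
    """Build a featureCounts argument list for one sample.

    strand: 0 = unstranded, 1 = stranded, 2 = reversely stranded.
    """
    cmd = [
        "featureCounts",
        "-a", gtf,           # annotation file
        "-F", "GTF",         # annotation format
        "-t", feature_type,  # feature type (third GTF column) to count
        "-g", attribute,     # attribute used to group features
        "-s", str(strand),
        "-o", out,
    ]
    if count_multimapping:
        cmd.append("-M")     # also count multimapping reads
    if allow_multi_overlap:
        cmd.append("-O")     # count reads overlapping several features
    cmd.append(bam)
    return cmd


# To run it (requires featureCounts on PATH):
# subprocess.run(featurecounts_cmd("sample1.bam", "TE.gtf", "sample1.txt"), check=True)
```

Setting `attribute="transcript_id"` would group counts by individual locus if the GTF uses that attribute for loci, which is an assumption about the annotation file.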
Merge featureCounts outputs across samples
- Using
- This generates a file with counts across all samples, plus per-sample metadata rows (e.g. the total number of mapped reads in each sample). Metadata rows are marked by a '_' prefix (e.g. '_assigned' for reads assigned to a feature; see the featureCounts documentation for details).
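
The merge can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the script used in this repository: it assumes one single-sample featureCounts table per sample (a '#' comment line, a header row, feature ID in the first column, count in the last) and the matching `.summary` file with two tab-separated columns. Lower-casing the status names and adding the '_' prefix follows the '_assigned' example above; all function names are hypothetical.

```python
import csv


def read_counts(handle):
    """Parse a single-sample featureCounts table into {feature_id: count}."""
    rows = csv.reader((l for l in handle if not l.startswith("#")), delimiter="\t")
    next(rows)  # skip header: Geneid, Chr, Start, End, Strand, Length, <bam>
    return {row[0]: int(row[-1]) for row in rows if row}


def read_summary(handle):
    """Parse a featureCounts .summary file into {'_status': count}."""
    rows = csv.reader(handle, delimiter="\t")
    next(rows)  # skip header: Status, <bam>
    return {"_" + row[0].lower(): int(row[1]) for row in rows if row}


def merge_samples(per_sample):
    """per_sample: {sample_name: {row_id: count}} -> list of table rows.

    Rows missing from a sample are filled with 0.
    """
    names = list(per_sample)
    row_ids = sorted({rid for counts in per_sample.values() for rid in counts})
    table = [["id"] + names]
    for rid in row_ids:
        table.append([rid] + [per_sample[name].get(rid, 0) for name in names])
    return table
```

For each sample, combine the two dictionaries (for example `{**counts, **summary}`) before passing them to `merge_samples`, then write the resulting rows with `csv.writer`.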
Subselect and sum data across transcript types:
- Using split_and_sum_TE_counts
- Creates a count matrix for each type of transcript:
  - gene_id
  - TEclass_id (most general)
  - TEfamily_id
  - TEgene_id
  - TEtranscript_id (most specific - individual loci)
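
To make the summing step concrete, here is a small Python sketch of the idea for the TE levels only. It does not reproduce split_and_sum_TE_counts, whose interface is not shown here. It assumes counts keyed by TEtranscript_id and a hierarchy lookup keyed the same way; the function name `sum_by_level` and both data layouts are hypothetical.

```python
def sum_by_level(locus_counts, hierarchy, level):
    """Sum per-locus counts up to a coarser TE level.

    locus_counts: {TEtranscript_id: [count per sample]}
    hierarchy:    {TEtranscript_id: {level_name: value}}
    level:        e.g. "TEgene_id", "TEfamily_id" or "TEclass_id"
    """
    totals = {}
    for locus, counts in locus_counts.items():
        key = hierarchy[locus][level]
        previous = totals.get(key, [0] * len(counts))
        # Add this locus to the running per-sample totals for its group.
        totals[key] = [a + b for a, b in zip(previous, counts)]
    return totals
```

Calling it once per level yields one count matrix per level, with the TEtranscript_id matrix being the unsummed input itself.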