Core commmands and custom Perl scripts used in m6A study (bioinformatic analysis)

1. Quality control of raw sequencing reads

1) using cutadapt to trim off low-quality nucleotides (Phred quality lower than 20) and Illumina adaptor sequences at the 3’ end

for i in "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_In_C3" "s_6_In_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" "s_6_IP_C3" "s_6_IP_C8" do qsub -j y -o qsubout/cutadapt.$i.qsubout -b y -cwd -l h_vmem=50G -N $i /clusterdata/apps/cutadapt-1.3/bin/cutadapt -q 20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/Run/${i}_read1.fastq.gz -o /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read1.fastq.gz qsub -j y -o qsubout/cutadapt.$i.qsubout -b y -cwd -l h_vmem=50G -N $i /clusterdata/apps/cutadapt-1.3/bin/cutadapt -q 20 -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/Run/${i}_read2.fastq.gz -o /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read2.fastq.gz done

2) using Bowtie2 to align against a custom contaminant list to filter contaminant reads

for i in "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_In_C3" "s_6_In_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" "s_6_IP_C3" "s_6_IP_C8" do qsub -j y -o qsubout/$i.rRNA.PhiX.out -b y -cwd -pe onehost 10 -l h_vmem=3G -N $i.rRNA bowtie2 --local -q --phred33 -p 10 --no-hd --no-unal -x /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10rRNA_PhiX -1 /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read1.fastq.gz -2 /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read2.fastq.gz -S ${i}.rRNAPhiX.sam done

3) Custom Perl scripts to remove contaminant reads based on bowtie alignment (custom Perl scripts can be downloaded from https://github.com/Qiongyi/m6A/)

for i in "s_6_In_C3" "s_6_In_C8" "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_IP_C3" "s_6_IP_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" do Discard_fastq_PE.pl /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/bowtie2rRNAPhiX/$i.rRNAPhiX.sam /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read1.fastq.gz /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/${i}_read2.fastq.gz /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/filter/${i}_read1.fastq /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/filter/${i}_read2.fastq filter_stats.xls done

2. Alignment of quality filtered reads

1) Using Tophat2 to align the quality filtered paired-end reads to the mouse genome (build mm10)

for i in "s_6_In_C3" "s_6_In_C8" "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_IP_C3" "s_6_IP_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" do qsub -j y -o qsubout/$i.tophat.cutadapt.out -b y -cwd -pe onehost 8 -l h_vmem=5G -N cutadapt.$i.tophat tophat --mate-inner-dist 50 --mate-std-dev 50 --microexon-search --min-anchor 10 -p 8 -o ${i}_tophat_filter -G /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10.gtf --transcriptome-index /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/transcriptome_data/known /clusterdata/hiseq_apps/resources/freeze001/mm10/bowtie2_index/mm10 /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/filter/${i}_read1.fastq /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/fastq/cutadapt_new/filter/${i}_read2.fastq done

2) Using Samtools to keep concordantly aligned reads

for i in "s_6_In_C3" "s_6_In_C8" "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_IP_C3" "s_6_IP_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" do samtools view -h -bu -f2 -q 4 /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/accepted_hits.bam |samtools sort - /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.sorted samtools index /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.sorted.bam done

3) Using Picard to mark duplicates

for i in "s_6_In_C3" "s_6_In_C8" "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_IP_C3" "s_6_IP_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" do qsub -j y -o qsubout/MarkDuplicates.$i -b y -cwd -pe onehost 8 -l h_vmem=5G -N $i java -Xmx5g -XX:ParallelGCThreads=8 -jar /clusterdata/hiseq_apps/bin/freeze001/picard/picard-tools-1.72/MarkDuplicates.jar INPUT=/illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.sorted.bam OUTPUT=/illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.asd.bam METRICS_FILE=/illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.asd.bam.dupl AS=true VALIDATION_STRINGENCY=LENIENT done

4) Using Samtools to keep uniquely aligned reads

for i in "s_6_In_C3" "s_6_In_C8" "s_6_In_C13" "s_6_In_C15" "s_6_In_C18" "s_6_In_C20" "s_6_IP_C3" "s_6_IP_C8" "s_6_IP_C13" "s_6_IP_C15" "s_6_IP_C18" "s_6_IP_C20" do samtools view -b -o /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.asd.rmdup.bam -F 1024 /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.asd.bam samtools index /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/${i}_tophat_filter/Properly_accepted_hits.asd.rmdup.bam

3. Detection of m6A sites (winscore method)

for i in "C3" "C8" "C13" "C15" "C18" "C20" do qsub -j y -o qsubout/$i.winscore.out -b y -cwd -l h_vmem=20G -N $i.winscore RNAMethy_winscore.pl /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10.gtf /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/ReadsNum_for_EachGene_normalized_to_1kb.xls /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/s_6_IP_${i}tophat_filter/Properly_accepted_hits.asd.rmdup.bam /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/tophat2/s_6_In${i}_tophat_filter/Properly_accepted_hits.asd.rmdup.bam ${i}.winscore.xls ${i}.winscore.all.xls done

4. Detection of differentially methylated m6A sites

1) get the unique peak summit for each condition

RNAMethy_GetUniqSummit.pl Naive_SummitPosition.xls Naive_UniqSummitPosition.xls RNAMethy_GetUniqSummit.pl Context_SummitPosition.xls Context_UniqSummitPosition.xls RNAMethy_GetUniqSummit.pl FC_SummitPosition.xls FC_UniqSummitPosition.xls

2) combine the summit position for three conditions

RNAMethy_CombineUniqSummit.pl Naive_UniqSummitPosition.xls Context_UniqSummitPosition.xls FC_UniqSummitPosition.xls Naive_Context_FC_SummitPosition.xls

3) remove redundant positions (select highest peak)

RNAMethy_GetUniqSummit2.pl 150 Naive_Context_FC_SummitPosition.xls Naive_Context_FC_SummitPosition.group.xls Naive_Context_FC_UniqSummitPosition.xls

4) student’s t-test

RNAMethy_Summit_TTest.pl Naive_Context_FC_UniqSummitPosition.xls Naive_Context_FC.ttest.xls

5. Analysis of m6A location (shown is an example for Naive group)

1) for unique summit positions

RNAMethy_AddGenomicLocation_Case.pl /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10.gtf Naive_UniqSummitPosition.xls Naive.final.GenomicLocation.xls

2) for all summit positions

RNAMethy_GenomicLocation.pl /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10.gtf Naive_SummitPosition.xls Naive_SummitPosition_TranscriptSegments.xls Naive_SummitPosition_TranscriptSegments_stat.xls

6. Identification and clustering of enriched motifs (shown is an example for Naive group)

1) to count occurrences of each 4–6-nucleotide k-mer

for i in {4..6} do RNAMethy_MotifFinder1_Kmer.pl $i 25 /clusterdata/hiseq_apps/resources/freeze001/mm10/mm10_gene.fasta /illumina/Data/131212_7001408_0063_BC39MNACXX/TBJW_RNAMETHYL/results/winscore/Naive.final.GenomicLocation.xls ../Naive_control_GenomicLocation.xls m6A_Kmers_${i}bp.xls done

2) to select motifs enriched more than 1.5 fold with the corrected P value < 0.05

for i in {4..6} do awk '$1=="Kmer" || ($4>=$5*1.5 && $3<0.05)' m6A_Kmers_${i}bp.xls > m6A_Kmers_${i}bp_1.5fold.xls done

3) to cluster kmers

RNAMethy_MotifFinder2_Cluster.pl 1 m6A_Kmers_4bp_1.5fold.xls m6A_Kmers_5bp_1.5fold.xls m6A_Kmers_6bp_1.5fold.xls Kmer_cluster.xls

4) to calculate the position-specific scoring matrix for detected motif (shown is an example for the first seed)

RNAMethy_MotifFinder3_PSSM.pl Kmer_sequences_seed1.fa seed1.PSSM.xls

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Core commmands and custom Perl scripts used in m6A study (bioinformatic analysis)

1. Quality control of raw sequencing reads

1) using cutadapt to trim off low-quality nucleotides (Phred quality lower than 20) and Illumina adaptor sequences at the 3’ end

2) using Bowtie2 to align against a custom contaminant list to filter contaminant reads

3) Custom Perl scripts to remove contaminant reads based on bowtie alignment (custom Perl scripts can be downloaded from https://github.com/Qiongyi/m6A/)

2. Alignment of quality filtered reads

1) Using Tophat2 to align the quality filtered paired-end reads to the mouse genome (build mm10)

2) Using Samtools to keep concordantly aligned reads

3) Using Picard to mark duplicates

4) Using Samtools to keep uniquely aligned reads

3. Detection of m6A sites (winscore method)

4. Detection of differentially methylated m6A sites

1) get the unique peak summit for each condition

2) combine the summit position for three conditions

3) remove redundant positions (select highest peak)

4) student’s t-test

5. Analysis of m6A location (shown is an example for Naive group)

1) for unique summit positions

2) for all summit positions

6. Identification and clustering of enriched motifs (shown is an example for Naive group)

1) to count occurrences of each 4–6-nucleotide k-mer

2) to select motifs enriched more than 1.5 fold with the corrected P value < 0.05

3) to cluster kmers

4) to calculate the position-specific scoring matrix for detected motif (shown is an example for the first seed)

Files

README.md

Latest commit

History

README.md

File metadata and controls

Core commmands and custom Perl scripts used in m6A study (bioinformatic analysis)

1. Quality control of raw sequencing reads

1) using cutadapt to trim off low-quality nucleotides (Phred quality lower than 20) and Illumina adaptor sequences at the 3’ end

2) using Bowtie2 to align against a custom contaminant list to filter contaminant reads

3) Custom Perl scripts to remove contaminant reads based on bowtie alignment (custom Perl scripts can be downloaded from https://github.com/Qiongyi/m6A/)

2. Alignment of quality filtered reads

1) Using Tophat2 to align the quality filtered paired-end reads to the mouse genome (build mm10)

2) Using Samtools to keep concordantly aligned reads

3) Using Picard to mark duplicates

4) Using Samtools to keep uniquely aligned reads

3. Detection of m6A sites (winscore method)

4. Detection of differentially methylated m6A sites

1) get the unique peak summit for each condition

2) combine the summit position for three conditions

3) remove redundant positions (select highest peak)

4) student’s t-test

5. Analysis of m6A location (shown is an example for Naive group)

1) for unique summit positions

2) for all summit positions

6. Identification and clustering of enriched motifs (shown is an example for Naive group)

1) to count occurrences of each 4–6-nucleotide k-mer

2) to select motifs enriched more than 1.5 fold with the corrected P value < 0.05

3) to cluster kmers

4) to calculate the position-specific scoring matrix for detected motif (shown is an example for the first seed)