Tutorial for smRNA-Seq analysis

This tutorial summarizes the workflow for quantifying miRNA expression from smRNA-Seq data.

File format for smRNA-Seq

Because the sequenced reads in smRNA-Seq are highly redundant, the data are usually saved in a read-tab-count file, as shown below. The data set used can be downloaded from here.

#SEQUENCE = 
#COUNT = counts
SEQUENCE	COUNT
TAGCTTATCAGACTGATGTTGACTCGTATGCCGTCT	84080
TAGCTTATCAGACTGATGTTGATCGTATGCCGTCTT	76676
TTAACGCGGCCGCTCTACAATAGTGATCGTATGCCG	75453
AGCGTGTAGGGATCCAAATCGTATGCCGTCTTCTGC	67857
AGCGTGTAGGGATCCAAATCGTATGCCGTCTTCTGT	63472
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	36186
TGAGGTAGTAGTTTGTGCTGTTTCGTATGCCGTCTT	35574
TGGCTCAGTTCAGCAGGAACAGTTCGTATGCCGTCT	27660

Convert read-tab-count to a FASTA file for quantifier.pl in miRDeep2

Since quantifier.pl requires a FASTA file in a specific format, one can use collapsemiRNAreads.py, available at my Github, for the conversion.

collapsemiRNAreads.py -i GSM416753.txt -H 3 -s 'MEF' >GSM416753.fa

#Currently, if you have multiple `read-tab-count` files for multiple samples, you have to convert them separately, giving a different `three-letter-symbol` to `-s` to indicate the origin of the reads, and then concatenate the resulting files into one final file using the `cat` command.

FASTA format:

The name of each sequence in the FASTA file must follow the `SSS_INT_xINT` format.
  SSS is a three letter code indicating the sample origin
  INT is just a running number
  xINT is the number of read occurrences
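
The naming rule above can be illustrated with a minimal Python sketch. This is not the code of collapsemiRNAreads.py; the function name `read_tab_count_to_fasta` is hypothetical, and starting the running number at 0 and skipping `#` lines and the `SEQUENCE COUNT` header are assumptions made for the illustration.

```python
def read_tab_count_to_fasta(lines, sample_code):
    """Convert read-tab-count lines to FASTA text with SSS_INT_xINT names.

    Hypothetical helper for illustration only.
    """
    if len(sample_code) != 3:
        raise ValueError("sample code must be exactly three letters")
    records = []
    running = 0  # assumption: the running number starts at 0
    for line in lines:
        fields = line.split()
        # Skip blank lines, '#' comment lines and the SEQUENCE/COUNT header.
        if not fields or fields[0].startswith("#") or fields[0] == "SEQUENCE":
            continue
        seq, count = fields[0], int(fields[1])
        records.append(f">{sample_code}_{running}_x{count}\n{seq}\n")
        running += 1
    return "".join(records)


if __name__ == "__main__":
    example = [
        "SEQUENCE\tCOUNT",
        "TAGCTTATCAGACTGATGTTGACTCGTATGCCGTCT\t84080",
        "TAGCTTATCAGACTGATGTTGATCGTATGCCGTCTT\t76676",
    ]
    print(read_tab_count_to_fasta(example, "MEF"), end="")
```

Each sample gets its own three-letter code, so the records of several samples can be concatenated into one file without name clashes.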

Clip adaptor

Once you have the reads file, check whether the adaptor has been removed. Normally, if all reads have the same length, we can assume that no adaptor removal has been performed. Unfortunately, researchers often do not provide the adaptor sequences. One can recover the adaptor sequence by performing a multiple sequence alignment (MSA) of the first tens to hundreds of reads and selecting the common sequence at the end of the reads as the adaptor. T-Coffee is a great online tool for performing MSA.
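
As a quick cross-check of an adaptor picked from the MSA, one can count how many of the top reads contain each k-mer. Note that this is plain k-mer counting, a different and cruder technique than MSA, and the function name `kmer_read_support` is hypothetical. The sketch below assumes the reads are already loaded as a list of strings.

```python
from collections import Counter


def kmer_read_support(reads, k=11):
    """Count, for every k-mer, how many reads contain it at least once."""
    support = Counter()
    for read in reads:
        # Using a set means each read votes only once per k-mer.
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        support.update(kmers)
    return support


if __name__ == "__main__":
    top_reads = [
        "TAGCTTATCAGACTGATGTTGACTCGTATGCCGTCT",
        "TAGCTTATCAGACTGATGTTGATCGTATGCCGTCTT",
        "TGAGGTAGTAGTTTGTGCTGTTTCGTATGCCGTCTT",
    ]
    # k-mers shared by most reads are adaptor candidates.
    for kmer, n_reads in kmer_read_support(top_reads).most_common(5):
        print(kmer, n_reads)
```

K-mers that come from the miRNA itself can also score highly when one miRNA dominates the top reads, so the candidates still need to be checked by eye against the alignment.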

Once you have the adaptor sequence, you can clip it with fastx_clipper from the FASTX tools.

fastx_clipper -a TCGTATGCCGT -l 17 -v -i GSM416732.fa -o GSM416732.clipped.fa
#If there is an error like 'invalid quality score value', add '-Q 33' as below
fastx_clipper -a TCGTATGCCGT -l 17 -v -i GSM416732.fa -o GSM416732.clipped.fa -Q 33

Here I use a MEGA result as an example. The colorful letters on the left show the MSA result of 25 reads, and the 11 letters in the black box were selected as the adaptor sequence. The adaptor sequence should normally be longer than 10 nucleotides. The text on the right shows the adaptor-clipping result, with corresponding reads connected by thin lines.

MSA and adaptor clipping
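
To make the clipping step concrete, here is a minimal Python sketch of what happens to a single read. It only handles an exact, full-length adaptor match and is not a description of how fastx_clipper works internally. The function name `clip_adaptor` is hypothetical, and `min_len=17` simply mirrors the `-l 17` option used above.

```python
def clip_adaptor(read, adaptor, min_len=17):
    """Cut the read at the first exact adaptor match.

    Returns the clipped read, the unchanged read if no adaptor is found,
    or None if the result is shorter than min_len.
    """
    pos = read.find(adaptor)
    clipped = read if pos == -1 else read[:pos]
    return clipped if len(clipped) >= min_len else None


if __name__ == "__main__":
    read = "TAGCTTATCAGACTGATGTTGACTCGTATGCCGTCT"
    print(clip_adaptor(read, "TCGTATGCCGT"))
```

Reads that consist mostly of adaptor are dropped, because what remains after clipping is shorter than the minimum length.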

Quantify miRNA expression

Use the following command to quantify miRNA expression, then check the result in the file miRNAs_expressed_all_samples_hela.csv.

quantifier.pl -p hsa.hairpin.fa -m hsa.mature.fa -r GSM416732.clipped.fa -t hsa -j -W -y hela

Trim reads and quantify miRNA expression again

The 3’ ends of canonical miRNAs are often subject to untemplated additions, especially the 3’ ends of mirtron-3p species. We therefore sometimes want to trim the 3’ ends of reads one nucleotide at a time and repeat the mapping after each trim.

Here I constructed a workflow to simplify the process. All one needs is the main program quantifier.sh and the programs it depends on, including trimFasta.py and quantifier.modified.pl (a modified version of quantifier.pl).

quantifier.sh -p hsa.hairpin.fa -m hsa.mature.fa -r GSM416732.clipped.fa -t hsa -y hela

The principle is as follows. First, map all reads to the miRNA precursors. Second, save the mapped reads. Third, extract the unmapped reads and trim the last nucleotide from the 3’ end. Fourth, map the trimmed reads again and add those with no more than 20 mapping loci (this number can be changed as needed) to the saved mapped reads. Fifth, repeat the trimming and mapping until all reads are shorter than a given length or the number of cycles exceeds a given limit. Sixth, map all saved mapped reads.
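
The loop described above can be sketched in Python as follows. This is only an illustration of the principle, not the code of quantifier.sh. The function `trim_and_map`, its parameter names and the `count_loci` callback (which stands in for the real mapping step and returns the number of mapping loci of a sequence) are all hypothetical, and discarding trimmed reads that exceed the loci limit is an assumption.

```python
def trim_and_map(reads, count_loci, min_len=17, max_cycles=5, max_loci=20):
    """Iteratively trim the 3' nucleotide of unmapped reads and re-map them.

    Returns a dict: original read -> the (possibly trimmed) sequence that mapped.
    """
    saved = {}
    pending = {read: read for read in reads}  # original -> current sequence
    for cycle in range(max_cycles + 1):       # cycle 0 maps the untrimmed reads
        still_unmapped = {}
        for original, seq in pending.items():
            loci = count_loci(seq)
            if loci == 0:
                trimmed = seq[:-1]            # drop the last 3' nucleotide
                if len(trimmed) >= min_len:
                    still_unmapped[original] = trimmed
            elif cycle == 0 or loci <= max_loci:
                saved[original] = seq
            # Assumption: trimmed reads with too many loci are discarded.
        pending = still_unmapped
        if not pending:
            break
    return saved


if __name__ == "__main__":
    # Toy "mapper": exact substring search in one made-up precursor.
    precursor = "GGG" + "TAGCTTATCAGACTGATGTTGA" + "CTGTT"
    toy_count = lambda seq: precursor.count(seq)
    reads = ["TAGCTTATCAGACTGATGTTGA", "TAGCTTATCAGACTGATGTTGAAA"]
    for original, mapped in trim_and_map(reads, toy_count).items():
        print(original, "->", mapped)
```

The saved sequences would then go through the final mapping and quantification step, which this sketch leaves out.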

CHENTONG
Copyright notice: this is an original article by the author; please credit the source when reposting.