Wu Xiaohui, Ji Guoli, Li Qingshun Quinn
Department of Automation, Xiamen University, 422 South Siming Road, Xiamen, Fujian, 361005, China.
Methods Mol Biol. 2015;1255:39-48. doi: 10.1007/978-1-4939-2175-1_4.
Polyadenylation [poly(A)] is an essential posttranscriptional processing step in the maturation of eukaryotic mRNA. Next-generation sequencing (NGS) technology offers a feasible means of generating large-scale data and opens new opportunities for the intensive study of polyadenylation, particularly through deep sequencing of the transcriptome that targets the junction between the 3'-UTR and the poly(A) tail of the transcript. To take advantage of this unprecedented amount of data, we present an automated workflow that identifies polyadenylation sites by integrating NGS data cleaning, processing, mapping, normalizing, and clustering. In this pipeline, a series of Perl scripts are integrated to iteratively map single- or paired-end sequences to the reference genome. After mapping, the poly(A) tags (PATs) at the same genome coordinate are grouped into one cleavage site, and internal priming artifacts are removed. An ambiguous region is then introduced to parse the genome annotation for cleavage site clustering. Finally, cleavage sites that lie within 24 nucleotides of one another, including sites from different samples, are clustered into poly(A) clusters. This procedure can be used to identify thousands of reliable poly(A) clusters from millions of NGS sequences in different tissues or treatments.
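The last two steps of the workflow (grouping PATs into cleavage sites, then merging nearby sites into poly(A) clusters) can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation. The tuple layout `(chrom, strand, coord)`, the names `group_pats`, `cluster_sites`, and `PolyACluster`, and the reading of "within 24 nucleotides" as the maximum gap between adjacent sites on the same chromosome and strand are all assumptions made for illustration. Internal priming removal and annotation parsing are not covered.

```python
from collections import Counter
from typing import NamedTuple


class PolyACluster(NamedTuple):
    """Hypothetical summary record for one poly(A) cluster."""
    chrom: str
    strand: str
    start: int      # smallest cleavage-site coordinate in the cluster
    end: int        # largest cleavage-site coordinate in the cluster
    tag_count: int  # total number of PATs supporting the cluster
    peak: int       # coordinate of the site with the most PATs


def group_pats(pats):
    """Collapse PATs at the same (chrom, strand, coord) into one cleavage
    site, keeping the number of supporting tags."""
    return Counter(pats)


def _summarize(key, members):
    """Build a PolyACluster from a list of (coord, count) pairs."""
    chrom, strand = key
    peak = max(members, key=lambda m: m[1])[0]
    total = sum(count for _, count in members)
    return PolyACluster(chrom, strand, members[0][0], members[-1][0], total, peak)


def cluster_sites(sites, max_gap=24):
    """Chain cleavage sites on the same chromosome and strand whose distance
    to the previous site is at most max_gap (assumed interpretation)."""
    clusters = []
    current, current_key = [], None
    for (chrom, strand, coord), count in sorted(sites.items()):
        key = (chrom, strand)
        if current and key == current_key and coord - current[-1][0] <= max_gap:
            current.append((coord, count))
        else:
            if current:
                clusters.append(_summarize(current_key, current))
            current, current_key = [(coord, count)], key
    if current:
        clusters.append(_summarize(current_key, current))
    return clusters


# Toy PATs, possibly pooled from several samples (made-up coordinates).
pats = [
    ("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "+", 110),
    ("chr1", "+", 200), ("chr1", "-", 105),
]
clusters = cluster_sites(group_pats(pats))
for c in clusters:
    print(c)
```

In this toy input, the sites at 100 and 110 on the plus strand fall within the 24-nucleotide window and merge into one cluster, while the site at 200 and the minus-strand site each form their own. A real pipeline would also need to carry per-sample tag counts through the clustering step for normalization.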