Shenker Sol, Miura Pedro, Sanfilippo Piero, Lai Eric C
Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA; Tri-Institutional Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York 10065, USA.
Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA.
RNA. 2015 Jan;21(1):14-27. doi: 10.1261/rna.046037.114. Epub 2014 Nov 18.
Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3' UTRs. Existing assembly strategies often fragment long 3' UTRs, and, importantly, none of the algorithms in popular use can apportion data into tandem 3' UTR isoforms, which are frequently generated by alternative cleavage and polyadenylation (APA). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools. To address these limitations, we present a new method for transcript assembly, Isoform Structural Change Model (IsoSCM), which incorporates change-point analysis to improve the 3' UTR annotation process. Through evaluation on simulated and genuine data sets, we demonstrate that IsoSCM annotates 3' termini with higher sensitivity and specificity than can be achieved with existing methods. We highlight the utility of IsoSCM by demonstrating its ability to recover known patterns of tissue-regulated APA. IsoSCM will facilitate future efforts for 3' UTR annotation and genome-wide studies of the breadth, regulation, and roles of APA leveraging RNA-seq data. The IsoSCM software and source code are available from our website https://github.com/shenkers/isoscm.
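To illustrate the general idea of change-point analysis on read coverage, the following is a minimal sketch. It is not the IsoSCM model or its API. It assumes a toy per-base coverage vector along a single hypothetical 3' UTR and fits a two-segment piecewise-constant model by least squares. The function name `best_change_point` and the coverage values are illustrative assumptions, not taken from the paper or the software.

```python
from typing import List, Tuple


def best_change_point(coverage: List[float]) -> Tuple[int, float, float]:
    """Find the single split that best separates coverage into two levels.

    Returns (index, left_mean, right_mean), where index is the first
    position of the right-hand segment. The split minimizes the summed
    squared error of a two-segment constant fit. This is a generic
    least-squares change-point search, not IsoSCM's statistical model.
    """
    n = len(coverage)
    if n < 2:
        raise ValueError("need at least two positions")

    total = float(sum(coverage))
    total_sq = float(sum(x * x for x in coverage))

    best = None
    left_sum = 0.0
    left_sq = 0.0
    for i in range(1, n):
        x = coverage[i - 1]
        left_sum += x
        left_sq += x * x
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Squared error of a constant fit: sum(x^2) - (sum x)^2 / length
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if best is None or sse < best[0]:
            best = (sse, i, left_sum / i, right_sum / (n - i))

    _, index, left_mean, right_mean = best
    return index, left_mean, right_mean


# Hypothetical coverage: high over the region shared by both tandem
# isoforms, lower over the distal extension used only by the long isoform.
toy_coverage = [40, 42, 38, 41, 39, 40, 12, 11, 13, 10, 12, 11]
split, proximal_mean, distal_mean = best_change_point(toy_coverage)
print(split, proximal_mean, distal_mean)
```

In this toy case the drop in coverage marks a candidate proximal cleavage site, and the ratio of the distal to the proximal mean gives a crude estimate of the share of transcripts extending to the distal site. Real data would need handling of noise, multiple change points, and splice junctions, none of which this sketch attempts.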