Department of Cellular and Molecular Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0651, USA, Department of Bioinformatics and Systems Biology, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0651, USA, San Diego Center for Systems Biology, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0375, USA, A.I. Virtanen Institute, Department of Biotechnology and Molecular Medicine, University of Eastern Finland, P.O. Box 1627, 70120 Kuopio, Finland, Institute for Genomic Medicine and Scripps Institution of Oceanography, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0651, USA and Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0651, USA.
Nucleic Acids Res. 2014 Feb;42(4):2433-47. doi: 10.1093/nar/gkt1237. Epub 2013 Dec 4.
Global run-on sequencing (GRO-seq) is a recent addition to the series of high-throughput sequencing methods that enables new insights into transcriptional dynamics within a cell. However, GRO-seq presents new algorithmic challenges, as existing analysis platforms for ChIP-seq and RNA-seq do not address the unique problem of identifying transcriptional units de novo from short reads located all across the genome. Here, we present Vespucci, a novel algorithm for de novo transcript identification from GRO-seq data, along with a system that determines transcript regions, stores them in a relational database and associates them with known reference annotations. We use this method to analyze GRO-seq data from primary mouse macrophages and derive novel quantitative insights into the extent and characteristics of non-coding transcription in mammalian cells. In doing so, we demonstrate that Vespucci expands existing annotations for mRNAs and lincRNAs by defining the primary transcript beyond the polyadenylation site. In addition, Vespucci generates assemblies for un-annotated non-coding RNAs such as those transcribed from enhancer-like elements. Vespucci thereby provides a robust system for defining, storing and analyzing diverse classes of primary RNA transcripts that are of increasing biological interest.
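To make the problem of identifying transcriptional units de novo from short reads more concrete, the following is a minimal sketch and not the algorithm published with this work. It assumes that strand-specific reads lying within a fixed distance of one another belong to the same primary transcript. The names `Read`, `Transcript`, `call_transcripts` and the `max_gap` parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int


class Transcript(NamedTuple):
    chrom: str
    strand: str
    start: int
    end: int
    read_count: int


def call_transcripts(reads: Iterable[Read], max_gap: int = 100) -> List[Transcript]:
    """Merge same-strand reads separated by at most max_gap bases.

    Illustrative assumption: proximity on one strand implies one
    transcriptional unit. max_gap is a hypothetical tuning parameter.
    """
    # Group reads by chromosome and strand so strands are never merged.
    by_key = defaultdict(list)
    for read in reads:
        by_key[(read.chrom, read.strand)].append(read)

    transcripts: List[Transcript] = []
    for (chrom, strand), group in sorted(by_key.items()):
        group.sort(key=lambda r: r.start)
        start, end, count = group[0].start, group[0].end, 1
        for read in group[1:]:
            if read.start - end <= max_gap:
                # Close enough to the current unit: extend it.
                end = max(end, read.end)
                count += 1
            else:
                # Gap too large: close the current unit and start a new one.
                transcripts.append(Transcript(chrom, strand, start, end, count))
                start, end, count = read.start, read.end, 1
        transcripts.append(Transcript(chrom, strand, start, end, count))
    return transcripts
```

Each resulting tuple maps naturally onto one row of a relational table, which is the kind of storage the abstract describes. Comparing the called coordinates against reference annotation intervals would be a separate step that this sketch omits.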