IEEE/ACM Trans Comput Biol Bioinform. 2017 Sep-Oct;14(5):1070-1081. doi: 10.1109/TCBB.2016.2520919. Epub 2016 Jan 26.
We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq) data. GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct readout of bona fide transcription. Most traditional assays, such as RNA-seq, measure steady-state RNA levels, which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, present unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without the need for a priori annotation uncovers surprising new insights into several aspects of the transcription process.
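The abstract names the two components, logistic regression and a hidden Markov model, but does not say how they are combined. The following is a minimal sketch of one plausible arrangement, not the FStitch implementation: a logistic regression trained on a few labeled bins gives a per-bin probability of being transcribed, and a two-state HMM decoded with Viterbi smooths those probabilities into contiguous regions. Everything specific here is an assumption made for illustration: the fixed-size bins, the single `log1p(count)` feature, the fixed `stay` transition probability, the clipping of probabilities, and all function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=20000):
    """Fit logistic regression by batch gradient descent; a bias column is added here."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_on(w, X):
    """Per-bin probability of the 'transcribed' class."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)

def viterbi_two_state(p_on, stay=0.95):
    """Most likely off (0) / on (1) state path given per-bin P(on)."""
    # Clipping keeps a single empty bin from outweighing two state switches
    # (an illustrative choice, not taken from the paper).
    p = np.clip(p_on, 0.05, 0.95)
    log_emit = np.log(np.column_stack([1.0 - p, p]))       # shape (n_bins, 2)
    log_trans = np.log(np.array([[stay, 1.0 - stay],
                                 [1.0 - stay, stay]]))     # [from, to]
    n = len(p)
    score = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans           # cand[i, j]: state i -> j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.zeros(n, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def stitch(path, bin_size):
    """Merge consecutive 'on' bins into (start, end) intervals in base pairs."""
    intervals, start = [], None
    for i, state in enumerate(path):
        if state == 1 and start is None:
            start = i
        elif state == 0 and start is not None:
            intervals.append((start * bin_size, i * bin_size))
            start = None
    if start is not None:
        intervals.append((start * bin_size, len(path) * bin_size))
    return intervals

# Toy training set: read counts in a handful of hand-labeled bins.
train_counts = np.array([0, 0, 1, 0, 8, 10, 12, 9])
train_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = fit_logistic(np.log1p(train_counts).reshape(-1, 1), train_labels)

# Toy track: a transcribed stretch (bins 4-9) with one empty bin inside it.
counts = np.array([0, 1, 0, 0, 9, 12, 0, 10, 8, 11, 0, 0, 1, 0])
p_on = predict_on(w, np.log1p(counts).reshape(-1, 1))
path = viterbi_two_state(p_on)
intervals = stitch(path, bin_size=100)
print(intervals)
```

In this toy track the classifier alone would call the empty bin inside the transcribed stretch "off"; the HMM's preference for staying in the current state is what joins the reads on either side into a single region. A real analysis would need a principled choice of features and transition parameters, which this sketch fixes by hand.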