Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Genome Res. 2024 Oct 11;34(9):1365-1370. doi: 10.1101/gr.279106.124.
Circular RNAs (circRNAs) are a class of RNA molecules that form a closed loop, with their 5' and 3' ends covalently bonded. CircRNAs are known to be more stable than linear RNAs, have distinct properties and functions, and are promising biomarkers. Existing methods for assembling circRNAs rely heavily on annotated transcriptomes and therefore exhibit unsatisfactory accuracy when a high-quality transcriptome is unavailable. We present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE uses the splice graph as its underlying data structure, which organizes the splicing and coverage information. We transform the problem of assembling circRNAs into that of finding paths that "bridge" the three fragments in the splice graph induced by back-spliced reads. We adopt a definition for optimal bridging paths and a dynamic programming algorithm to calculate such optimal paths. TERRACE features an efficient algorithm to detect back-spliced reads missed by RNA-seq aligners, contributing to its much-improved sensitivity. It also incorporates a new machine-learning approach trained to assign a confidence score to each assembled circRNA, which is shown to be superior to using abundance for scoring. On both simulations and biological data sets, TERRACE consistently outperforms existing methods by a large margin in sensitivity while achieving better or comparable precision. In particular, when annotations are not provided, TERRACE assembles 123%-413% more correct circRNAs than state-of-the-art methods. TERRACE represents a significant advance in assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in future research on circRNAs.
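To make the bridging idea concrete, the following is a minimal sketch of a dynamic program over a splice graph. It is not TERRACE's implementation. The abstract does not state which definition of "optimal" is adopted, so the sketch assumes one common choice: the best bridging path is the one whose weakest edge has the highest coverage (a maximum-bottleneck path). The function name `best_bridging_path`, the edge-dictionary layout, and the toy weights are all hypothetical, and vertices are assumed to be numbered in genomic (topological) order.

```python
from math import inf


def best_bridging_path(num_nodes, edges, source, target):
    """Find a maximum-bottleneck path from source to target in a DAG.

    Assumptions (illustrative, not taken from the paper):
      - vertices 0..num_nodes-1 are numbered in topological order,
      - edges maps (u, v) with u < v to a coverage weight,
      - "optimal" means the minimum edge weight on the path is maximized.
    Returns (path, bottleneck), or (None, 0) if target is unreachable.
    """
    # Group incoming edges by their head vertex.
    preds = {v: [] for v in range(num_nodes)}
    for (u, v), weight in edges.items():
        if not u < v:
            raise ValueError("edges must go from a lower to a higher index")
        preds[v].append((u, weight))

    # best[v] is the largest bottleneck over all source -> v paths.
    best = [-inf] * num_nodes
    back = [None] * num_nodes
    best[source] = inf  # the empty path has no limiting edge

    # Process vertices in topological order between source and target.
    for v in range(source + 1, target + 1):
        for u, weight in preds[v]:
            if best[u] == -inf:
                continue  # u is not reachable from source
            score = min(best[u], weight)
            if score > best[v]:
                best[v] = score
                back[v] = u

    if best[target] == -inf:
        return None, 0

    # Walk the back-pointers to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(back[path[-1]])
    path.reverse()
    return path, best[target]


# Toy splice graph: five exon-like vertices with made-up junction coverage.
toy_edges = {(0, 1): 10, (1, 2): 3, (0, 2): 5, (2, 3): 8, (1, 3): 2, (3, 4): 7}
toy_path, toy_bottleneck = best_bridging_path(5, toy_edges, 0, 3)
```

Under these assumptions, bridging the three fragments induced by a back-spliced read would amount to running such a search between the end of one fragment and the start of the next, once per gap, and joining the results.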