Li Xiang, Shao Mingfu
ArXiv. 2023 Mar 27:arXiv:2303.15594v1.
High-throughput short-read RNA-seq protocols often produce paired-end reads, leaving the middle portion of each fragment unsequenced. We explore whether full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is a challenging task owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provide sufficient information for accurate bridging, let alone whether efficient algorithms exist to determine the true bridges. Methods have been proposed to bridge paired-end reads when a reference genome is available (called reference-based bridging), but those algorithms are far from scaling to de novo bridging, because the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest-path tree, for a further speedup, so that the truncated Dijkstra's algorithm need not be run from scratch for every vertex. These techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
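To make the idea of a truncated Dijkstra's algorithm concrete, here is a minimal sketch in Python. The abstract does not spell out how the truncation is defined, so this sketch assumes one common reading: the search from a source vertex is abandoned along any path whose total length exceeds a bound (for instance, a maximum plausible fragment length). The names `truncated_dijkstra`, `path_to`, and `max_dist`, the adjacency-list graph layout, and the edge weights are all hypothetical and are not taken from rnabridge-denovo. The shortest-path-tree reuse mentioned in the abstract is not shown.

```python
import heapq


def truncated_dijkstra(graph, source, max_dist):
    """Shortest paths from `source`, ignoring anything farther than `max_dist`.

    graph: dict mapping a vertex to a list of (neighbor, weight) pairs,
           with non-negative weights (assumed layout, for illustration only).
    Returns (dist, parent); `parent` encodes the truncated shortest-path tree.
    """
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd > max_dist:
                continue  # truncation: never expand beyond the length bound
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def path_to(parent, target):
    """Walk the tree back from `target`; None if it was not reached."""
    if target not in parent:
        return None
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]


if __name__ == "__main__":
    # Toy graph standing in for a tiny cdBG; weights are made-up lengths.
    toy = {
        "a": [("b", 2), ("c", 5)],
        "b": [("c", 1), ("d", 10)],
        "c": [("d", 4)],
        "d": [],
    }
    dist, parent = truncated_dijkstra(toy, "a", max_dist=6)
    print(dist)
    print(path_to(parent, "c"))
```

Because vertices beyond the bound are never pushed onto the heap, the work per source is limited to the neighborhood within `max_dist`, which is the property that makes this style of search attractive on a graph with millions of vertices.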