Zhang Qimin, Shi Qian, Shao Mingfu
Department of Computer Science and Engineering, School of Electrical Engineering and Computer Science, The Pennsylvania State University.
Huck Institutes of the Life Sciences, The Pennsylvania State University.
Nat Comput Sci. 2022 Mar;2(3):148-152. doi: 10.1038/s43588-022-00216-1. Epub 2022 Mar 28.
Modern RNA-sequencing protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to "bridge" multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge, and (3) piping the refined splice graph and the bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on 10 Illumina paired-end RNA-seq samples, Scallop2 substantially improves assembly accuracy compared to two popular assemblers, StringTie2 and Scallop.
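To make step (1) concrete, the following is a minimal sketch of what "bridging" two read ends in a splice graph could look like. It is not Scallop2's implementation: the dictionary-based graph, the function name `bridge`, the toy edge weights, and the choice of the widest (maximum-bottleneck) connecting path as the selection criterion are all assumptions made for illustration. The abstract states only that multi-end reads are bridged into single-end phasing paths.

```python
from functools import lru_cache


def bridge(graph, end1, end2):
    """Join two read-end paths into one phasing path (illustrative only).

    graph: dict mapping vertex -> {successor: edge_weight}; assumed to be a DAG.
    end1, end2: lists of splice-graph vertices covered by each read end.
    Returns one combined vertex path, or None if the ends cannot be connected.
    """
    source, target = end1[-1], end2[0]

    @lru_cache(maxsize=None)
    def best(v):
        # Returns (bottleneck_weight, path from v to target), or None if unreachable.
        if v == target:
            return (float("inf"), (v,))
        result = None
        for w, weight in graph.get(v, {}).items():
            sub = best(w)
            if sub is None:
                continue
            cand = (min(weight, sub[0]), (v,) + sub[1])
            # Assumed criterion: keep the path whose weakest edge is strongest.
            if result is None or cand[0] > result[0]:
                result = cand
        return result

    found = best(source)
    if found is None:
        return None  # this read pair fails to bridge
    return end1[:-1] + list(found[1]) + end2[1:]


# Hypothetical splice graph: vertices are exons, weights are read support.
toy_graph = {
    0: {1: 10},
    1: {2: 8, 3: 3},
    2: {4: 8},
    3: {4: 3},
    4: {5: 10},
}

bridged = bridge(toy_graph, [0, 1], [4, 5])
unbridged = bridge(toy_graph, [2], [3])
print(bridged)    # [0, 1, 2, 4, 5]
print(unbridged)  # None
```

In this toy case the two ends `[0, 1]` and `[4, 5]` can be connected through exon 2 or exon 3, and the sketch picks exon 2 because its weakest edge has more support. The pair that returns `None` corresponds to the reads that "fail to bridge", which the abstract says are used in step (2) to refine the splice graph.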