Department of Computer and Information Sciences, School of Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
Genome Res. 2024 Oct 29;34(10):1624-1635. doi: 10.1101/gr.278659.123.
Mapping transcriptomic variations using either short- or long-read RNA sequencing is a staple of genomic research. Long reads can capture entire isoforms and span repetitive regions, whereas short reads still provide better coverage and lower error rates. Yet open questions remain: how can the technologies be compared quantitatively, can they be combined, and what is the benefit of such a combined view? We tackle these questions by first creating a pipeline that assesses matched long- and short-read data using a variety of transcriptome statistics. We find that across data sets, algorithms, and technologies, matched short-read data detect ∼30% more splice junctions, such that ∼10%-30% of the splice junctions included at ≥20% by short reads are missed by long reads. In contrast, long reads detect many more intron-retention events and can detect full isoforms, pointing to the benefit of combining the technologies. We introduce MAJIQ-L, an extension of the MAJIQ software, to enable a unified view of transcriptome variations from both technologies, and we demonstrate its benefits. Our software can be used to assess any future long-read technology or algorithm and can be combined with short-read data for improved transcriptome analysis.
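The "missed by long reads" statistic quoted above can be illustrated with a minimal sketch. This is not the MAJIQ-L API or the authors' pipeline. The function name `missed_fraction`, the junction tuple layout, and the toy values are all assumptions made for illustration. The sketch takes the short-read junctions whose inclusion level is at least 20% and reports the share of them that is absent from the long-read junction set.

```python
from typing import Dict, Set, Tuple

# Hypothetical junction key: (chromosome, donor position, acceptor position, strand).
Junction = Tuple[str, int, int, str]


def missed_fraction(
    short_inclusion: Dict[Junction, float],
    long_junctions: Set[Junction],
    min_inclusion: float = 0.2,
) -> float:
    """Fraction of short-read junctions with inclusion >= min_inclusion
    that do not appear among the long-read junctions."""
    # Keep only junctions that short reads support at or above the threshold.
    confident = {j for j, inc in short_inclusion.items() if inc >= min_inclusion}
    if not confident:
        return 0.0
    # Junctions seen confidently by short reads but never by long reads.
    missed = confident - long_junctions
    return len(missed) / len(confident)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    short = {
        ("chr1", 100, 200, "+"): 0.90,
        ("chr1", 100, 300, "+"): 0.50,
        ("chr1", 400, 500, "+"): 0.25,
        ("chr1", 600, 700, "+"): 0.05,  # below threshold, ignored
    }
    long_reads = {
        ("chr1", 100, 200, "+"),
        ("chr1", 100, 300, "+"),
        ("chr1", 800, 900, "+"),  # detected by long reads only
    }
    print(missed_fraction(short, long_reads))
```

In this toy case, three short-read junctions pass the 20% threshold and one of them is absent from the long-read set. A real comparison would also need matched samples and a consistent junction coordinate convention across the two technologies.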