Han Seong Woo, Jewell San, Thomas-Tikhonenko Andrei, Barash Yoseph
Department of Computer and Information Sciences, School of Engineering, University of Pennsylvania.
Department of Genetics, Perelman School of Medicine, University of Pennsylvania.
bioRxiv. 2023 Nov 21:2023.11.21.568046. doi: 10.1101/2023.11.21.568046.
Mapping transcriptomic variations using either short- or long-read RNA sequencing is a staple of genomic research. Long reads can capture entire isoforms and overcome repetitive regions, while short reads still provide better coverage and lower error rates. Yet how to quantitatively compare the technologies, whether they can be combined, and what benefit such a combined view may offer remain open questions. We tackle these questions by first creating a pipeline to assess matched long- and short-read data using a variety of transcriptome statistics. We find that across datasets, algorithms, and technologies, matched short-read data detect roughly 50% more splice junctions, and that long reads miss 10-30% of the splice junctions included at 20% or more. In contrast, long reads detect many more intron retention events, pointing to the benefit of combining the technologies. We introduce MAJIQ-L, an extension of the MAJIQ software that enables a unified view of transcriptome variations from both technologies, and demonstrate its benefits. Our software can be used to assess any future long-read technology or algorithm and to combine it with short-read data for improved transcriptome analysis.
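The kind of junction-level comparison summarized above can be illustrated with a minimal sketch. This is not MAJIQ-L's API or the authors' pipeline: the function name, the junction tuple layout, and the toy values are all hypothetical. It assumes that junctions from both technologies are matched by exact coordinates and that "inclusion" is a fraction between 0 and 1 estimated from short reads.

```python
from typing import Dict, Set, Tuple

# Hypothetical junction key: (chromosome, start, end, strand).
Junction = Tuple[str, int, int, str]


def compare_junctions(
    short_inclusion: Dict[Junction, float],
    long_junctions: Set[Junction],
    min_inclusion: float = 0.2,
) -> Dict[str, float]:
    """Summarize the overlap between matched short- and long-read junction calls.

    short_inclusion maps each short-read junction to its inclusion level (0..1).
    long_junctions is the set of junctions detected by long reads.
    """
    short_set = set(short_inclusion)
    # Short-read junctions included at or above the threshold (20% by default).
    well_included = {j for j, inc in short_inclusion.items() if inc >= min_inclusion}
    # Well-included short-read junctions that long reads did not detect.
    missed = well_included - long_junctions
    return {
        "short_to_long_ratio": (
            len(short_set) / len(long_junctions) if long_junctions else float("inf")
        ),
        "missed_fraction": len(missed) / len(well_included) if well_included else 0.0,
        "long_only": float(len(long_junctions - short_set)),
    }


# Toy data, invented purely for illustration.
short_inclusion = {
    ("chr1", 100, 200, "+"): 0.9,
    ("chr1", 100, 300, "+"): 0.5,
    ("chr1", 400, 500, "+"): 0.25,
    ("chr1", 400, 600, "+"): 0.1,
    ("chr2", 50, 150, "-"): 0.6,
    ("chr2", 50, 250, "-"): 0.05,
}
long_junctions = {
    ("chr1", 100, 200, "+"),
    ("chr1", 100, 300, "+"),
    ("chr2", 50, 150, "-"),
    ("chr2", 700, 800, "-"),
}

result = compare_junctions(short_inclusion, long_junctions)
print(result)
```

In this toy input, short reads report six junctions against four from long reads, and one of the four junctions included at 20% or more is absent from the long-read set. Exact coordinate matching is the simplest possible choice; a real comparison would likely need to account for alignment ambiguity near splice sites.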