Jousheghani Zahra Zare, Patro Rob
Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, Maryland, USA.
Department of Computer Science, University of Maryland, College Park, 20742, Maryland, USA.
bioRxiv. 2024 Mar 1:2024.02.28.582591. doi: 10.1101/2024.02.28.582591.
Long read sequencing technology is becoming an increasingly indispensable tool in genomic and transcriptomic analysis. In transcriptomics in particular, long reads offer the possibility of sequencing full-length isoforms, which can vastly simplify the identification of novel transcripts and transcript quantification. However, despite this promise, the focus of much long read method development to date has been on transcript identification, with comparatively little attention paid to quantification. Yet, due to differences in the underlying protocols and technologies, lower throughput (i.e. fewer reads sequenced per sample compared to short read technologies), as well as technical artifacts, long read quantification remains a challenge, motivating the continued development and assessment of quantification methods tailored to this increasingly prevalent type of data.
We introduce oarfish, a new method and software tool for long read transcript quantification. Our model incorporates a novel coverage score, which affects the conditional probability of fragment assignment in the underlying probabilistic model. We demonstrate that by accounting for this coverage information, oarfish produces more accurate quantification estimates than existing long read quantification methods, particularly when one considers the primary isoforms present in a particular cell line or tissue type.
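To illustrate how a per-alignment score can enter the conditional probability of fragment assignment, the following is a minimal sketch of a generic expectation-maximization (EM) loop for transcript abundance estimation. It is not the oarfish implementation: the function name `em_quantify`, the input layout, and the weights are all hypothetical, and the weights simply stand in for whatever coverage-derived probability a model assigns to each read-transcript alignment. The abstract does not specify how the coverage score itself is computed, so that step is left out.

```python
def em_quantify(alignments, n_transcripts, n_iters=500, tol=1e-12):
    """Estimate relative transcript abundances with a weighted EM loop.

    alignments: one entry per read; each entry is a list of
        (transcript_index, weight) pairs. The weight is a hypothetical
        stand-in for a coverage-derived conditional probability.
    Returns a list of abundances that sums to 1.
    """
    # Start from a uniform abundance estimate.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iters):
        counts = [0.0] * n_transcripts
        # E-step: split each read across its candidate transcripts in
        # proportion to (current abundance) * (alignment weight).
        for read in alignments:
            denom = sum(theta[t] * w for t, w in read)
            if denom <= 0.0:
                continue  # read has no usable alignment; skip it
            for t, w in read:
                counts[t] += theta[t] * w / denom
        total = sum(counts)
        if total == 0.0:
            break
        # M-step: abundances are the normalized expected read counts.
        new_theta = [c / total for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    # Two transcripts; every read aligns to both, but the (assumed)
    # coverage-derived weights favor transcript 0.
    reads = [[(0, 0.9), (1, 0.1)] for _ in range(10)]
    print(em_quantify(reads, n_transcripts=2))
```

In this sketch, reads whose weights are equal across transcripts leave the estimate at its uniform starting point, while unequal weights pull the estimate toward the transcript the weights favor. That is the general sense in which a coverage-based score can change the resulting quantification.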
Oarfish is implemented in the Rust programming language, and is made available as free and open-source software under the BSD 3-clause license. The source code is available at https://www.github.com/COMBINE-lab/oarfish.