Evaluation of strategies for evidence-driven genome annotation using long-read RNA-seq.

Author information

Paniagua Alejandro, Agustín-García Cristina, Pardo-Palacios Francisco J, Brown Thomas, De Maria Maite, Denslow Nancy D, Mazzoni Camila J, Conesa Ana

Affiliations

Institute for Integrative Systems Biology, Spanish National Research Council, Paterna 46980, Spain.

Department of Computer Science, Universitat de València, Valencia 46100, Spain.

Publication information

Genome Res. 2025 Apr 14;35(4):1053-1064. doi: 10.1101/gr.279864.124.

DOI:10.1101/gr.279864.124

PMID:39715684

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12047274/

Abstract

While the production of a draft genome has become more accessible due to long-read sequencing, the annotation of these new genomes has not been developed at the same pace. Long-read RNA sequencing offers a promising solution for enhancing gene annotation. In this study, we explore how sequencing platforms, Oxford Nanopore R9.4.1 chemistry or Pacific Biosciences (PacBio) Sequel II CCS, and data processing methods influence evidence-driven genome annotation using long reads. Incorporating PacBio transcripts into our annotation pipeline significantly outperformed traditional methods, such as ab initio predictions and short-read-based annotations. We applied this strategy to a nonmodel species, the Florida manatee, and compared our results to existing short-read-based annotation. At the loci level, both annotations were highly concordant, with 90% agreement. However, at the transcript level, the agreement was only 35%. We identified 4906 novel loci, represented by 5707 isoforms, with 64% of these isoforms matching known sequences in other mammalian species. Overall, our findings underscore the importance of using high-quality curated transcript models in combination with ab initio methods for effective genome annotation.
