Freedman Adam H, Sackton Timothy B
Informatics Group, Faculty of Arts and Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
Genome Res. 2025 May 2;35(5):1261-1276. doi: 10.1101/gr.280377.124.
Recent technological advances in long-read DNA sequencing, accompanied by reductions in cost, have made the production of genome assemblies financially achievable and computationally feasible, such that genome assembly no longer represents the major hurdle to evolutionary analysis for most nonmodel organisms. Now, the more difficult challenge is to properly annotate a draft genome assembly once it has been constructed. The primary challenges in annotation are how to select from the myriad gene prediction tools that are currently available, determine what kinds of data are necessary to generate high-quality annotations, and evaluate the quality of the resulting annotation. To determine which methods perform best and whether the inclusion of RNA-seq data is necessary to obtain a high-quality annotation, we generated annotations with 12 different methods for 21 different species spanning vertebrates, plants, and insects. We found that the annotation transfer method TOGA, BRAKER3, and the RNA-seq assembler StringTie were consistently top performers across a variety of metrics including BUSCO recovery, CDS length, and false-positive rate, with the exception that TOGA performed less well in some monocots with respect to BUSCO recovery. The choice among the top-performing methods will depend upon the feasibility of whole-genome alignment, the availability of RNA-seq data, the importance of capturing noncoding parts of the transcriptome, and, when whole-genome alignment is not feasible, the relative performance in BUSCO recovery between BRAKER3 and StringTie. When whole-genome alignment is not feasible, inclusion of RNA-seq data will lead to substantial improvements to genome annotations.
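The selection criteria listed above can be read as a small decision procedure. The following Python sketch is one possible reading, not the authors' published workflow: the function name, its parameters, and the order in which the criteria are checked are all assumptions made for illustration. In particular, it assumes that TOGA is the natural candidate whenever whole-genome alignment is feasible, and that StringTie is the candidate when noncoding transcripts matter, since it is the RNA-seq assembler among the three.

```python
from typing import List, Optional


def suggest_annotation_methods(
    wga_feasible: bool,
    rnaseq_available: bool,
    noncoding_important: bool,
    braker3_busco: Optional[float] = None,
    stringtie_busco: Optional[float] = None,
) -> List[str]:
    """Return candidate methods among TOGA, BRAKER3, and StringTie.

    Hypothetical helper: the precedence of the checks is an assumption,
    not something the abstract specifies.
    """
    # Assumption: capturing noncoding transcripts calls for the RNA-seq
    # assembler, which requires RNA-seq data to be available.
    if noncoding_important and rnaseq_available:
        return ["StringTie"]

    # Assumption: annotation transfer is the candidate whenever a
    # whole-genome alignment can be produced.
    if wga_feasible:
        return ["TOGA"]

    # Without whole-genome alignment, compare BUSCO recovery between
    # BRAKER3 and StringTie when both scores have been measured.
    if rnaseq_available:
        if braker3_busco is not None and stringtie_busco is not None:
            if braker3_busco > stringtie_busco:
                return ["BRAKER3"]
            if stringtie_busco > braker3_busco:
                return ["StringTie"]
        # Scores missing or tied: both remain candidates.
        return ["BRAKER3", "StringTie"]

    # Neither alignment nor RNA-seq: the abstract names no top performer
    # for this case, so the sketch returns no candidate.
    return []


if __name__ == "__main__":
    # BUSCO scores here are placeholder inputs, not results from the study.
    print(suggest_annotation_methods(False, True, False, 95.0, 90.0))
```

Returning a list rather than a single name keeps the ambiguous cases visible: when BUSCO scores are unavailable or tied, both BRAKER3 and StringTie stay on the table, and an empty list signals that the abstract offers no guidance for that combination of inputs.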