Prieto-Baños Silvia, Nevers Yannis, Altenhoff Adrian, Warwick Vesztrocy Alex, Dessimoz Christophe, Glover Natasha M
Department of Computational Biology, University of Lausanne, Lausanne, 1015, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf365.
In silico gene annotation, the process of identifying the genes present in a genome, remains a challenging task. As the number of genome assemblies rapidly increases, the corresponding gene models and repertoires often fall short in quality. Despite advances in annotation methods, a lack of community standards means that most published gene annotations result from ad hoc pipelines. As a result, only a few species have nearly complete and accurate gene models. This annotation quality is thought to affect downstream analyses, including orthology inference, which is often the first step of comparative genomics studies.
We show that different annotation methods yield markedly distinct orthology inferences. We compared orthology assignments for gene models drawn from four prominent sources of protein-coding gene models: the NCBI Eukaryotic Genome Annotation Pipeline, the Ensembl Gene Annotation System, the UniProt Reference Proteomes, and Augustus 3.4 (an ab initio pipeline). We observe significant discrepancies between sources in three respects: the proportion of orthologous genes per genome, the completeness of Hierarchical Orthologous Groups, and the accuracy and recall of the predicted orthologs on a standard orthology benchmark.
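To make the last comparison concrete, the following is a minimal sketch of how predicted ortholog pairs could be scored against a reference set. It is an illustration only, not the benchmark used in the study: the function names and gene identifiers are hypothetical, and "accuracy" is read here as precision (the fraction of predicted pairs found in the reference), which is an assumption.

```python
def normalize(pairs):
    """Treat ortholog pairs as unordered, so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(predicted, reference):
    """Score predicted ortholog pairs against a reference set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference pairs
    """
    pred = normalize(predicted)
    ref = normalize(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical gene identifiers, for illustration only.
reference = [("hsaA", "mmuA"), ("hsaB", "mmuB"),
             ("hsaC", "mmuC"), ("hsaD", "mmuD")]
predicted = [("mmuA", "hsaA"), ("hsaB", "mmuB"), ("hsaC", "mmuX")]

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, an annotation source that misses genes would tend to lower recall, while one that predicts spurious gene models would tend to lower precision.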