Nevers Yannis, Warwick Vesztrocy Alex, Rossier Victor, Train Clément-Marie, Altenhoff Adrian, Dessimoz Christophe, Glover Natasha M
Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Nat Biotechnol. 2025 Jan;43(1):124-133. doi: 10.1038/s41587-024-02147-w. Epub 2024 Feb 21.
In the era of biodiversity genomics, it is crucial to ensure that annotations of protein-coding gene repertoires are accurate. State-of-the-art tools to assess genome annotations measure the completeness of a gene repertoire but are blind to other errors, such as gene overprediction or contamination. We introduce OMArk, a software package that relies on fast, alignment-free sequence comparisons between a query proteome and precomputed gene families across the tree of life. OMArk assesses not only the completeness but also the consistency of the gene repertoire as a whole relative to closely related species and reports likely contamination events. Analysis of 1,805 UniProt Eukaryotic Reference Proteomes with OMArk demonstrated strong evidence of contamination in 73 proteomes and identified error propagation in avian gene annotation resulting from the use of a fragmented zebra finch proteome as a reference. This study illustrates the importance of comparing and prioritizing proteomes based on their quality measures.