Dunne Michael P, Kelly Steven
Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK.
BMC Genomics. 2017 May 18;18(1):390. doi: 10.1186/s12864-017-3771-x.
Complete and accurate annotation of sequenced genomes is of paramount importance to their utility and analysis. Differences in gene prediction pipelines mean that genome annotations for a species can differ considerably in the quality and quantity of their predicted genes. Furthermore, genes that are present in genome sequences sometimes fail to be detected by computational gene prediction methods. Erroneously unannotated genes can lead to oversights and inaccurate assertions in biological investigations, especially for smaller-scale genome projects, which rely heavily on computational prediction.
Here we present OrthoFiller, a tool designed to address the problem of finding and adding such missing genes to genome annotations. OrthoFiller leverages information from multiple related species to identify those genes whose existence can be verified through comparison with known gene families, but which have not been predicted. By simulating missing gene annotations in real sequence datasets from both plants and fungi we demonstrate the accuracy and utility of OrthoFiller for finding missing genes and improving genome annotations. Furthermore, we show that applying OrthoFiller to existing "complete" genome annotations can identify and correct substantial numbers of erroneously missing genes in these two sets of species.
We show that significant improvements in the completeness of genome annotations can be made by leveraging information from multiple species.
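The simulation-based evaluation described above (hiding known genes, then checking how many are found again) can be sketched as follows. This is a minimal illustration, not OrthoFiller's actual code: the function names `hold_out_genes` and `score_recovery`, the gene identifiers, and the stand-in prediction set are all hypothetical, and a real evaluation would compare genomic coordinates rather than bare identifiers.

```python
import random


def hold_out_genes(gene_ids, fraction, seed=0):
    """Hide a random fraction of genes; return (kept, removed) as sets."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    ids = sorted(gene_ids)             # sort first so the sample is deterministic
    n_removed = round(len(ids) * fraction)
    removed = set(rng.sample(ids, n_removed))
    kept = set(ids) - removed
    return kept, removed


def score_recovery(removed, predicted):
    """Recall and precision of predicted genes against the held-out set.

    Assumption: any prediction outside the held-out set counts as a false
    positive, although in practice it might be a genuinely unannotated gene.
    """
    true_pos = len(removed & predicted)
    recall = true_pos / len(removed) if removed else 0.0
    precision = true_pos / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical annotation of 100 genes, 10% of which are hidden.
genes = {f"gene{i:03d}" for i in range(100)}
kept, removed = hold_out_genes(genes, fraction=0.1)

# Stand-in for a gene-finding step: it recovers all but one hidden gene
# and also reports one spurious gene.
predicted = set(sorted(removed)[:-1]) | {"spurious001"}

recall, precision = score_recovery(removed, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Holding out genes from a trusted annotation provides a known ground truth, which is why recall can be measured directly; precision computed this way is a lower bound, since some "spurious" predictions may be real genes missing from the original annotation.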