Department of Life Sciences and Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK.
Bioinformatics. 2013 Jul 1;29(13):i154-61. doi: 10.1093/bioinformatics/btt236.
Misannotation in sequence databases is an important obstacle for automated gene function annotation tools, which rely extensively on comparison with sequences of known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are therefore needed to assist in identifying misannotated gene products. For enzymatic functions, each functional assignment implies the existence of a reaction within the organism's metabolic network, so a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead-end or disconnected reactions, can therefore be strong indications of misannotation.
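As an illustration of the kind of network problem described above, the following minimal sketch flags dead-end and disconnected reactions in a toy network. The reaction identifiers, metabolite names, and function names are hypothetical, and the definitions used here (a dead-end metabolite is only produced or only consumed; a disconnected reaction lies outside the largest group of reactions linked by shared metabolites) are simplifying assumptions, not the feature set used in the paper.

```python
from collections import defaultdict

# Hypothetical toy network: reaction id -> (substrates, products).
REACTIONS = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"C"}, {"A"}),
    "R4": ({"C"}, {"D"}),  # D is produced but never consumed
    "R5": ({"X"}, {"Y"}),  # shares no metabolite with R1-R4
}


def dead_end_metabolites(reactions):
    """Metabolites that are only produced or only consumed in the network."""
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed |= substrates
        produced |= products
    return produced ^ consumed  # symmetric difference


def dead_end_reactions(reactions):
    """Reactions that touch at least one dead-end metabolite."""
    dead = dead_end_metabolites(reactions)
    return {rid for rid, (s, p) in reactions.items() if (s | p) & dead}


def disconnected_reactions(reactions):
    """Reactions outside the largest component of metabolite-sharing reactions."""
    by_metabolite = defaultdict(set)
    for rid, (s, p) in reactions.items():
        for metabolite in s | p:
            by_metabolite[metabolite].add(rid)

    seen, components = set(), []
    for start in reactions:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over reactions sharing a metabolite
            rid = stack.pop()
            if rid in component:
                continue
            component.add(rid)
            s, p = reactions[rid]
            for metabolite in s | p:
                stack.extend(by_metabolite[metabolite] - component)
        seen |= component
        components.append(component)

    if not components:
        return set()
    return set(reactions) - max(components, key=len)


if __name__ == "__main__":
    print(sorted(dead_end_metabolites(REACTIONS)))
    print(sorted(dead_end_reactions(REACTIONS)))
    print(sorted(disconnected_reactions(REACTIONS)))
```

In a real draft network, ubiquitous cofactors such as water or ATP would link almost every reaction, so a sketch like this would likely need them filtered out before connectivity is assessed.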
We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network, trained on curated sets of correct and incorrect enzyme assignments, achieved an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the enzyme classes present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of ~60% in most cases when validated against recent genome-scale metabolic models. When the model is applied to draft metabolic networks for multiple species, we also observe a clear negative correlation between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes).
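The cross-validation against unseen enzyme superfamilies can be pictured as a grouped split, in which every annotation from one superfamily is held out together so the classifier is never tested on a family it has trained on. The sketch below shows only that splitting step in plain Python; the superfamily labels and function names are hypothetical, and the abstract does not state how the folds were actually constructed.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    `groups` holds one label per sample, e.g. the superfamily of each
    enzyme annotation. All samples of the held-out group go to the test
    set, so no group appears in both sets.
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    for held_out in sorted(by_group):
        test = by_group[held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the curated labels."""
    if not y_true:
        raise ValueError("accuracy is undefined for an empty label list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical superfamily label for each of six annotations.
SUPERFAMILIES = ["enolase", "enolase", "amidohydrolase",
                 "crotonase", "amidohydrolase", "crotonase"]

if __name__ == "__main__":
    for held_out, train, test in leave_one_group_out(SUPERFAMILIES):
        print(held_out, "train:", train, "test:", test)
```

A classifier would be fitted on the `train` indices and scored with `accuracy` on the `test` indices of each split. scikit-learn provides an equivalent splitter, `LeaveOneGroupOut`, that can be combined with its `RandomForestClassifier`.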
Supplementary data are available at Bioinformatics online.