Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA.
BMC Bioinformatics. 2010 Jan 8;11:15. doi: 10.1186/1471-2105-11-15.
A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism.
To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. Using the gold standard dataset, we compared the ML methods to the earlier PathoLogic algorithm for pathway prediction. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. Unlike PathoLogic, the ML-based methods output a probability for each predicted pathway, which provides more information to the user and facilitates filtering of predicted pathways.
ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.
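The accuracy and F-measure reported above, and the filtering that per-pathway probabilities make possible, can be illustrated with a minimal sketch. It assumes the F-measure is the balanced harmonic mean of precision and recall (F1) and that a pathway is called present when its probability meets a cutoff. The `Prediction` record, the `evaluate` function, the 0.5 default threshold, and the toy pathway names are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    pathway: str        # hypothetical pathway identifier
    probability: float  # classifier's estimated probability of presence
    present: bool       # gold-standard label (True = pathway is present)


def evaluate(predictions, threshold=0.5):
    """Return (accuracy, F-measure) when calling a pathway present
    if its probability is at least `threshold` (an assumed cutoff)."""
    tp = fp = tn = fn = 0
    for p in predictions:
        called = p.probability >= threshold
        if called and p.present:
            tp += 1
        elif called and not p.present:
            fp += 1
        elif not called and p.present:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f_measure


# Toy data for illustration only; not drawn from the gold standard dataset.
examples = [
    Prediction("pathway-a", 0.9, True),
    Prediction("pathway-b", 0.8, False),
    Prediction("pathway-c", 0.3, True),
    Prediction("pathway-d", 0.1, False),
]

if __name__ == "__main__":
    for cutoff in (0.5, 0.85):
        acc, f1 = evaluate(examples, threshold=cutoff)
        print(f"threshold={cutoff}: accuracy={acc:.3f}, F-measure={f1:.3f}")
```

Raising the threshold trades recall for precision, which is the kind of filtering that a method without probability output cannot offer.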