Mining protein database using machine learning techniques.

With a large amount of protein-related information accumulating in widely available online databases, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can reduce the space over which experiment designers need to search in order to improve our understanding of biochemical properties. It has previously been suggested that an artificial neural network can integrate features computed by comparing a pair of proteins, and hence predict the degree to which the two proteins are evolutionarily related and homologous.
We compiled two datasets of protein pairs, each pair being characterised by seven distinct features. For the problem of separating remote homologous pairs from analogous pairs, we performed an exhaustive search through all possible combinations of features and note that a significant performance gain was obtained by including sequence and structure information. We find that a linear classifier was sufficient to discriminate protein pairs at the family level. At the superfamily level, however, detecting remote homologous pairs was a harder problem. We find that nonlinear classifiers achieve significantly higher accuracies.
In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins and, from extensive cross-validation and feature-selection studies, quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins.
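
The exhaustive search over feature combinations described above can be illustrated with a small sketch. With seven features there are 2^7 - 1 = 127 non-empty subsets, so scoring every one by cross-validation is cheap. Everything specific in the code is an assumption made for illustration: the abstract does not name the seven features, the three classifiers, or the datasets, so the sketch uses made-up Gaussian data, a nearest-centroid rule as a stand-in linear classifier, and hypothetical function names such as `exhaustive_search`.

```python
import itertools
import random

N_FEATURES = 7  # the abstract states seven features per protein pair


def make_synthetic_pairs(n, seed=0):
    """Generate made-up feature vectors; this is NOT real protein data.

    Only features 0 and 1 carry class signal in this toy set; the rest are noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # toy labels: 1 = "homologous", 0 = "analogous"
        x = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        x[0] += 2.0 * label
        x[1] += 2.0 * label
        data.append((x, label))
    return data


def project(x, subset):
    """Keep only the feature indices listed in `subset`."""
    return [x[i] for i in subset]


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_accuracy(train, test, subset):
    """Fit a nearest-centroid (linear) classifier on `subset`, score it on `test`."""
    cents = {
        c: centroid([project(x, subset) for x, y in train if y == c])
        for c in (0, 1)
    }
    correct = 0
    for x, y in test:
        p = project(x, subset)
        dist = {c: sum((a - b) ** 2 for a, b in zip(p, cents[c])) for c in cents}
        if min(dist, key=dist.get) == y:
            correct += 1
    return correct / len(test)


def cross_validated_accuracy(data, subset, k=5):
    """Mean accuracy over k contiguous folds (the toy data is already shuffled)."""
    fold = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        scores.append(nearest_centroid_accuracy(train, test, subset))
    return sum(scores) / k


def exhaustive_search(data):
    """Score every non-empty feature subset; return (subset, accuracy), best first."""
    results = []
    for size in range(1, N_FEATURES + 1):
        for subset in itertools.combinations(range(N_FEATURES), size):
            results.append((subset, cross_validated_accuracy(data, subset)))
    results.sort(key=lambda r: r[1], reverse=True)
    return results


if __name__ == "__main__":
    ranked = exhaustive_search(make_synthetic_pairs(200))
    print(len(ranked), "subsets evaluated")
    for subset, acc in ranked[:3]:
        print(subset, round(acc, 3))
```

In a setting like the one the abstract describes, the nearest-centroid rule would be swapped for the classifiers under comparison, and the ranked list would show which feature groups (for example, sequence and structure information) contribute most to accuracy.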
