Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining 810001, China.
J Theor Biol. 2012 Dec 21;315:64-70. doi: 10.1016/j.jtbi.2012.09.007. Epub 2012 Sep 18.
The past decades have witnessed extensive efforts to study the relationships among proteins. In particular, sequence-based prediction of protein-protein interactions (PPIs) is fundamentally important for speeding up the mapping of organisms' interactomes. High-throughput experimental methods have made the PPIs of many model organisms known, which allows machine learning methods to learn understandable rules from the available PPIs. Within the machine learning framework, composition vectors are usually used to encode proteins as real-valued vectors. However, composition vector values may be highly correlated with the distribution of amino acids: amino acids that are frequently observed in nature tend to have large composition vector values. A formulation that estimates the noise induced by the background distribution of amino acids may therefore be needed when representing proteins. Here, we introduce two kinds of denoising composition vectors, both of which have been used successfully in the construction of phylogenetic trees, to eliminate this noise. Surprisingly, when these two denoising composition vectors are validated on Escherichia coli (E. coli), Saccharomyces cerevisiae (S. cerevisiae) and human PPI datasets, predictive performance does not improve and is even worse than that of non-denoised prediction. These results suggest that what counts as noise in phylogenetic tree construction may be valuable information in PPI prediction.
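To make the encoding step concrete, the following is a minimal Python sketch of a composition vector and one possible background correction. The abstract does not give the two denoising formulas used in the study, so everything here is an assumption for illustration: the function names (`composition_vector`, `denoised_vector`, `encode_pair`) are hypothetical, the composition is computed over single amino acids only, and the "denoising" is a simple relative deviation from a supplied background frequency rather than the authors' actual formulation.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def denoised_vector(seq, background):
    """Hypothetical denoising: relative deviation from a background frequency.

    background is a list of 20 frequencies in AMINO_ACIDS order. This form is
    an assumption for illustration, not the formula used in the paper.
    """
    comp = composition_vector(seq)
    return [(c - b) / b if b > 0 else 0.0 for c, b in zip(comp, background)]


def encode_pair(seq_a, seq_b, background=None):
    """Encode a protein pair by concatenating the two per-protein vectors."""
    if background is None:
        return composition_vector(seq_a) + composition_vector(seq_b)
    return denoised_vector(seq_a, background) + denoised_vector(seq_b, background)
```

A pair encoded this way yields a 40-dimensional real-valued vector that could be passed to a standard classifier. Comparing the `background=None` encoding against the background-corrected one mirrors, in spirit, the denoised versus non-denoised comparison described in the abstract.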