Heterodimeric protein complex identification by naïve Bayes classifiers.

Affiliation

Institute of Mathematics for Industry, Kyushu University, Fukuoka, Japan.

Publication information

BMC Bioinformatics. 2013 Dec 3;14:347. doi: 10.1186/1471-2105-14-347.

DOI:10.1186/1471-2105-14-347

PMID:24299017

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4219333/

Abstract

BACKGROUND

Protein complexes are basic cellular entities that carry out the functions of their components. In yeast protein complex databases such as CYC2008, heterodimeric complexes are the most common type of known protein complex. Many methods have been proposed that try to predict, all at once, sets of proteins forming protein complexes of arbitrary type, but they often fail to predict heterodimeric complexes.

RESULTS

In this paper, we design several features that characterize heterodimeric protein complexes, based on genomic data sets, and propose a supervised-learning method for predicting heterodimeric protein complexes. The method learns the parameters of these features, which are embedded in a naïve Bayes classifier. The parameter values are obtained by maximum likelihood estimation, and the log-likelihood ratio derived from the resulting classifier serves as the score of a given pair of proteins, used to predict whether or not the pair is a heterodimeric complex. Five-fold cross-validation shows good performance on yeast. The trained classifiers also show higher predictability than various existing algorithms on yeast data sets under both approximate and exact matching criteria.
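The scoring idea described above can be illustrated with a minimal Python sketch. It is not the authors' tool: the two binary features, the toy training pairs, the function names `fit_naive_bayes` and `log_likelihood_ratio`, and the `pseudocount` parameter are all assumptions made for illustration. Setting `pseudocount=0` gives the plain maximum likelihood (relative frequency) estimate; the default of 1.0 is an added smoothing assumption that avoids taking the logarithm of zero on tiny data. The class prior is left out of the score, which is also an assumption.

```python
import math
from collections import Counter


def fit_naive_bayes(samples, labels, n_values, pseudocount=1.0):
    """Estimate P(feature_k = v | class) for discrete features.

    samples     : list of tuples of ints; feature k takes values in range(n_values[k])
    labels      : list of bools (True = the pair is a heterodimeric complex)
    pseudocount : 0 gives the plain maximum likelihood estimate; the default
                  of 1.0 is an illustrative smoothing assumption
    """
    params = {}
    for cls in (True, False):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        tables = []
        for k, n in enumerate(n_values):
            counts = Counter(row[k] for row in rows)
            total = len(rows) + pseudocount * n
            tables.append([(counts[v] + pseudocount) / total for v in range(n)])
        params[cls] = tables
    return params


def log_likelihood_ratio(params, sample):
    """Score a protein pair: sum over features of
    log P(value | complex) - log P(value | not complex).
    Features are treated as conditionally independent (the naive Bayes assumption)."""
    return sum(
        math.log(params[True][k][v]) - math.log(params[False][k][v])
        for k, v in enumerate(sample)
    )


# Hypothetical toy data: two binary features per protein pair.
toy_samples = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
toy_labels = [True, True, True, False, False, False]
toy_params = fit_naive_bayes(toy_samples, toy_labels, n_values=(2, 2))

score_positive_like = log_likelihood_ratio(toy_params, (1, 1))
score_negative_like = log_likelihood_ratio(toy_params, (0, 0))
```

A pair whose score exceeds a chosen threshold would be predicted to be a heterodimeric complex. How that threshold is chosen, and which features are used, is specified in the paper itself and is not reproduced in this sketch.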

CONCLUSIONS

Predicting heterodimeric protein complexes is harder than predicting heteromeric protein complexes, because a heterodimeric complex is topologically simpler. However, designing features specialized for heterodimeric protein complexes improves their predictability. Thus, designing more sophisticated features for heterodimeric protein complexes, together with the accumulation of more accurate and useful genome-wide data sets, will lead to higher predictability of heterodimeric protein complexes. Our tool can be downloaded from http://imi.kyushu-u.ac.jp/~om/.
