Automated protein classification using consensus decision.

Author information

Can Tolga, Camoğlu Orhan, Singh Ambuj K, Wang Yuan-Fang

Affiliation

Department of Computer Science, University of California at Santa Barbara, 93106, USA.

Publication information

Proc IEEE Comput Syst Bioinform Conf. 2004:224-35. doi: 10.1109/csb.2004.1332436.

DOI:10.1109/csb.2004.1332436

PMID:16448016

Abstract

We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. The accuracy comes from combining the decisions of multiple methods through the consensus of a committee (or ensemble) classifier. The technique builds on a result from machine learning: by judiciously combining component classifiers, one can construct an ensemble classifier that outperforms its components. We use two sequence-comparison and three structure-comparison tools as component classifiers. Given a protein structure, we first use the joint hypothesis to determine whether the protein belongs to an existing category (family, superfamily, or fold) in the SCOP hierarchy. For proteins predicted to be members of existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that this significantly improves classification accuracy over the individual component classifiers. In particular, our error rates are 3 to 12 times lower than those of the individual classifiers at the family level, 1.5 to 4.5 times lower at the superfamily level, and 1.1 to 2.4 times lower at the fold level.
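The consensus idea can be illustrated with a minimal sketch. The abstract does not say how the component decisions are combined, so the weighted vote below is an assumption for illustration, not the authors' method. The function name `consensus_label`, the `min_support` threshold, the tool names, and the label strings are all hypothetical. A result of `None` stands in for the case where the committee does not agree strongly enough to place the protein in an existing category.

```python
from collections import defaultdict


def consensus_label(predictions, weights=None, min_support=0.5):
    """Combine component predictions by weighted vote (illustrative only).

    predictions: dict mapping a tool name to its predicted label,
                 or to None if that tool abstains.
    weights:     optional dict mapping a tool name to a non-negative weight;
                 defaults to equal weights (an assumption).
    min_support: hypothetical threshold on the winning share of total weight.

    Returns (label, support). label is None when no label reaches
    min_support, which this sketch treats as "no existing category".
    """
    if weights is None:
        weights = {name: 1.0 for name in predictions}

    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("total weight must be positive")

    # Accumulate the weight behind each candidate label; abstentions
    # add nothing to any label but still count toward the total.
    votes = defaultdict(float)
    for name, label in predictions.items():
        if label is not None:
            votes[label] += weights[name]

    if not votes:
        return None, 0.0

    # Ties are broken in favor of the label encountered first.
    best = max(votes, key=votes.get)
    support = votes[best] / total
    if support < min_support:
        return None, support
    return best, support


# Two sequence-based and three structure-based tools (hypothetical names).
example = {
    "seq_tool_1": "family_A",
    "seq_tool_2": "family_A",
    "struct_tool_1": "family_A",
    "struct_tool_2": "family_B",
    "struct_tool_3": None,
}
label, support = consensus_label(example)
```

In a hierarchical setting, the same vote could be repeated at the family, superfamily, and fold levels, with weights reflecting how reliable each tool is at that level. That weighting scheme is likewise an assumption rather than something stated in the abstract.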
