Yao Zizhen, Ruzzo Walter L
Department of Computer Science and Engineering, AC101 Paul G. Allen Center, University of Washington, Seattle WA 98195, USA.
BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-7-S1-S11.
As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources.
In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, its flexibility in incorporating different data types, and its adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance depends on the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method extends gracefully to multi-way classification problems.
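The idea of learning a similarity metric by regression can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which regression model or voting rule is used, so logistic regression on gene pairs and a simple probability-weighted vote are assumed here. The function names (`fit_pair_weights`, `predict`), the toy similarity matrices, and all parameter values are hypothetical.

```python
import math
from itertools import combinations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_pair_weights(base_sims, labels, epochs=500, lr=0.5):
    """Fit logistic regression on gene pairs (an assumed choice of regression).

    base_sims: list of symmetric matrices, one per base similarity measure.
    The target for a pair is 1 if both genes share a class, else 0.
    Returns (weights, bias) of the combined similarity.
    """
    pairs = list(combinations(range(len(labels)), 2))
    w, b = [0.0] * len(base_sims), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for i, j in pairs:
            x = [s[i][j] for s in base_sims]
            y = 1.0 if labels[i] == labels[j] else 0.0
            err = sigmoid(b + sum(wk * xk for wk, xk in zip(w, x))) - y
            grad_b += err
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
        # Batch gradient step on the mean log-loss.
        b -= lr * grad_b / len(pairs)
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad_w)]
    return w, b


def predict(query_sims, labels, w, b, k=4):
    """query_sims[m][j]: base measure m between the query and training gene j.

    Returns the winning class and its share of the probability-weighted vote
    (a stand-in for the confidence score, not the paper's voting scheme).
    """
    p = [sigmoid(b + sum(wk * s[j] for wk, s in zip(w, query_sims)))
         for j in range(len(labels))]
    neighbors = sorted(range(len(labels)), key=lambda j: p[j], reverse=True)[:k]
    total = sum(p[j] for j in neighbors)
    scores = {}
    for j in neighbors:
        scores[labels[j]] = scores.get(labels[j], 0.0) + p[j] / total
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data (hypothetical): six genes, two classes, two base similarity measures.
labels = ["A", "A", "A", "B", "B", "B"]
n = len(labels)
# Measure 1 tracks class membership; measure 2 is unrelated to it.
informative = [[0.9 if labels[i] == labels[j] else 0.1 for j in range(n)]
               for i in range(n)]
uninformative = [[float((i + j) % 3 - 1) for j in range(n)] for i in range(n)]

w, b = fit_pair_weights([informative, uninformative], labels)
query = [[0.8, 0.8, 0.8, 0.2, 0.2, 0.2],   # measure 1: query resembles class A
         [0.3, -0.2, 0.1, 0.4, -0.5, 0.0]]  # measure 2: arbitrary values
predicted, confidence = predict(query, labels, w, b)
print(w, predicted, round(confidence, 3))
```

On this toy data the second measure is balanced within both same-class and different-class pairs by construction, so its learned weight stays near zero and the neighbor ranking is driven by the first measure. Real base measures would rarely separate this cleanly.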
We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly.
Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive, and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.