Ben-Dor A, Shamir R, Yakhini Z
Department of Computer Science and Engineering, University of Washington, Seattle 98105, USA.
J Comput Biol. 1999 Fall-Winter;6(3-4):281-97. doi: 10.1089/106652799318274.
Recent advances in biotechnology allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. Analysis of data produced by such experiments offers potential insight into gene function and regulatory mechanisms. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. The corresponding algorithmic problem is to cluster multicondition gene expression patterns. In this paper we describe a novel clustering algorithm that was developed for analysis of gene expression data. We define an appropriate stochastic error model on the input, and prove that under the conditions of the model, the algorithm recovers the cluster structure with high probability. The running time of the algorithm on an n-gene dataset is O[n2[log(n)]c]. We also present a practical heuristic based on the same algorithmic ideas. The heuristic was implemented and its performance is demonstrated on simulated data and on real gene expression data, with very promising results.
生物技术的最新进展使研究人员能够在不同条件下并随时间同时测量数千个基因的表达水平。对此类实验产生的数据进行分析,有望深入了解基因功能和调控机制。基因表达数据分析中的一个关键步骤是检测表现出相似表达模式的基因群。相应的算法问题是对多条件基因表达模式进行聚类。在本文中,我们描述了一种为分析基因表达数据而开发的新型聚类算法。我们在输入上定义了一个合适的随机误差模型,并证明在该模型的条件下,该算法以高概率恢复聚类结构。该算法在n基因数据集上的运行时间为O[n2[log(n)]c]。我们还基于相同的算法思想提出了一种实用的启发式方法。该启发式方法已实现,并在模拟数据和真实基因表达数据上展示了其性能,结果非常有前景。