Tjaden Brian
Computer Science Department, Wellesley College, Wellesley, MA 02481, USA.
BMC Bioinformatics. 2006 Jan 12;7:17. doi: 10.1186/1471-2105-7-17.
Clustering of gene expression patterns is a well-studied technique for elucidating trends across large numbers of transcripts and for identifying likely co-regulated genes. Even the best clustering methods, however, are unlikely to provide meaningful results if too much of the data is unreliable. As microarray technology has matured, a wealth of research on the statistical analysis of gene expression data has encouraged researchers to account for error and uncertainty in their microarray experiments, and experiments are increasingly performed with repeat spots per gene per chip and with repeat experiments. One remaining challenge is to incorporate this measurement error information into downstream analyses of gene expression data, such as traditional clustering techniques.
In this study, a clustering approach is presented which incorporates both gene expression values and error information about the expression measurements. Using repeat expression measurements, the error of each gene expression measurement in each experimental condition is estimated, and this measurement error information is incorporated directly into the clustering algorithm. The algorithm, CORE (Clustering Of Repeat Expression data), is presented and its performance is validated using statistical measures. By using error information about gene expression measurements, the clustering approach is less sensitive to noise in the underlying data and is able to achieve more accurate clusterings. Results are described for both synthetic expression data and real gene expression data from Escherichia coli and Saccharomyces cerevisiae.
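The general idea of estimating a per-measurement error from replicates and feeding it into a clustering step can be illustrated with a minimal sketch. The abstract does not specify CORE's distance function or update rule, so the code below is not the CORE algorithm. It is a generic inverse-variance-weighted k-means written under stated assumptions: the error of each measurement is taken to be the standard error of the mean across replicates, and the function names (`summarize_replicates`, `error_weighted_distance`, `error_weighted_kmeans`), the `eps` smoothing term, and the farthest-point initialization are all hypothetical choices made for illustration.

```python
import numpy as np


def summarize_replicates(reps):
    """Collapse replicates into a mean and a standard error per measurement.

    reps: array of shape (genes, conditions, replicates).
    Assumption: the standard error of the mean stands in for "measurement error".
    """
    reps = np.asarray(reps, dtype=float)
    n_reps = reps.shape[2]
    mean = reps.mean(axis=2)
    sem = reps.std(axis=2, ddof=1) / np.sqrt(n_reps)
    return mean, sem


def error_weighted_distance(profile, sem, centroid, eps=1e-6):
    """Squared distance in which noisy conditions count for less.

    Each condition is weighted by 1 / (sem^2 + eps); eps is a hypothetical
    smoothing term that avoids division by zero for identical replicates.
    """
    weights = 1.0 / (np.asarray(sem, dtype=float) ** 2 + eps)
    diff2 = (np.asarray(profile, dtype=float) - np.asarray(centroid, dtype=float)) ** 2
    return float(np.sum(weights * diff2) / np.sum(weights))


def error_weighted_kmeans(mean, sem, k, n_iter=50, eps=1e-6):
    """Illustrative k-means variant using inverse-variance weights (not CORE)."""
    mean = np.asarray(mean, dtype=float)
    weights = 1.0 / (np.asarray(sem, dtype=float) ** 2 + eps)  # (genes, conditions)

    # Deterministic farthest-point initialization (an assumption of this sketch).
    chosen = [mean[0]]
    for _ in range(1, k):
        d = np.min([((mean - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(mean[d.argmax()])
    centroids = np.array(chosen, dtype=float)

    labels = np.zeros(len(mean), dtype=int)
    for _ in range(n_iter):
        # Weighted squared distance from every gene to every centroid: (genes, k).
        diff2 = (mean[:, None, :] - centroids[None, :, :]) ** 2
        dist = (weights[:, None, :] * diff2).sum(axis=2) / weights.sum(axis=1)[:, None]
        labels = dist.argmin(axis=1)

        # Update each centroid as the inverse-variance weighted mean of its members.
        new_centroids = centroids.copy()
        for c in range(k):
            members = labels == c
            if members.any():
                w = weights[members]
                new_centroids[c] = (w * mean[members]).sum(axis=0) / w.sum(axis=0)

        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In this sketch, a gene whose replicates disagree strongly in one condition contributes little from that condition to both the cluster assignment and the centroid update, which is one simple way to make a clustering less sensitive to noisy measurements.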
The additional information provided by replicate gene expression measurements is a valuable asset in effective clustering. Gene expression profiles with high errors, as determined from repeat measurements, may be unreliable and may associate with different clusters, whereas gene expression profiles with low errors can be clustered with higher specificity. Results indicate that including error information from repeat gene expression measurements can lead to significant improvements in clustering accuracy.