Matsumoto Shinya, Aisaki Ken-ichi, Kanno Jun
Teradata Division, NCR Japan, Ltd. 2-4-1 Shiba-koen, Tokyo 105-0011, Japan.
Genome Inform. 2005;16(2):183-94.
The availability of whole-genome sequence data and high-throughput techniques such as DNA microarrays enables researchers to monitor changes in gene expression in a given organ or tissue comprehensively. A single measurement can cover more than 30,000 genes, which makes clustering methods essential for analysis. Biologists usually design experimental protocols so that statistical significance can be evaluated; often, they conduct experiments in triplicate to obtain a mean and standard deviation. Existing clustering methods usually use these mean or median values rather than the original data, and account for significance by omitting data with large standard deviations, which discards potentially useful information. We propose a clustering method that treats each triplicate data set as a probability distribution function instead of pooling the data points into a median or mean. This method permits truly unsupervised clustering of DNA microarray data.
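The contrast between pooling replicates and keeping them as a distribution can be illustrated with a minimal sketch. The abstract does not state which distribution family or distance measure the method uses, so everything below is an assumption made for illustration: each triplicate is summarized as a Gaussian, distributions are compared with the Bhattacharyya distance, and the names `fit_gaussian`, `gene_distance`, `mean_distance` and `VAR_FLOOR` are hypothetical.

```python
import math
from statistics import mean, variance

# Assumed floor so that identical replicates do not produce a zero variance.
VAR_FLOOR = 1e-6


def fit_gaussian(replicates):
    """Summarize replicate measurements as (mean, variance) of a Gaussian."""
    return mean(replicates), max(variance(replicates), VAR_FLOOR)


def bhattacharyya(g1, g2):
    """Bhattacharyya distance between two univariate Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    spread_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + spread_term


def gene_distance(gene_a, gene_b):
    """Distribution-based distance: sum of per-condition distances.

    Each gene is a list of conditions; each condition is a list of replicates.
    """
    return sum(
        bhattacharyya(fit_gaussian(a), fit_gaussian(b))
        for a, b in zip(gene_a, gene_b)
    )


def mean_distance(gene_a, gene_b):
    """Baseline: Euclidean distance between per-condition means only."""
    return math.sqrt(sum((mean(a) - mean(b)) ** 2 for a, b in zip(gene_a, gene_b)))


# Two made-up genes with the same means but very different replicate spread.
tight = [[5.0, 5.1, 4.9], [8.0, 8.1, 7.9]]
noisy = [[5.0, 7.0, 3.0], [8.0, 10.0, 6.0]]

print("mean-based distance:        ", mean_distance(tight, noisy))
print("distribution-based distance:", gene_distance(tight, noisy))
```

In this toy data the mean-based distance cannot separate the two genes, whereas the distribution-based distance can, without discarding the noisy gene. A pairwise matrix built from such a distance could then be passed to any standard clustering routine.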