Mass distributed clustering: a new algorithm for repeated measurements in gene expression data.

Author information

Matsumoto Shinya, Aisaki Ken-ichi, Kanno Jun

Affiliation

Teradata Division, NCR Japan, Ltd. 2-4-1 Shiba-koen, Tokyo 105-0011, Japan.

Publication information

Genome Inform. 2005;16(2):183-94.

Abstract

The availability of whole-genome sequence data and high-throughput techniques such as DNA microarrays enables researchers to monitor changes in gene expression in a given organ or tissue in a comprehensive manner. A single measurement can cover more than 30,000 genes, making data clustering methods essential for analysis. Biologists usually design experimental protocols so that statistical significance can be evaluated; often, they conduct experiments in triplicate to generate a mean and standard deviation. Existing clustering methods usually use these mean or median values rather than the original data, and take significance into account by omitting data showing large standard deviations, which eliminates potentially useful information. We propose a clustering method that treats each set of triplicate data as a probability distribution function instead of pooling data points into a median or mean. This method permits truly unsupervised clustering of the data from DNA microarrays.
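The idea of clustering on distributions rather than pooled means can be illustrated with a small sketch. The abstract does not state which distribution model, distance, or clustering procedure the authors use, so everything here is an assumption chosen for illustration: each triplicate is summarized as a normal distribution, genes are compared with the Bhattacharyya distance summed over conditions, and grouping uses naive average-linkage agglomeration. The names `fit_normal`, `gene_distance`, `cluster`, `SD_FLOOR`, and the toy data are all hypothetical.

```python
import math
from statistics import mean, stdev

SD_FLOOR = 1e-3  # assumed floor so identical replicates do not give sd == 0


def fit_normal(replicates):
    """Summarize one set of replicates as (mean, sd) of a normal distribution."""
    return mean(replicates), max(stdev(replicates), SD_FLOOR)


def bhattacharyya(p, q):
    """Bhattacharyya distance between two univariate normals given as (mean, sd)."""
    (m1, s1), (m2, s2) = p, q
    var_sum = s1 * s1 + s2 * s2
    return 0.25 * (m1 - m2) ** 2 / var_sum + 0.5 * math.log(var_sum / (2 * s1 * s2))


def gene_distance(gene_a, gene_b):
    """Sum the per-condition distribution distances between two genes."""
    return sum(bhattacharyya(fit_normal(a), fit_normal(b))
               for a, b in zip(gene_a, gene_b))


def cluster(genes, n_clusters):
    """Naive average-linkage agglomerative clustering on gene_distance."""
    names = list(genes)
    dist = {(a, b): gene_distance(genes[a], genes[b]) for a in names for b in names}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Hypothetical toy data: gene -> list of conditions, each with three replicates.
genes = {
    "geneA": [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8]],
    "geneB": [[1.05, 0.95, 1.0], [5.1, 4.9, 5.0]],
    "geneC": [[1.0, 1.1, 0.9], [1.0, 0.9, 1.1]],
    "geneD": [[0.2, 1.0, 1.8], [4.0, 5.0, 6.0]],  # same means as geneA, noisy replicates
}
result = cluster(genes, 3)
print(result)
```

In this toy data, geneD has the same per-condition means as geneA, so a mean-based distance would treat the two as identical. Under the distribution-based distance assumed here, geneD's wide replicate spread places it farther from geneA than geneB is, and no gene has to be discarded for having a large standard deviation.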
