Tang Chun, Zhang Aidong, Ramanathan Murali
Department of Computer Science and Engineering, State University of New York at Buffalo, NY 14260, USA.
Bioinformatics. 2004 Apr 12;20(6):829-38. doi: 10.1093/bioinformatics/btg486. Epub 2004 Jan 29.
DNA arrays permit rapid, large-scale screening for patterns of gene expression and simultaneously yield the expression levels of thousands of genes for each sample. The number of samples is usually limited, so such datasets are very sparse in the high-dimensional gene space. Furthermore, most of the genes collected are not necessarily of interest, and uncertainty about which genes are relevant makes it difficult to construct an informative gene space. Unsupervised discovery of empirical sample patterns and identification of informative genes in such sparse, high-dimensional datasets present interesting but challenging problems.
A new model called empirical sample pattern detection (ESPD) is proposed to delineate pattern quality with informative genes. By integrating statistical metrics, data mining and machine learning techniques, this model dynamically measures and manipulates the relationship between samples and genes while iteratively detecting the informative gene space and the empirical sample pattern. The performance of the proposed method on various array datasets is illustrated.
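The abstract does not specify the statistics or the update rules that ESPD uses, so the following is only a generic sketch of the alternating idea it describes: group the samples using the currently selected genes, re-score every gene against that grouping, and repeat until the gene set stops changing. The function names (`two_means`, `gene_scores`, `alternate_select`), the two-group assumption, the plain 2-means clustering, and the separation score are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Split the rows of X (samples) into two groups with plain 2-means."""
    # Deterministic start: the first sample and the sample farthest from it.
    first = X[0]
    far = X[np.argmax(((X - first) ** 2).sum(axis=1))]
    centers = np.stack([first, far]).astype(float)
    labels = None
    for _ in range(n_iter):
        # Squared distance from every sample to each of the two centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(2):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels

def gene_scores(X, labels, eps=1e-9):
    """Score each gene (column) by how well it separates the two groups."""
    a, b = X[labels == 0], X[labels == 1]
    if len(a) == 0 or len(b) == 0:
        return np.zeros(X.shape[1])
    # Assumed score: gap between group means relative to within-group spread.
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0))
    return gap / (a.std(axis=0) + b.std(axis=0) + eps)

def alternate_select(X, n_genes, max_rounds=20):
    """Alternate between grouping samples and selecting genes."""
    genes = np.arange(X.shape[1])  # start from the full gene space
    for _ in range(max_rounds):
        labels = two_means(X[:, genes])
        top = np.sort(np.argsort(gene_scores(X, labels))[-n_genes:])
        if np.array_equal(top, genes):
            break  # gene set is stable
        genes = top
    return genes, two_means(X[:, genes])
```

Here `X` is assumed to be a samples-by-genes expression matrix, and `alternate_select(X, n_genes=10)` returns the indices of the selected genes together with a group label for each sample. Fixing the number of groups and the number of genes in advance is a simplification made for brevity.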