Microarray data mining using landmark gene-guided clustering.

Author information

Chopra Pankaj, Kang Jaewoo, Yang Jiong, Cho HyungJun, Kim Heenam Stanley, Lee Min-Goo

Affiliation

Dept. of Computer Science and Engineering, Korea University, Seoul, Korea.

Publication information

BMC Bioinformatics. 2008 Feb 11;9:92. doi: 10.1186/1471-2105-9-92.

DOI:10.1186/1471-2105-9-92

PMID:18267003

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2262871/

Abstract

BACKGROUND

Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only a single set of clusters, regardless of the biological context of the analysis. A single clustering is often inadequate for exploring data from different biological perspectives and gaining new insights. We propose a new clustering model that can generate multiple alternative sets of clusters from a single dataset, each of which highlights a different aspect of the dataset.

RESULTS

By applying our SigCalc algorithm to three yeast (Saccharomyces cerevisiae) datasets, we show two results. First, different sets of clusters can be generated from the same dataset by using different sets of landmark genes. Each set of clusters groups genes differently and reveals new biological associations between genes that were not apparent from clustering the original microarray expression data. Second, many of these newly found biological associations are common across datasets. These results also provide strong evidence of a link between the choice of landmark genes and the new biological associations found in the gene clusters.
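One simple way to quantify how differently two sets of clusters group the same genes is a pair-counting agreement score. The sketch below uses the Rand index purely as an illustration; the abstract does not say how the authors compared clusterings, and the function name `rand_index`, the gene names, and the cluster labels are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree.

    A pair "agrees" if both clusterings put the two genes together,
    or both put them apart. 1.0 means identical groupings.
    """
    genes = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes present in both clusterings")
    agreeing = sum(
        (labels_a[g] == labels_a[h]) == (labels_b[g] == labels_b[h])
        for g, h in pairs
    )
    return agreeing / len(pairs)

# Hypothetical cluster labels for the same four genes, obtained with
# two different (assumed) landmark gene sets.
with_landmarks_1 = {"g1": 0, "g2": 0, "g3": 1, "g4": 0}
with_landmarks_2 = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}

score = rand_index(with_landmarks_1, with_landmarks_2)
print(f"Rand index between the two clusterings: {score:.3f}")
```

A score well below 1 indicates that the two landmark sets lead to substantially different groupings of the same genes.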

CONCLUSION

We have used the SigCalc algorithm to project the microarray data onto a completely new subspace whose coordinates are genes, called landmark genes, that are known to belong to a given biological process. The projected space is not a true vector space in the mathematical sense; we use the term subspace to refer to any one of the virtually unlimited number of projected spaces that our method can produce. By changing the biological process, and thus the landmark genes, we change this subspace. We have shown that clustering on this subspace reveals new, biologically meaningful clusters that were not evident in the clusters generated by conventional methods. The R scripts (source code) are freely available under the GPL. They are provided as additional material [see Additional File 1], and the latest version can be obtained at http://www4.ncsu.edu/~pchopra/landmarks.html. The code is under active development to incorporate new clustering methods and analyses.
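The idea of projecting genes onto landmark-gene coordinates and then clustering can be illustrated with a minimal Python sketch. The abstract does not spell out how SigCalc computes a gene's coordinates, so the sketch assumes that each coordinate is the Pearson correlation between the gene's expression profile and one landmark gene. The k-means step, the function names (`project`, `kmeans`), and the toy gene names and expression values are all hypothetical and are not taken from the authors' R scripts.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (norm_x * norm_y)

def project(expression, landmarks, genes):
    """ASSUMPTION: a gene's coordinates are its correlations with the landmarks."""
    return {g: [pearson(expression[g], expression[lm]) for lm in landmarks]
            for g in genes}

def kmeans(points, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-first seeding (illustrative)."""
    names = list(points)
    centroids = [list(points[names[0]])]
    while len(centroids) < k:
        far = max(names,
                  key=lambda n: min(math.dist(points[n], c) for c in centroids))
        centroids.append(list(points[far]))
    labels = {}
    for _ in range(iterations):
        # Assignment step: attach each gene to its nearest centroid.
        for n in names:
            labels[n] = min(range(k),
                            key=lambda c: math.dist(points[n], centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[n] for n in names if labels[n] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy expression profiles over six conditions (hypothetical values).
expression = {
    "lmA1": [1, 2, 3, 4, 5, 6], "lmA2": [3, 5, 7, 9, 11, 13],  # landmark set A
    "lmB1": [1, 3, 1, 3, 1, 3], "lmB2": [2, 6, 2, 6, 2, 6],    # landmark set B
    "g1": [2, 3, 4, 5, 6, 7.5],
    "g2": [1, 2, 2.5, 4, 5, 7],
    "g3": [6, 7, 3, 4, 1, 2],
    "g4": [2, 4, 2, 4, 2, 4.5],
}
genes = ["g1", "g2", "g3", "g4"]

# Same data, two landmark sets, two different projected spaces.
clusters_a = kmeans(project(expression, ["lmA1", "lmA2"], genes), k=2)
clusters_b = kmeans(project(expression, ["lmB1", "lmB2"], genes), k=2)
print("landmark set A:", clusters_a)
print("landmark set B:", clusters_b)
```

In this toy data, g4 clusters with g1 and g2 under landmark set A, whereas g3 does under landmark set B. This mirrors, on a very small scale, the idea that changing the landmark genes changes the subspace and therefore the clusters.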
