Biomolecular Science and Engineering Program, University of California, Santa Barbara, CA 93106, USA.
BMC Bioinformatics. 2010 Mar 4;11:117. doi: 10.1186/1471-2105-11-117.
Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry.
We integrated strategies from machine learning, cartography, and graph theory into a new informatics method for automatically clustering self-organizing map ensembles of high-dimensional data. Our new method, called AutoSOME, readily identifies discrete and fuzzy data clusters without prior knowledge of cluster number or structure in diverse datasets including whole genome microarray data. Visualization of AutoSOME output using network diagrams and differential heat maps reveals unexpected variation among well-characterized cancer cell lines. Co-expression analysis of data from human embryonic and induced pluripotent stem cells using AutoSOME identifies >3400 up-regulated genes associated with pluripotency, and indicates that a recently identified protein-protein interaction network characterizing pluripotency was underestimated by a factor of four.
By effectively extracting important information from high-dimensional microarray data without prior knowledge or the need for data filtration, AutoSOME can yield systems-level insights from whole genome microarray expression studies. Due to its generality, this new method should also have practical utility for a variety of data-intensive applications, including the analysis of results from deep sequencing experiments. AutoSOME is available for download at http://jimcooperlab.mcdb.ucsb.edu/autosome.
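The abstract does not spell out how the ensemble of self-organizing map runs is merged, so the following is only a minimal sketch of a generic consensus (co-association) clustering step, not AutoSOME's actual procedure. It assumes each ensemble member has already assigned a cluster label to every item; the function names `co_association` and `consensus_clusters`, the toy `runs` data, and the 0.5 threshold are all hypothetical choices made for illustration. Pairs of items that co-cluster in most runs are linked in a graph, and the connected components of that graph become the clusters, so the number of clusters is never specified in advance.

```python
from itertools import combinations


def co_association(runs):
    """Return, for each item pair, the fraction of runs that co-cluster it."""
    n = len(runs[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in runs:
        for i, j in counts:
            # Label values are arbitrary per run; only equality matters.
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(runs) for pair, c in counts.items()}


def consensus_clusters(runs, threshold=0.5):
    """Connected components of the graph whose edges join item pairs
    co-clustered in more than `threshold` of the runs (assumed cutoff)."""
    n = len(runs[0])
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(runs).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy ensemble: three runs over five items (hypothetical data).
runs = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(runs))
print(co_association(runs)[(0, 2)], co_association(runs)[(2, 3)])
```

In this toy ensemble, item 2 groups with items 0 and 1 in two of three runs and with items 3 and 4 in the remaining run, so its co-association fractions give a simple picture of a "fuzzy" member sitting between two otherwise discrete clusters.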