Schachtner R, Lutter D, Knollmüller P, Tomé A M, Theis F J, Schmitz G, Stetter M, Vilda P Gómez, Lang E W
CIML/Biophysics, University of Regensburg, D-93040 Regensburg, Germany.
Bioinformatics. 2008 Aug 1;24(15):1688-97. doi: 10.1093/bioinformatics/btn245. Epub 2008 Jun 5.
Modern machine learning methods based on matrix decomposition, such as independent component analysis (ICA) and non-negative matrix factorization (NMF), provide efficient tools that are currently being explored for the analysis of gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF), which are considered indicative of underlying regulatory processes. The extracted features can also be used to classify gene expression datasets, either by grouping samples into categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks.
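As a rough illustration of how a matrix factorization yields metagenes, the following sketch factors a non-negative expression matrix with plain multiplicative updates. It is not the sparse NMF variant applied in the study. The function name `nmf`, the genes-by-samples orientation, and the toy data are all assumptions made for this example.

```python
import numpy as np


def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix X (genes x samples) as W @ H.

    Uses plain multiplicative updates for the squared (Frobenius) error.
    No sparsity constraint is imposed, unlike sparse NMF.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, k)) + eps    # columns of W play the role of metagenes
    H = rng.random((k, n_samples)) + eps  # rows of H weight each metagene per sample
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not increase the error.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy data: 50 "genes", 8 "samples", built from 2 hidden components.
rng = np.random.default_rng(1)
X = rng.random((50, 2)) @ rng.random((2, 8))

W, H = nmf(X, k=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this orientation, the genes with the largest entries in a column of `W` would be the candidates for a marker gene set, and the corresponding row of `H` shows how strongly each sample expresses that metagene.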
In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets that monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools can identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories: either monocytes versus macrophages, or healthy individuals versus Niemann-Pick C disease patients.
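The leave-one-out check on a marker gene set can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier, the function name `loo_nearest_centroid`, and the toy expression values are assumptions for this example and are not taken from the study.

```python
import numpy as np


def loo_nearest_centroid(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    X: samples x marker genes (expression values), y: integer class labels.
    Each sample is held out once, class centroids are computed from the
    remaining samples, and the held-out sample is assigned to the closest one.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i  # mask that excludes the held-out sample
        classes = np.unique(y[train])
        centroids = np.array([X[train & (y == c)].mean(axis=0) for c in classes])
        # The distance uses all marker genes jointly, not one gene at a time.
        dists = np.linalg.norm(centroids - X[i], axis=1)
        correct += int(classes[np.argmin(dists)] == y[i])
    return correct / n


# Hypothetical toy data: 8 samples, 3 marker genes, two well-separated classes.
X = [
    [1.0, 0.9, 1.1], [1.2, 1.0, 0.8], [0.9, 1.1, 1.0], [1.1, 0.8, 1.2],
    [3.0, 3.1, 2.9], [2.8, 3.0, 3.2], [3.1, 2.9, 3.0], [2.9, 3.2, 2.8],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy = loo_nearest_centroid(X, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Holding out one sample at a time suits datasets with few samples, which is the usual situation for microarray experiments.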