Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
Department of Genetics, Stanford University, Stanford, CA, 94305, USA.
BMC Bioinformatics. 2018 Sep 17;19(1):327. doi: 10.1186/s12859-018-2338-4.
Analyzing the human transcriptome is crucial to advancing precision medicine, and the more than half a million human microarray samples in the Gene Expression Omnibus (GEO) have enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data are inherently noisy and high-dimensional. Gene set analysis is widely used to alleviate the problem of high dimensionality, but the user-defined choice of gene sets can introduce bias into the results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies and demonstrate (1) their use as features for sample classification and their robustness when training sets are small, (2) their regularization of data when clustering samples, and (3) the biological relevance of differentially expressed features.
We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The FC-space models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data were clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
The 139 FCs provide a biologically relevant summarization of transcriptomic data, and their performance in low-sample settings suggests that they should be employed in such studies to use the data efficiently.
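As a rough illustration of what expressing samples in a module space can look like, the sketch below projects a gene-space expression matrix onto a fixed set of module loadings by ordinary least squares. Everything here is an assumption made for illustration: the function name `project_to_module_space`, the randomly generated `modules` matrix (a stand-in for real ICA-derived loadings), the synthetic expression data, and the choice of least-squares projection itself, which is not necessarily the procedure used in the paper.

```python
import numpy as np


def project_to_module_space(expression, modules):
    """Express samples as coefficients over a fixed set of modules.

    expression: array of shape (n_samples, n_genes)
    modules:    array of shape (n_modules, n_genes), one loading vector per module
    Returns an array of shape (n_samples, n_modules).

    Hypothetical helper: least-squares projection is an illustrative choice,
    not a method confirmed by the source.
    """
    # Solve modules.T @ coeffs ~= expression.T for coeffs in the least-squares sense.
    coeffs, *_ = np.linalg.lstsq(modules.T, expression.T, rcond=None)
    return coeffs.T


# Synthetic stand-in data (not real GEO data).
rng = np.random.default_rng(0)
n_samples, n_genes, n_modules = 20, 500, 5

# Heavy-tailed random loadings as a placeholder for ICA-derived modules.
modules = rng.laplace(size=(n_modules, n_genes))

# Simulate samples as noisy mixtures of the modules.
true_activity = rng.normal(size=(n_samples, n_modules))
noise = 0.1 * rng.normal(size=(n_samples, n_genes))
expression = true_activity @ modules + noise

# Each sample is now described by 5 module activities instead of 500 genes.
features = project_to_module_space(expression, modules)
print(features.shape)  # (20, 5)
```

The resulting `features` matrix has one column per module, so a classifier or clustering routine trained on it works with far fewer dimensions than one trained on the gene-level matrix.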