Kairov Ulykbek, Cantini Laura, Greco Alessandro, Molkenov Askhat, Czerwinska Urszula, Barillot Emmanuel, Zinovyev Andrei
Laboratory of bioinformatics and computational systems biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Astana, Kazakhstan.
Institut Curie, PSL Research University, INSERM U900, Mines ParisTech, Paris, France.
BMC Genomics. 2017 Sep 11;18(1):712. doi: 10.1186/s12864-017-4112-9.
Independent Component Analysis (ICA) is a method that models gene expression data as the action of a set of statistically independent hidden factors. The output of ICA depends on a fundamental parameter: the number of components (factors) to compute. The optimal choice of this parameter, which is related to determining the effective data dimension, remains an open question in the application of blind source separation techniques to transcriptomic data.
Here we address the question of optimizing the number of statistically independent components in the analysis of transcriptomic data so that the components are reproducible across multiple runs of ICA (within the same or within varying effective dimensions) and across multiple independent datasets. To this end, we introduce a ranking of independent components based on their stability over multiple ICA runs and define a distinguished number of components (Most Stable Transcriptome Dimension, MSTD) corresponding to the point of qualitative change in the stability profile. Based on a large body of data, we demonstrate that a sufficient number of dimensions is required for biological interpretability of the ICA decomposition, and that the most stable components, those with ranks below the MSTD, are more likely to be reproduced in independent studies than the less stable ones. At the same time, we show that a transcriptomic dataset can be reduced to a relatively high number of dimensions without losing the interpretability of ICA, even though higher dimensions give rise to components driven by small gene sets.
We suggest a protocol for applying ICA to transcriptomic data that allows components to be prioritized by their reproducibility, which strengthens the biological interpretation. Computing too few components (far fewer than the MSTD) is not optimal for the interpretability of the results. Components ranked within the MSTD range are more likely to be reproduced in independent studies.
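The idea of ranking components by their stability over repeated ICA runs can be illustrated with a minimal Python sketch. It is a simplified stand-in and not the procedure used in the study: the function names (`abs_correlation`, `rank_by_stability`, `make_run`), the best-match correlation score, and the synthetic Laplace-distributed "components" are all assumptions made for illustration. Each run is represented as a matrix of components by genes; a component from the first run is scored by how well it is matched, on average, in the other runs, ignoring sign and order, which ICA leaves undetermined.

```python
import numpy as np


def abs_correlation(a, b):
    """Absolute Pearson correlation between every row of a and every row of b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.abs(a @ b.T)


def rank_by_stability(runs):
    """Rank the components of the first run by a simplified stability score.

    runs: list of arrays of shape (n_components, n_genes), one per ICA run.
    Returns (order, stability): indices of the first run's components sorted
    from most to least stable, and the matching scores. The score is the mean,
    over the other runs, of the best absolute correlation found in each run.
    """
    reference, others = runs[0], runs[1:]
    best_matches = [abs_correlation(reference, r).max(axis=1) for r in others]
    stability = np.mean(best_matches, axis=0)
    order = np.argsort(-stability)
    return order, stability[order]


def make_run(rng, stable, n_unstable, shuffle):
    """Hypothetical ICA output: stable components (random sign) plus run-specific ones."""
    signs = rng.choice([-1.0, 1.0], size=(stable.shape[0], 1))
    unstable = rng.laplace(size=(n_unstable, stable.shape[1]))
    comps = np.vstack([stable * signs, unstable])
    return comps[rng.permutation(len(comps))] if shuffle else comps


# Synthetic demonstration: 3 components recur in every run, 2 do not.
rng = np.random.default_rng(0)
stable = rng.laplace(size=(3, 500))
runs = [make_run(rng, stable, n_unstable=2, shuffle=(i > 0)) for i in range(5)]

order, stability = rank_by_stability(runs)
print("components ranked by stability:", order)
print("stability profile:", np.round(stability, 3))
```

In this toy setting the recurring components receive scores close to 1 and the run-specific ones score much lower, so the sorted profile shows a sharp drop. With real data the component matrices would come from repeated ICA runs on the expression matrix, and an MSTD-like value would be sought by repeating the ranking for different numbers of components and looking for the point where the stability profile changes qualitatively.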