Determining the optimal number of independent components for reproducible transcriptomic data analysis.

Author information

Kairov Ulykbek, Cantini Laura, Greco Alessandro, Molkenov Askhat, Czerwinska Urszula, Barillot Emmanuel, Zinovyev Andrei

Affiliations

Laboratory of bioinformatics and computational systems biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Astana, Kazakhstan.

Institut Curie, PSL Research University, INSERM U900, Mines ParisTech, Paris, France.

Publication information

BMC Genomics. 2017 Sep 11;18(1):712. doi: 10.1186/s12864-017-4112-9.

Abstract

BACKGROUND

Independent Component Analysis (ICA) is a method that models gene expression data as an action of a set of statistically independent hidden factors. The output of ICA depends on a fundamental parameter: the number of components (factors) to compute. The optimal choice of this parameter, related to determining the effective data dimension, remains an open question in the application of blind source separation techniques to transcriptomic data.
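
The model described above can be written compactly. This is the standard linear ICA formulation in one common convention; the symbols $X$, $A$, $S$, $m$, $n$, and $k$ are notation assumed here for illustration and do not come from the abstract.

```latex
X \approx A S, \qquad
X \in \mathbb{R}^{m \times n},\;
A \in \mathbb{R}^{m \times k},\;
S \in \mathbb{R}^{k \times n}
```

Here $X$ is the expression matrix ($m$ samples by $n$ genes), the $k$ rows of $S$ are the hidden factors, assumed to be statistically independent, and $A$ holds the weight of each factor in each sample. The parameter under discussion is $k$, the number of components to compute.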

RESULTS

Here we address how to choose the number of statistically independent components in transcriptomic data analysis so that the components are reproducible across multiple runs of ICA (at the same or at varying effective dimensions) and across multiple independent datasets. To this end, we introduce a ranking of independent components based on their stability across multiple ICA runs and define a distinguished number of components (Most Stable Transcriptome Dimension, MSTD) corresponding to the point of qualitative change in the stability profile. Based on a large body of data, we demonstrate that a sufficient number of dimensions is required for biological interpretability of the ICA decomposition and that the most stable components, with ranks below the MSTD, are more likely to be reproduced in independent studies than the less stable ones. At the same time, we show that a transcriptomics dataset can be reduced to a relatively high number of dimensions without losing the interpretability of ICA, even though higher dimensions give rise to components driven by small gene sets.
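
The idea of ranking components by their stability across runs can be illustrated with a minimal sketch. It is not the authors' procedure: the score used here (the mean, over runs, of the best absolute Pearson correlation with any component of that run) is a simplified stand-in chosen for illustration, and the function names and toy data are hypothetical. The absolute value is taken because the sign of an independent component is arbitrary.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def stability_scores(reference, other_runs):
    """Score each reference component by its mean best |correlation| over the other runs."""
    scores = []
    for comp in reference:
        best = [max(abs(pearson(comp, c)) for c in run) for run in other_runs]
        scores.append(sum(best) / len(best))
    return scores


def rank_by_stability(reference, other_runs):
    """Return (component index, score) pairs, most stable first."""
    scores = stability_scores(reference, other_runs)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


def make_toy_runs(n_genes=200, n_runs=4, seed=0):
    """Synthetic stand-in for repeated ICA runs (hypothetical data, not real output)."""
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0, 1) for _ in range(n_genes)]

    reference = [noise() for _ in range(3)]
    runs = []
    for r in range(n_runs):
        sign = -1.0 if r % 2 else 1.0
        run = [
            # component 0: recovered in every run, up to sign and small noise
            [sign * x + 0.05 * rng.gauss(0, 1) for x in reference[0]],
            # component 1: recovered only in even-numbered runs
            list(reference[1]) if r % 2 == 0 else noise(),
            # component 2: never recovered
            noise(),
        ]
        runs.append(run)
    return reference, runs


reference, runs = make_toy_runs()
ranking = rank_by_stability(reference, runs)
for index, score in ranking:
    print(f"component {index}: stability {score:.2f}")
```

In this toy setup the component that reappears in every run ranks first, the one that reappears in half of the runs ranks second, and the one that never reappears ranks last.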

CONCLUSIONS

We suggest a protocol for applying ICA to transcriptomics data that allows components to be prioritized by their reproducibility, which strengthens the biological interpretation. Computing too few components (far fewer than the MSTD) is not optimal for interpretability of the results. Components ranked within the MSTD range are more likely to be reproduced in independent studies.
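
One way to picture "the point of qualitative change" in a stability profile, and the prioritization of components ranked within it, is a two-segment line fit. The sketch below is a generic breakpoint heuristic assumed for illustration, not the MSTD estimation procedure from the paper; the function names and the profile values are hypothetical.

```python
def fit_line_sse(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def estimate_breakpoint(profile, min_seg=2):
    """Number of leading components before the best two-line split of the profile.

    `profile` holds stability values sorted from most to least stable.
    """
    ranks = list(range(1, len(profile) + 1))
    best_k, best_sse = None, float("inf")
    for k in range(min_seg, len(profile) - min_seg + 1):
        sse = fit_line_sse(ranks[:k], profile[:k]) + fit_line_sse(ranks[k:], profile[k:])
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


# Hypothetical profile: a slowly declining plateau followed by a steep drop.
profile = [0.98, 0.97, 0.96, 0.95, 0.94, 0.93,
           0.70, 0.62, 0.54, 0.46, 0.38, 0.30]
k = estimate_breakpoint(profile)
prioritized = list(range(1, k + 1))  # ranks of the components to prioritize
print(k, prioritized)
```

Real stability profiles are noisier than this toy one, so a breakpoint found this way should be treated as a rough guide and checked visually.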

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ec80/5594474/87e50851b88d/12864_2017_4112_Fig1_HTML.jpg
