Optimal dimensionality selection for independent component analysis of transcriptomic data.

Affiliation

University of California San Diego, San Diego, USA.

Publication information

BMC Bioinformatics. 2021 Dec 8;22(1):584. doi: 10.1186/s12859-021-04497-7.

DOI: 10.1186/s12859-021-04497-7

PMID: 34879815

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8653613/

Abstract

BACKGROUND

Independent component analysis is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, independent component analysis effectively reveals both the source signals of the transcriptome as co-regulated gene sets, and the activity levels of the underlying regulators across diverse experimental conditions. Two major variables that affect the final gene sets are the diversity of the expression profiles contained in the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question.
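The decomposition described above can be written compactly as a matrix factorization. The symbols below are assumptions introduced for illustration, since the abstract itself does not fix a notation: $X$ is the expression matrix with $g$ genes and $s$ samples, $M$ holds the gene weights of each independent component (the co-regulated gene sets), $A$ holds the activity of each component across conditions, and $k$ is the user-defined dimensionality.

```latex
X \approx M A, \qquad
X \in \mathbb{R}^{g \times s},\;
M \in \mathbb{R}^{g \times k},\;
A \in \mathbb{R}^{k \times s}
```

In this reading, choosing the dimensionality means choosing $k$, the number of columns of $M$ and rows of $A$.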

METHODS

We computed independent components across a range of dimensionalities for four gene expression datasets with varying dimensions (both in terms of number of genes and number of samples). We computed the correlation between independent components across different dimensionalities to understand how the overall structure evolves as the number of user-defined components increases. We then measured how well the resulting gene clusters reflected known regulatory mechanisms, and developed a set of metrics to assess the accuracy of the decomposition at a given dimension.
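The comparison of components across dimensionalities can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each independent component is a vector of gene weights and that components are matched by absolute Pearson correlation. The function names `pearson` and `best_matches` and the toy vectors are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length weight vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # correlation is undefined for a constant vector; treat as 0
    return cov / sqrt(var_u * var_v)


def best_matches(components_low, components_high):
    """For each component from the lower dimensionality, return a tuple
    (index of its best match at the higher dimensionality, absolute correlation).

    The absolute value is used because the sign of an independent
    component is arbitrary.
    """
    matches = []
    for comp in components_low:
        scores = [abs(pearson(comp, other)) for other in components_high]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy gene-weight vectors (hypothetical): 2 components vs. 3 components, 5 genes.
low = [[0.9, 0.8, 0.0, 0.1, 0.0],
       [0.0, 0.1, 0.7, 0.9, 0.0]]
high = [[0.0, 0.0, 0.8, 0.9, 0.1],
        [-0.9, -0.7, 0.0, 0.0, 0.1],
        [0.0, 0.0, 0.0, 0.1, 1.0]]

matches = best_matches(low, high)
```

Here both low-dimensional components find a strongly correlated counterpart at the higher dimensionality, while the third high-dimensional component (weight concentrated on one gene) has no counterpart, which is the kind of structural change the comparison is meant to expose.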

RESULTS

We found that over-decomposition results in many independent components dominated by a single gene, whereas under-decomposition results in independent components that poorly capture the known regulatory structure. From these results, we developed a new method, called OptICA, for finding the optimal dimensionality that controls for both over- and under-decomposition. Specifically, OptICA selects the highest dimension that produces a low number of components that are dominated by a single gene. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes.
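The selection rule stated above (take the highest dimension that still yields few single-gene components) can be sketched in a few lines. This is a simplified illustration based only on the abstract, not the published OptICA procedure: the dominance criterion, the `dominance` and `max_single_gene` thresholds, and the function names are all assumptions.

```python
def is_single_gene(weights, dominance=0.9):
    """Flag a component as 'single-gene' when one gene carries at least
    `dominance` of the total squared weight (assumed criterion)."""
    squared = [w * w for w in weights]
    total = sum(squared)
    return total > 0 and max(squared) / total >= dominance


def select_dimension(decompositions, max_single_gene=1):
    """Return the highest dimensionality whose decomposition contains at most
    `max_single_gene` single-gene components, or None if none qualifies.

    `decompositions` maps dimensionality -> list of gene-weight vectors.
    """
    acceptable = [
        dim for dim, comps in decompositions.items()
        if sum(is_single_gene(c) for c in comps) <= max_single_gene
    ]
    return max(acceptable) if acceptable else None


# Toy decompositions over 4 genes (hypothetical values).
toy = {
    2: [[0.7, 0.7, 0.0, 0.1],
        [0.0, 0.1, 0.7, 0.7]],
    3: [[0.7, 0.7, 0.0, 0.0],
        [0.0, 0.0, 0.7, 0.7],
        [0.0, 0.05, 0.0, 1.0]],
    4: [[1.0, 0.1, 0.0, 0.0],
        [0.0, 1.0, 0.1, 0.0],
        [0.0, 0.0, 0.7, 0.7],
        [0.05, 0.0, 0.0, 1.0]],
}

chosen = select_dimension(toy)
```

In the toy data, dimension 4 splits one co-regulated pair into two single-gene components and is rejected, so the rule falls back to dimension 3. Tightening `max_single_gene` to 0 would select dimension 2 instead, which shows how sensitive such a rule is to the assumed threshold.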

CONCLUSIONS

OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets, resulting in the best representation of the organism's underlying transcriptional regulatory network.
