Causal Discovery in High-dimensional, Multicollinear Datasets.

Authors

Jia Minxue, Yuan Daniel Y, Lovelace Tyler C, Hu Mengying, Benos Panayiotis V

Affiliations

Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.

Joint CMU-Pitt PhD Program in Computational Biology, Pittsburgh, PA, USA.

Publication information

Front Epidemiol. 2022;2. doi: 10.3389/fepid.2022.899655. Epub 2022 Sep 13.

DOI:10.3389/fepid.2022.899655

PMID:36778756

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9910507/

Abstract

As the cost of high-throughput genomic sequencing technology declines, its application in clinical research becomes increasingly popular. The collected datasets often contain tens or hundreds of thousands of biological features that need to be mined to extract meaningful information. One area of particular interest is discovering underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and expanded to infer such relationships. However, these algorithms suffer from the curse of dimensionality and multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been demonstrated to successfully infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high dimensionality and collinearity problems, inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had Covid-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known Covid-19 related biological pathways.
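The idea in the abstract, replacing many collinear observed variables with a few latent factors before running causal discovery, can be illustrated with a minimal sketch. It is not the authors' method: truncated SVD (an orthogonal factorization) stands in here for the non-orthogonal empirical Bayes matrix factorization the abstract refers to, the data are simulated, and the function names `simulate_collinear` and `latent_scores` are hypothetical. The sketch shows only that the factor scores are far better conditioned than the raw features, which is the property a downstream conditional-independence test would rely on.

```python
import numpy as np


def simulate_collinear(n_samples=200, n_factors=3, n_features=60,
                       noise=0.1, seed=0):
    """Simulate features that are noisy linear mixes of a few hidden factors.

    Because 60 features share only 3 sources of variation, the features are
    strongly collinear. All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_features))
    X = factors @ loadings + noise * rng.normal(size=(n_samples, n_features))
    return X, factors


def latent_scores(X, k):
    """Return k latent factor scores via truncated SVD of the centered data.

    Stand-in only: SVD yields orthogonal factors, unlike the non-orthogonal
    empirical Bayes factorization described in the abstract.
    """
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


X, true_factors = simulate_collinear()
Z = latent_scores(X, k=3)

# Condition number of the correlation matrix: large values signal collinearity.
cond_features = np.linalg.cond(np.corrcoef(X, rowvar=False))
cond_factors = np.linalg.cond(np.corrcoef(Z, rowvar=False))

print(f"features: {X.shape[1]} columns, condition number {cond_features:.1f}")
print(f"factors:  {Z.shape[1]} columns, condition number {cond_factors:.1f}")
```

In a full pipeline, a causal discovery algorithm would then take `Z` together with the clinical variables as input instead of the raw feature matrix `X`; that step is omitted here because the abstract does not specify which algorithm or implementation is used.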
