Causal Discovery in High-dimensional, Multicollinear Datasets.

Authors

Jia Minxue, Yuan Daniel Y, Lovelace Tyler C, Hu Mengying, Benos Panayiotis V

Affiliations

Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.

Joint CMU-Pitt PhD Program in Computational Biology, Pittsburgh, PA, USA.

Publication information

Front Epidemiol. 2022;2. doi: 10.3389/fepid.2022.899655. Epub 2022 Sep 13.

Abstract

As the cost of high-throughput genomic sequencing technology declines, its application in clinical research becomes increasingly popular. The collected datasets often contain tens or hundreds of thousands of biological features that need to be mined to extract meaningful information. One area of particular interest is discovering underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and expanded to infer such relationships. However, these algorithms suffer from the curse of dimensionality and multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been demonstrated to successfully infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high dimensionality and collinearity problems, inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had Covid-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known Covid-19 related biological pathways.
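The collinearity argument in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method: it swaps in a plain truncated SVD (which yields orthogonal factors) for the non-orthogonal empirical Bayes matrix factorization the abstract describes, and it omits the downstream causal discovery step. All function names, sizes, and the noise level are hypothetical choices made for illustration. It only shows that a few latent factor scores have a far better conditioned correlation matrix than the many collinear features they summarize.

```python
import numpy as np


def simulate(n_samples=200, n_features=50, n_factors=3, noise=0.1, seed=0):
    """Generate features driven by a few latent factors (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_features))
    return scores @ loadings + noise * rng.normal(size=(n_samples, n_features))


def factor_scores(X, n_factors):
    """Truncated-SVD factor scores; a stand-in for the paper's factorization."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_factors] * s[:n_factors]


def correlation_condition_number(M):
    """Condition number of the column correlation matrix (large = collinear)."""
    return np.linalg.cond(np.corrcoef(M, rowvar=False))


X = simulate()
Z = factor_scores(X, n_factors=3)

cond_features = correlation_condition_number(X)
cond_factors = correlation_condition_number(Z)

print(f"condition number, raw features:   {cond_features:.1f}")
print(f"condition number, latent factors: {cond_factors:.6f}")
```

In this setup the factor scores would then be the variables passed to a causal discovery algorithm in place of the raw features; how well that works with a non-orthogonal factorization on real data is what the paper evaluates.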

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b246/10910893/9813ed03612d/fepid-02-899655-g0001.jpg
